
A Survey of Multimodal Perception Methods for Human–Robot Interaction in Social Environments

Published: 16 October 2024

Abstract

Human–robot interaction (HRI) in human social environments (HSEs) poses unique challenges for robot perception systems, which must combine asynchronous, heterogeneous data streams in real time. Multimodal perception systems are well-suited for HRI in HSEs and can provide richer, more robust interaction for robots operating among humans. In this article, we provide an overview of multimodal perception systems being used in HSEs, which is intended to be an introduction to the topic and summary of relevant trends, techniques, resources, challenges, and terminology. We surveyed 15 peer-reviewed robotics and HRI publications over the past 10+ years, providing details about the data acquisition, processing, and fusion techniques used in 65 multimodal perception systems across various HRI domains. Our survey provides information about hardware, software, datasets, and methods currently available for HRI perception research, as well as how these perception systems are being applied in HSEs. Based on the survey, we summarize trends, challenges, and limitations of multimodal human perception systems for robots, then identify resources for researchers and developers and propose future research areas to advance the field.

1 Introduction

Robots and intelligent systems are increasingly deployed in dynamic human social environments (HSEs) [82, 177]. Novel service and social robots are now a common sight in homes, hospitals, schools, airports, museums, and urban operating domains. Even manufacturing robots, which traditionally operate separately from humans, now work alongside humans in corobotic teams. In these domains, robots are being developed to address critical labor shortages and perform repetitive tasks, which has the potential to reshape many aspects of the modern world. However, operation with and among humans in HSEs is complex and challenging. Developing natural, situationally appropriate interaction behaviors for robots and intelligent systems in social environments is a perennial research challenge and places unique demands and limitations on the system’s design and usage.
In this survey, we explore the current state of perceptual systems used for human–robot interaction (HRI) in HSEs. Specifically, we focus on multimodal perceptual interfaces. Multimodal perception (MMP) is the acquisition, processing, and fusion of unimodal data streams to infer quantities of interest. A perceptual interface [157] allows humans to convey information using intuitive methods, without the need for dedicated interface hardware. An MMP interface typically (but not exclusively) combines information from modalities such as pose, gaze, gesture, or speech using only a system’s onboard sensors like cameras or microphones.
The ability to observe human communication, status, and/or intent is a critical requirement of HRI and human–robot teaming applications in HSEs. Social robots, service robots, assistive robots, smart homes, virtual assistants, and some collaborative robots (cobots) must perceive humans to accomplish their respective tasks. Furthermore, systems that can effectively perceive and respond to various verbal and nonverbal cues achieve a range of improved interaction outcomes. Adding to the challenge, it generally cannot be assumed that humans in many social environments—such as pedestrians, patrons, hotel guests, and students—will be willing and able to carry or use robot interface hardware. Ultimately, the task of MMP HRI systems in HSEs is to gain as much information about nearby humans as possible, using only onboard sensors and any additional connected resources.
In this survey, we reviewed various robotics and intelligent systems applications that employ MMP techniques. Our specific contribution is a survey of the available options for data acquisition, processing, and fusion for HRI in HSEs using MMP interfaces. The remainder of this manuscript is organized as follows: Section 2 identifies characteristics and challenges of HRI in HSEs, defines key vocabulary, and identifies specific applications that employ MMP interfaces; Section 3 provides information on the survey method, survey results, and details of surveyed MMP systems; in Section 4, we summarize research trends in HRI and MMP systems and identify opportunities to advance HRI research in HSEs using multimodal systems.

2 Characterizing HRI in HSEs

In this section, we define key characteristics and challenges unique to HRI in HSEs and identify the scope of our survey. Following Riek et al., we define an HSE as any environment in which a robot operates proximately with humans [132]. Broadly, HSEs are characterized by limited, unknown, or changing environmental structure, where the same physical space is likely to be used for many different activities by any number of people [117]. Example HSEs include public settings like malls, libraries, museums, schools, city streets, and offices, and private settings like residences. For a thorough but dated overview of HRI challenges in social environments, we recommend Goodrich and Schultz’s 2007 review [60], while a briefer and more recent summary of HRI challenges can be found in [22]. HRI in HSEs is characterized by one or more of the following:
Multimodal: Human intent, status, or communication may require information from multiple modalities to be correctly interpreted [180].
Multiparty: Multiple humans may be present near the robot, and they may interact with each other or the robot. Multiparty interaction features phenomena such as interruption, turn-taking, entering/exiting the field of regard, and engagement [19, 20, 21, 66, 180, 182, 188].
Mixed-Initiative: The robot’s actions may be influenced by multiple, potentially conflicting, sources [53, 78, 88]. Jiang and Arkin [78] define an initiative as any source that specifies an “element of the mission \(\ldots\) from low-level motion control of the robot to high-level specification of mission goal.” For example, a robot that blends control commands from a human and its onboard obstacle detection algorithm is a mixed-initiative system (a minimal blending sketch follows this list).
Open: The number of agents that the robot interacts with is not known prior to the interaction and may change during the interaction [19].
Situated or Context-based: Interaction between agents is situationally dependent and is shaped by the current status of the environment and elements within it [6, 7, 22, 93].
Dynamic: The robot, humans, or the environment change their state throughout the course of the interaction.
Perceptual: Direct physical contact cannot be assumed and the use of dedicated interface hardware or wired sensors may not be feasible. Human interaction must be performed primarily with a perceptual interface consisting of the robot’s onboard sensors such as cameras, microphones, or lasers. For example, the robot may interact with pedestrians, bystanders, or children who do not carry specialized hardware to interface with the robot [157].
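To make the mixed-initiative notion above concrete, the following minimal Python sketch blends a human teleoperation velocity command with a velocity command from an onboard obstacle-avoidance initiative; the proximity-based weighting, command format, and parameter values are illustrative assumptions rather than a method from any surveyed system.

import numpy as np

def blend_commands(human_cmd, avoid_cmd, obstacle_proximity, alpha_max=0.8):
    """Blend a human teleoperation velocity with an obstacle-avoidance velocity.
    human_cmd, avoid_cmd: np.ndarray [vx, vy, yaw_rate] velocity commands.
    obstacle_proximity: 0.0 (clear) to 1.0 (imminent collision); the avoidance
    initiative gains authority as proximity increases."""
    alpha = alpha_max * np.clip(obstacle_proximity, 0.0, 1.0)
    return (1.0 - alpha) * human_cmd + alpha * avoid_cmd

# Example: the human drives forward; the avoidance initiative steers left near an obstacle.
human = np.array([0.5, 0.0, 0.0])
avoid = np.array([0.1, 0.0, 0.6])
print(blend_commands(human, avoid, obstacle_proximity=0.7))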
Additionally, HRI applications may be complicated by limited onboard resources of embedded systems or unfavorable operating conditions, such as high noise or reduced visibility.
Rather than comprehensively survey each aspect of HRI in HSEs, this survey focuses on the current state of multimodal perceptual interfaces for use in HSEs. Perception is a fundamental prerequisite for intelligent agents to interact with humans in HSEs; a robot cannot plan or execute intelligent interaction behaviors without first perceiving nearby humans. To this end, we review means of acquiring, processing, and fusing multiple modes of human perception data, as well as specific applications of MMP systems in HSEs. Although cognitive/planning architectures and action execution are also necessary components for HRI in HSEs, these are outside the scope of this survey.
Our primary goal for this survey is to document the current capabilities and uses of MMP HRI systems in HSEs, including specific techniques and tools available for HRI researchers. We intend for this to be a broad overview of MMP interfaces for intelligent social agents that is applicable to various emerging domains and provides the reader with information and resources for further investigation. Toward these goals, we surveyed any MMP system that was used to observe humans and did not require the human to carry or use specialized hardware. In other words, we sought to determine what an MMP system can learn about nearby humans using only system hardware. Specifics on the survey method and results are provided in the following section.

3 MMP Survey

In this section, we survey MMP applications across various domains and examine data acquisition, processing, and fusion techniques that enable these applications. Section 3.1 discusses the survey technique, and Section 3.2 provides an overview of the survey results. Section 3.3 discusses MMP HRI capabilities and applications, while Section 3.4 categorizes surveyed works by the type of information they infer. Audio and visual data acquisition and processing are discussed in Sections 3.5 and 3.6, respectively. Section 3.7 discusses other sensor modalities. Section 3.8 discusses multimodal datasets used in surveyed works, as well as public multimodal datasets available for MMP HRI development. Finally, Section 3.9 categorizes and reviews techniques for data fusion. The survey results are summarized in Tables 2 through 7. Note that while many of these MMP systems provide information to higher-level planning and behavioral control modules, or are used in longer-term learning or interaction studies, the focus of this survey is the perception function of these systems.

3.1 Survey Method

We conducted a review of peer-reviewed journal and conference literature using Google Scholar’s advanced search function. This survey does not include commercial HRI robotic systems, since their system designs are often proprietary and not publicly available. The advanced search function allows a custom search of specific publications, search phrases, and date ranges. Our search covered the 15 peer-reviewed publications listed in Table 1(a) and returned articles published from 2012 to 2023 that contained any of the exact phrases in Table 1(b) anywhere in the article.
Table 1.
(a) Publications Surveyed
ACM Transactions on Human Robot Interaction
ACM/IEEE International Conference on Human Robot Interaction
IEEE International Conference on Robotics and Automation
IEEE Robotics and Automation Letters
IEEE Transactions on Robotics
IEEE Workshop on Advanced Robotics and its Social Impacts
IEEE/ASME Transactions on Mechatronics
IEEE/RSJ International Conference on Intelligent Robots and Systems
International Conference on Social Robotics
The International Journal of Robotics Research
International Journal of Social Robotics
Robotics and Autonomous Systems
Robotics and Computer-Integrated Manufacturing
Robotics: Science and Systems
Science Robotics
(b) Search Phrases
Multimodal Detection
Multimodal Fusion
Multimodal Human Robot Interaction
Multimodal Interaction
Multimodal Interface
Multimodal Perception
Multiparty Interaction
Social Human Robot Interaction
Table 1. Publications Surveyed and Search Phrases
At the time of writing, this includes the 10 highest-impact robotics publications (as measured by h5-index), all publications with “human-robot interaction” in the title (2 total), and all publications with “social robotics” in the title (3 total). The chosen date range covers approximately 10 years prior to the time of writing; this timeframe was selected to cover a broad yet relatively recent range of HRI projects applicable to modern HRI research. Publications affiliated with the surveyed publications (such as conference workshops) were also surveyed. Additionally, if a survey result was a continuation of prior work, prior publications relating to the project were included.
For information on social robot perception systems before the timeframe of our survey, we recommend Haibin Yan’s 2014 survey, “A Survey on Perception Methods for Human–Robot Interaction in Social Robots” [173]. Like our survey, Yan’s survey includes details on perception hardware and algorithms employed on many contemporary social robots, such as Kismet, iCub, and Robovie.
To further focus the survey, we applied a set of inclusion and exclusion rules to the initial search results. To be included in the survey, each research work had to meet all of the following criteria:
The project must perform a perception function relevant to HRI
The perception system must combine data from at least two non-contact sensory modalities
The publication must include sufficiently detailed information about the perception system
Publications meeting any of the following criteria were excluded from the survey:
The project explores MMP techniques unrelated to HRI (e.g., object detection/classification, material classification, robot localization)
The perception system requires the human to carry or wear interface hardware (e.g., head-mounted displays, inertial measurement units (IMUs), electromyography, pulse oximetry, or brain–computer interfaces)
The project primarily explores multimodal robot-to-human communication (e.g., studies that explore the effects of fusing a robot’s speech, gaze, and/or gesture)
The study exclusively uses synthetic or simulated data
The multimodal system or fusion technique is designed for offline use
The study is a “Wizard-of-Oz” interaction study in which humans process robot sensory data instead of a perception system (as in [171])
The multimodal human observation relates to a specific medical procedure [89, 185]
Finally, the MMP systems of the included projects were analyzed, and details about each MMP HRI system are provided in the following section.

3.2 Survey Results and Taxonomy

The initial Google Scholar advanced search returned over 1,000 unique matches. Because the search terms were broad and the entire article was searched, there were many false positives in the initial results (e.g., some matches were due to citations and references). After subjecting each match to the inclusion and exclusion criteria in Section 3.1, the final survey includes a detailed analysis of 65 MMP systems spanning 82 publications. Results spanned many disciplines related to robotics and intelligent systems, including socially assistive robots (SARs), service robots, mobile robots in the wild, collaborative robots, human–child interaction, virtual agents, and smart spaces.
To further organize the survey results, we group them according to the data fusion technique used. For this purpose, we use the data fusion taxonomy of Baltrusaitis et al., who categorize fusion techniques as model-based or model-agnostic depending on how and when data streams are fused [9]. Model-based fusion approaches use machine learning (ML) models to combine different data modalities. This includes the use of kernels, graphical models, or neural networks to fuse data, as described below (a minimal neural-fusion sketch follows the list):
Multiple Kernel Fusion: An extension of kernel support vector machines (SVMs) that allows for the use of different kernels to fuse different modalities.
Graphical Model Fusion: The use of hidden Markov models (HMMs), Bayesian networks, conditional random fields (CRFs), or other graphical models to fuse multimodal data streams.
Neural Network Fusion: The use of long short-term memory (LSTM) networks, deep neural networks (DNNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), or Transformers to fuse multimodal data streams.
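As a minimal sketch of model-based (neural network) fusion, the following PyTorch snippet encodes two modalities with separate networks and lets a learned joint layer combine the embeddings; the feature dimensions (MFCC-style audio features and flattened skeletal keypoints) and class count are illustrative assumptions.

import torch
import torch.nn as nn

class NeuralFusion(nn.Module):
    """Encode each modality with its own network, then learn how to combine them."""
    def __init__(self, audio_dim=40, pose_dim=51, embed_dim=32, n_classes=4):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.pose_enc = nn.Sequential(nn.Linear(pose_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, n_classes)  # learned joint (fusion) layer

    def forward(self, audio_feats, pose_feats):
        joint = torch.cat([self.audio_enc(audio_feats), self.pose_enc(pose_feats)], dim=-1)
        return self.head(joint)

# Example with random stand-ins for a batch of audio and skeletal-pose features.
model = NeuralFusion()
logits = model(torch.randn(8, 40), torch.randn(8, 51))
print(logits.shape)  # torch.Size([8, 4])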
Model-agnostic fusion methods do not use ML models to combine multimodal data; rather, they fuse the raw data, extracted features, or the processed outputs from different modalities into a single data structure. Model-agnostic fusion techniques can be categorized as follows, depending on the phase at which data are fused [25, 128, 180] (a schematic sketch follows the list):
Early (data-level) fusion: Unprocessed data from different modalities are aligned and combined into a single data structure.
Intermediate (feature-level) fusion: Features are extracted from each heterogeneous data mode, then features are fused into a single data structure.
Late (decision-level) fusion: Each individual data mode is completely processed, and the resultant outputs of each mode are fused into a single data structure.
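The distinction between the three model-agnostic phases can be summarized in a short Python sketch; the feature extractor and classifier below are placeholders standing in for any unimodal pipeline, and the weights are arbitrary.

import numpy as np

def extract_features(x):            # placeholder, e.g., MFCCs or skeletal keypoints
    return x.mean(axis=0)

def classify(vec):                  # placeholder decision function
    return np.tanh(vec).mean()

def early_fusion(audio_raw, video_raw):
    # Align and combine unprocessed data into one structure, then process jointly.
    return classify(np.concatenate([audio_raw.ravel(), video_raw.ravel()]))

def intermediate_fusion(audio_raw, video_raw):
    # Extract features per modality first, then fuse the feature vectors.
    return classify(np.concatenate([extract_features(audio_raw), extract_features(video_raw)]))

def late_fusion(audio_raw, video_raw, w_audio=0.4, w_video=0.6):
    # Process each modality to completion, then combine the unimodal decisions.
    return w_audio * classify(extract_features(audio_raw)) + w_video * classify(extract_features(video_raw))

audio, video = np.random.rand(100, 13), np.random.rand(30, 2048)
print(early_fusion(audio, video), intermediate_fusion(audio, video), late_fusion(audio, video))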
Additional information on fusion techniques, with specific examples from surveyed works, can be found in Section 3.9. In summary, Tables 2–7 contain the survey results, organized according to the taxonomy in Figure 1.
Table 2.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Banerjee 2018 [11]Classifying human interruptability (awareness of robot and willingness to engage) on an indoor mobile robotRed–green–blue depth (RGB-D) camera (Microsoft Kinect One)Cascaded deep network face detection and gaze estimationGaze direction (at robot, left, right, down)Comparison of non-temporal (random forest, multilayer perceptron) and temporal (latent-dynamic CRFs) models to estimate interruptability
CPM skeletal pose estimation; joint vector and angle computationHuman joint angles
Irfan 2018 [71] and 2022 [72]Long-term open-set person recognition on Pepper and NAO robots in three real-world long-term HRI studiesRGB-D camera (embedded in Pepper); RGB camera (embedded in NAO)Person detection (NAOqi Recognition Module)Estimated heightIncremental Bayesian Network to estimate human identity based on face and soft biometrics
Face detection (NAOqi Recognition Module)Face recognition and similarity score, gender estimate, age estimate
Joosse 2017 [80] and Triebel 2016 [154]Passenger mobile service robot deployed in an international airport for SPENCER project2x 2D LiDAR (SICK LMS 500)LiDAR leg tracking; multi-hypothesis tracker; social network graphHuman locations and social groupingsSupervision system to process sensory data and guide interaction modes; includes MOMDP collaboration planner to estimate group intentions and adjust robot action plans from fused observations
4x RGB-D cameraDepth object segmentation; near-field normalized depth template torso detection; far-field HoG person detectionHuman locations
Custom head pose estimatorCoarse head pose direction (left, right, front, back)
Stereo cameraUnspecifiedUnspecified
D. Lu 2017 [102]Estimating human interest for KeJia guide robot in museum settingRGB-D camera (Microsoft Kinect)Face tracking (OpenFace); Emotion recognition (Microsoft Emotion Recognition API)Recognized emotionPOMDP-based multimodal dialog management framework
Face tracking (OpenFace); Demographic recognition (Microsoft Face API)Age and gender estimate
Directional microphoneSpeech API (iFlyTek)Recognized speech commands
Table 2. Survey of MMP Capabilities Using Graphical Model-Based Fusion
2D, two-dimensional; LiDAR, light detection and ranging; MOMDP, mixed observability Markov decision process; POMDP, partially observable Markov decision process.
Table 3.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Chen 2023 [29]Human pose estimation robust to rain, smoke, low lighting, and occlusion using ImmFusion networkmillimeter (mm)-Wave Radar Pointcloud (mmBody dataset)HRNet local image feature extraction, CNN global feature extractionLocal and Global image featuresConcatenation of image and mm-Wave pointcloud features; tiny transformer global integration module to fuse global features; fusion transformer module to estimate joint positions and reconstruct pose
mm-Wave Radar pointcloud (mmBody dataset)PointNet++ local pointcloud feature extraction; multilayer perceptron (MLP) global feature extractionLocal and global pointcloud features
Fung 2023 [48]Robust 3D Human detectionRGB-D data (various datasets)End-to-end MYOLOv4 network to extract, fuse, and classify multimodal features; consists of Multimodal Feature Extraction and Fusion (MFEF), path aggregation network (PAN), and YOLOv3 layers3D Location of humansMultimodal features extracted and fused in MFEF network layer
Li 2020 [95]Proposed emotion classification architectureMonochannel audioSpeech recognition and word embedding (Google Cloud speech-to-text)Lexical featuresBidirectional gated recurrent unit (BGRU) layers to estimate hidden acoustic-facial states; two attention layers + fully connected layer to align, fuse, and classify emotions
Acoustic feature extraction (openSMILE)Acoustic features
RGB cameraFacial feature extraction (eMax toolbox)Facial features
Linder 2020 [98]Human localization in dense environments, tested on real-world and synthetic datasetsRGB-D camera (Microsoft Kinect)Modified Darknet53 backend to concatenate RGB and Depth featuresFused RGB-D feature vectorModified YOLO v3 to compute 2D bounding boxes and 3D human centroid regression
G. Liu 2019 [100]Action recognition using datasetsRGB data (NTU and SYSU datasets)Skeleton attention and self-attention layersSpatial featuresConcatenation layer to fuse temporal and spatial features; L2, FC, and Softmax layers to classify action
Skeletal pose data (NTU and SYSU datasets)3x bidirectional LSTM layersTemporal features
Robinson 2023 [134]Robust human activities-of-daily-life (ADL) recognition of labeled datasetsRGB-D video (ETRI-Activity-3D dataset)ResNet RGB video backbone; CSPDarknet + PANet + YOLO object detectionVideo, pose, and object featuresSpatial mid-fusion module network to embed and fuse features; dense neural layer to classify feature embeddings as ADL
Skeletal pose data (ETRI-Activity-3D dataset)GCN and self-attention pose backbone
Yasar 2023 [178]Context-aware human and group motion prediction using datasetsRGB data (NTU RGB+D 60 and CMU Panoptic datasets)GRU encoderContextual scene featuresInteraction module to estimate inter-agent dynamics; multimodal context module to fuse context and agent dynamics features; motion decoder to predict motion for each human and group
Skeletal pose data (NTU RGB+D 60 and CMU Panoptic datasets)Skeletal pose data and motion encodersSpatio-temporal skeletal pose features for each human
Table 3. Survey of MMP Capabilities Using Neural Network Model-Based Fusion
CMU, Carnegie Mellon University; 3D, 3-dimensional; ETRI, Electronics and Telecommunications Research Institute; NTU, Nanyang Technological University; SYSU, Sun Yat-sen University Dataset.
Table 4.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Filippini 2021 [41]Emotion recognition during child—robot interaction with Mio AmicoRGB data (ELP 5MP Webcam)HoG face detector and regression tree facial landmark detection; Thermal feature extraction using FIR filter; MLP to classify thermal image emotional contentEstimated emotional stateFusion of thermal and RGB data through offline extrinsic calibration and online linear interpolation and pixel co-registration
Thermal image data (FLIR Lepton)
Table 4. Survey of MMP Capabilities Using Early Fusion
Table 5.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Ben-Youssef 2019 [17]Estimating engagement on Pepper robot in a public space and generating User Engagement in Spontaneous HRI (UE-HRI) dataset4-microphone array (embedded in Pepper)Audio feature computation (openSMILE)Recognized speechTemporal pooling of sliding temporal measurement windows to fuse asynchronous speech, face, and proxemic features; comparison of Logistic Regression, DNN, GRU, and LSTM classifiers
RGB imageGaze estimation, head pose, and facial action unit (FAU) computation (OpenFace 2.0)Gaze estimation, head pose, FAUs
SonarDistance estimation (NAOqi ALEngagementZones module)Distance to human
Benkaouar 2012 [18], Vaufreydaz 2016 [164]Detecting a human’s intention to engage with Robosoft Kompaï robot in notional domestic environment2D LiDARFeet detection using adaptive background subtraction and Kalman filteringPosition, velocity of feet; distance between feetUse of neutral feature values and last-known feature values; comparison of artificial neural network (ANN) and multiclass SVM to classify feature vector and estimate user engagement
4-microphone array (Microsoft Kinect)Speech activity detection and Sound source localization (SSL)Speech activity, speaker azimuth
RGB-D camera (Microsoft Kinect)Skeletal pose tracking (Kinect); Schegloff metric computation; distance computation; Haar feature-based cascade classifier facial detection (OpenCV)Schegloff stance and torque angle features, skeletal distance, face size, face location
Islam 2020 [74]Human activity recognition on three human activity datasetsRGB image data University of Texas (UT-Kinect and UTD-MHAD datasets)ResNet50 spatial feature encoder; LSTM temporal feature encoder; self-attention mechanismEmbedded RGB featuresCustom HAMLET multimodal attention-based RGB + pose feature fusion; fully connected network layer to classify activity embedded vector
Skeletal pose data (UT-Kinect and UTD-MHAD datasets)Spatial feature encoder; LSTM temporal feature encoder; self-attention mechanismEmbedded skeletal pose features
H. Liu 2018 [101]Command recognition in notional collaborative robot manufacturing task using unimodal datasetsLabeled mono speech command datasetMFCC computation and CNN speech feature extractionSpeech featuresConcatenation function to align and fuse multimodal features; MLP to classify one of six robot commands
LeapMotion LEAP datasetLSTM gesture feature extractionGesture features
Labeled RGB video datasetTransfer-learning trained MLP network feature extractionBody motion features
Shen 2021 [143]Estimating “big five” personality traits on a Pepper robot in laboratory settingRGB camera (embedded in Pepper)Motion estimation (Pepper SDK)Head motion features, body motion featuresLinear interpolation and truncation to fuse audio and visual features; HMM clustering and classification of feature vector to estimate personality traits
Monochannel microphone (embedded in Pepper)Speech recognition (Nuance); MFCC, pitch, and audio energy computationRecognized speech; audio features
Wilson 2022 [168]Detecting when a user needs assistance in a collaborative game taskRGB cameraGaze estimation (OpenFace)Eye gaze directionVector concatenation of gaze, speech, and semantic features; random forest classifier to estimate if assistance needed
Monochannel microphoneSpeech recognition (DeepSpeech); naïve Bayes classifier to estimate if assistance needed; linear degradation function to retain speech feature values between spoken segmentsRecognized speech; question/negation semantic features
Table 5. Survey of MMP Capabilities Using Intermediate Fusion
SDK, software development kit.
Table 6.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
Abioye 2018 [2] and 2022 [1]Speech and gesture commanding for simulated aerial robot motionMonochannel microphoneCMUSphinx automatic speech recognition (ASR)Recognized speech commandsRule-based approach to determine if unimodal commands are sequential or synchronous, and if information is emphatic or complementary
RGB cameraHaar feature-based cascade classifier hand detection, convex hull finger detection (OpenCV)Hand gestures
Ban 2018 [10]Multiple speaker tracking using dataset of various indoor acoustic settings6-Microphone array (Audio–Visual Diarization, AVDIAR, dataset)Direct-path relative transfer functionSpeaker locationBayesian estimator
RGB image data (AVDIAR dataset)Facial detectionFacial location
Bayram 2015 [12]Multiple speaker tracking and following on indoor mobile robot7-Microphone array (Microcone)Generalized eigenvalue decomposition multiple signal classification (GEVD-MUSIC)Speaker locationParticle filter
2x RGB cameras (Microsoft Kinect)Haar feature-based cascade classifier facial detectionFacial location
Belgiovine 2022 [14] and Gonzalez-Billandon 2020 and 2021 [57, 58]Human localization for recording voice and facial identity data from an iCub robot in a laboratory settingStereo camera (embedded in iCub)Deep network for facial localizationFace locationMultihuman tracking via Kalman Filtering and Hungarian algorithm for data association
Binaural microphones (embedded in iCub)Deep network for speaker localizationCoarse speaker location: to left, to right, in front
Bohus 2009 [20] and 2010 [21]Virtual agent personal assistant and trivia game host; for multiparty interaction research in an indoor social environmentRGB camera (AXIS 212)Face detection and trackingHuman locationScene analysis module to infer human intentions, engagement, and actions; fusion algorithm details unspecified
Facial pose estimationFocus of attention
Clothing RGB variance analysisOrganizational affiliation
4-Microphone linear arrayWindows 7 Speech recognizerRecognized speech
SSLSpeaker location
Chao 2013 [27]Detecting human activity and evaluating interruption behaviors in a mixed-initiative laboratory interactionRGB-D camera (Microsoft Kinect)Skeletal pose tracking; coarse gaze estimation; gesture activity detectionEstimate if human is gazing at robot; gesture activityControl architecture for the dynamics of embodied natural coordination and engagement system; uses timed Petri nets to handle asynchronous human inputs and select action
Monochannel microphonePitch computation (Pure Data module)Speaker pitch
Chau 2019 [28]Simultaneous localization of humans and robot platform8-Microphone array (TAMAGO-03)Generalized singular value decomposition-MUSIC and SNR estimationSpeaker direction-of-arrivalGaussian mixture probability hypothesis density filter
RGB cameraOpenPose keypoint extractionMultitarget human pose estimates
Chu 2014 [34]Determining desire to engage with Curi robot platform in a laboratory setting via contingency detection2-Microphone arraySound cue computationSound cue features for two audio channelsFinite state machine to determine when to process body and audio features; SVM contingency classifier to determine if human wishes to engage or not
RGB-D camera (Asus Xtion Pro Live)Skeletal pose tracking (OpenNI) and body motion cue computationMotion cues for seven joints
Churamani 2017 [35]Identity and speech recognition for personalized interaction with Nico robot in laboratory educational task2x RGB camerasHaar feature-based cascade classifier facial detection; face recognition using LBP histogramsHuman identity and face locationWeighted sum of voice and facial identities; dialog manager to control flow of interaction based on identity and speech
Monochannel audioSpeech recognition (Google Cloud speech-to-text); CNN human identificationHuman identity and recognized speech
Foster 2014 [44] and 2017 [47] and Pateraki 2013 [120]Estimating human interaction status/intentions with JAMES bartending robot in notional bartending environment4-Microphone array (Microsoft Kinect)SSL; ASR (Microsoft Speech API); lexical feature extraction (OpenCCG)Speaker azimuth, recognized speech, and lexical featuresComparison of binary regression, logistic regression, multinomial logistic regression, SVM, k-nearest neighbor (k-NN), decision tree, naïve Bayes, propositional rule learner to estimate bar patron state
RGB-D camera (Microsoft Kinect)Hand and face blob tracking; 2D silhouette and 3D shoulder fitting; least-squares RGB image matchingHuman head and hand location, 3D torso pose and head pose angles
2x Stereo cameras
Foster 2016 [45] and 2019 [46]Human localization, identification, intent recognition, and speech recognition in a large public mall on Pepper mall robot for MuMMER project4-Microphone array (embedded in Pepper)NN to perform simultaneous speech detection and localization; speech recognition (Google Cloud speech-to-text)Speaker location and recognized speechWeighted average of gaze and human distance to estimate if human is interested in interaction
RGB-D camera (Intel RealSense D435)CPM pose recognition; head pose estimation (OpenHeadPose); facial feature computation (OpenFace)Body pose, head pose, and facial features
Gebru 2018 [49]Multiple speaker localization and diarization using dataset of various indoor acoustic settings6-microphone array (AVDIAR dataset)Binaural feature extraction; SSL using trained modelSpeaker locationMAP Bayesian estimator to detect speakers; NN search to attribute speech to speaker
Voice activity detection (VAD)Speech activity
RGB image data (AVDIAR dataset)Visual trackingSpeaker head and torso locations
Glas 2013 [51] and 2017 [52]Sensor network + mobile Robovie robot in shopping mall to detect shopper identity and locations and issue personalized greetings8x 2D LiDARHuman tracking via particle filtering (ATRacker)2D human locationsGeometric model to estimate human height; NN data association to fuse identity with human location
2x RGB camerasFacial recognition (OKAO vision)Human identity
Gomez 2015 [54]Speech-to-speaker association using Hearbo robot in rooms of varying reverberationRGB-D camera (Microsoft Kinect v2)Depth and tracking, head position, and mouth activity detectionVisual azimuth to speaker; mouth activitySpeaker Resolution module which associates acoustic azimuth to valid, speaking visual azimuth
16-Microphone arrayMUSIC SSL (HARK)Acoustic azimuth to speaker
Ishi 2015 [73]Estimating human location and head orientation in laboratory environment2x 2D LiDAR (Hokuyo UTM-30L)Particle filterHuman locationComputation of sound source location from acoustic azimuth vectors. Fused with LiDAR location estimate if within threshold; location estimate used to compute head orientation vector
2X 16-microphone array, 2X 8-microphone arrayMUSIC SSLAcoustic azimuths
Jacob 2013 [75]Command recognition for a surgical scrub nurse assistant robot arm (FANUC LR Mate 200iC) in notional surgical taskRGB-D camera (Microsoft Kinect One)Custom fingertip locator, Kalman Filter smoother, feature extraction, and HMM gesture classifierRecognized gesturesState machine to perform assistive actions (e.g., pick up and pass specific surgical instruments) or enable/disable command modes
Monochannel microphoneSpeech recognition (CMUSphinx)Recognized speech commands
Jain 2020 [76]Detecting user engagement during educational game in a month-long, in-home studyRGB Camera (USB webcam)OpenPose# people in sceneGradient-boosted decision trees to classify engagement/disengagement
OpenFaceFace detection, eye gaze direction, head position, facial expression features
Monochannel audio (USB webcam)Audio feature extraction (Praat)Audio pitch, frequency, intensity, and harmonicity
Kardaris 2016 [83] and Zlatintsi 2017 [187] and 2020 [186]Verbal and gestural command recognition for elderly care robot evaluated on MOBOT multimodal dataset and I-Support assistive bathing robotRGB-D camera (Microsoft Kinect)Dense trajectory feature extraction and SVM classificationRecognized gesturesRanked selection of best available modality
Optical flow activity detection (OpenCV)Activity status
8-Channel MEMS microphone arrayBeamforming denoising; MFCC+\(\Delta\) feature extraction; N-Best grammar-based speech recognition (HTK toolkit)Recognized speech commands
Kollar 2012 [86]Human tracking, speech and gesture recognition in an indoor environment on the Roboceptionist and CoBot service robotsRGB-D Camera (Microsoft Kinect)Skeletal pose tracking; vector computations of skeletal keypointsRecognized gestures and proxemicsRule-based dialog manager
Android tabletAndroid speech recognition; probabilistic graph/naïve Bayes language modelRecognized speech commands
Komatsubara 2019 [87]Estimating child social status using a sensor network embedded in a classroomRGB-D camera (Microsoft Kinect)Head and shoulders detectionHuman locationNN association fuses identity and location; custom social feature extraction module; SVM with RBF classifies social status
6x RGB camera (Omron)Facial recognition (OKAO Vision)Human identity
Linder 2016 [97]Development of multiple human tracker for dense human environments, tested on real-world and synthetic datasets2x 2D LiDAR (SICK LMS 500)OpenCV random forest leg trackerLeg locationsComparison of NN, extended NN, multi-hypothesis, and minimum description length trackers
RGB-D camera (Asus Xtion Pro)Comparison of depth template-based, monocular HoG, and RGB-D HoG upper-body detectorsTorso locations
Linssen 2017 [99] and Theune 2017 [152]Development of R3D3 receptionist robot and pilot testing in day care center4-Microphone array (Microsoft Kinect)ASR using Kaldi deep networkRecognized speechRule-based dialog and action manager
RGB-D Camera (Microsoft Kinect)FaceReader software (Vicar Vision)Emotional state and demographics
Maniscalco 2022 [106] and 2024 [105]Identifying humans, estimating engagement on Pepper robot guide in campus and museum environments4-Microphone arrayRMS signal computation, Google Cloud speech-to-textRecognized speechFinite state automaton; specific sensory inputs activate state transitions for engagement and interaction (smach)
RGB imagePeople detection, face recognition, gaze estimation, and age/gender estimation (Pepper SDK)Human identity, location, gaze direction, and age/gender
SonarFIFO queue to stabilize distance measurementDistance to human
Martinson 2013 [108]Human identification using soft biometrics on Octavia social robot platform in a public interaction studyDepth/time-of-flight camera (Mesa Swissranger SR4000)Segmentation via connected components analysis; computation of 3D face position; height estimation via geometric modelHuman location and heightComputing similarities of soft biometrics with those of known humans; weighted sum of soft biometric similarities to determine identity
RGB camera (Point Grey FIREFLY)Facial detection (Pittsburgh Pattern Recognition SDK); color histogram computation of face and clothingFace and clothing color histograms
Martinson 2016 [109]Human detection in indoor office settingRGB-D cameraAlexnet CNN human detection on RGB image (Caffe)Human detection likelihoodWeighted sum of CNN and layered person detection likelihoods
Alexnet CNN human detection on depth image (Caffe)Human detection likelihood
Layered person detection: Segmentation, geometric feature computation, and GMM classification of depth clustersHuman detection likelihood
Mohamed 2021 [110]Generating whole person model from robot data and providing generic ROS HRI stack via ROS4HRI projectAudio dataVoice activity detection (VAD), feature extraction (openSMILE, HARK); SSL (HARK)Audio features, voice activity, and speaker locationBody/face matcher and person manager to fuse face, voice, and body information
RGB imageFace detection and pose estimation (OpenFace); expression and demographic detection (OpenVINO); face recognition (dlib)Expression, demographics, and identity
RGB-D imageSkeletal pose tracking (Kinect, OpenNI, OpenVINO) and human description generatorHuman skeletal pose description and recognized gestures
Nakamura 2011 [114]Human localization using Hearbo robot in indoor setting8-Microphone arrayGEVD-MUSIC with hGMM Sound Source IdentificationSpeaker direction-of-arrivalParticle filter
Thermal camera (Apiste FSV-1100)Thermal-Distance integrated localization model; binary thermal mask applied to clusters of depth points to localize human headsHuman location (3-dimensional)
Time-of-flight distance camera (Mesa SR4000)
Nieuwenhuisen 2013 [116]Human localization and importance detection on Robotinho museum tour guide robot2D LiDAR (Hokuyo URG-04LX)Leg and torso detectionBody locationsFace and body locations fused by Hungarian algorithm data association in a multi-hypothesis tracker; human importance in group estimated by relative distance and angle to robot
2x RGB camerasViola-Jones face detectorFace locations
Directional microphoneSmall-vocabulary speech recognition (Loquendo)Recognized commands
Paez 2022 [126]Emotion recognition on Baxter robot for improved robot mentorship in an indoor collaborative play settingMicrophoneSpeech recognition (unspecified)Recognized speechk-Means emotion classifier of fused body/head movement and facial emotion features
RGB-D camera (Microsoft Kinect)Body gesture and facial movement recognition (unspecified)Recognition of 22 body gestures and 8 Head movements
RGB-D camera (Intel RealSense SR300)Facial emotion recognition (Affdex SDK)Recognized emotion
Pereira 2019 [121]Observing human speech and gaze in joint human-robot game with Furhat robotMonochannel microphoneSpeech recognition (Microsoft cloud speech recognition)Recognized speechState machine to determine robot gaze target; state transitions triggered by prioritized human inputs
2x RGB-D cameraGaze tracking (GazeSense)Gaze direction
Object tracking (ARToolkit)Location of game objects
Portugal 2019 [123]Development of elderly care robot SocialRobot and deployment in elderly care centerHD RGB camera (Microsoft LifeCam Studio)Haar feature-based cascade classifier facial detection; PCA/Eigenface facial recognitionFace locations and identitiesCustomizable XML-based service engine based on human inputs
RGB-D camera (Asus Xtion Pro Live)Depth-augmented face localizationUpdated face location
2x microphone array (Asus Xtion Pro Live)Speech recognition (PocketSphinx); emotion and affect recognition (openEAR)Recognized speech; emotional state
Pourmehr 2017 [124]Mobile robot locating humans interested in interaction; indoor and outdoor2D LiDARLeg detectorHuman location occupancy gridWeighted sum of three unimodal occupancy grids
RGB-D camera (Microsoft Kinect)Torso detectorHuman location occupancy grid
4-Microphone array (Microsoft Kinect)MUSIC sound source localizationSpeech direction-of-arrival occupancy grid
Prado 2012 [125]Emotion recognition in indoor setting; used to synthesize emotional robot behaviorsRGB cameraHaar feature-based cascade classifier facial detection (OpenCV); PCA to extract FAUs; dynamic Bayesian network (DBN) classifierRecognized emotion from faceDBN classifier to fuse face and voice emotions
MicrophoneExtract pitch, duration, and volume of utterance (Praat); DBN classifierRecognized emotion from voice
Ragel 2022 [127]Interactive play with children in indoor setting using tabletop Haru robot2x RGB cameraFace detection (face_recognition and OpenCV), facial feature extraction (Microsoft Azure Face API), mask detection (TensorFlow network)Estimated gender, emotion, and if wearing facemaskSequential data fusion pipeline using caching, filtering, and fusion to combine asynchronous skeletal pose, hand, face, and speech data; incrementally updated as new data available
6-Microphone array (embedded in Haru) 8-microphone array (external)SSL and EMVDR noise filtering (HARK); speech recognition (Google Cloud speech-to-text)Recognized speech
RGB-D (Orbbec Astra in Haru); Azure Kinect and Kinect v2 (external)Hand detection (MediaPipe); skeletal pose tracking (Kinect)Hand and skeletal pose keypoints
Sanchez-Riera 2012 [139]Human command recognition using multimodal dataset for D-META grand challengeBinaural audio (RAVEL dataset)MFCC computation; HMM and SVM to classify MFCCsRecognized speech commandsWeighted sum of speech and gesture command classifications
Stereo RGB vision (RAVEL dataset)Scene flow and STIPs feature extraction; HMM and SVM to classify scene flow and STIPsRecognized gestures
Tan 2018 [149]Development of iSocioBot social robot platform and deployment in four public eventsRGB camera (Logitech HD Pro C920)LPQ facial recognitionHuman identityProbabilistic hypothesis testing to fuse face and sound source locations; weighted sum to fuse speaker and facial identities
Facial detection (OpenCV)Face location
4-Microphone array (Microsoft Kinect)Unspecified SSLSpeaker direction-of-arrival
Wireless handheld microphoneSpeech recognition (Google Cloud speech-to-text)Recognized speech
i-vector framework (ALIZE 3.0)Speaker identity
Terreran 2023 [150]Whole-body gesture and activity recognitionRGB-D camera (RGB channel)2D skeletal pose tracking (OpenPose)2D skeletal pose keypoints2D-to-3D projection and lifting fuses 2D keypoints to 3D pointcloud for 3D pose estimation; ensemble classifier predicts activity
RGB-D camera (depth channel)None3D pointcloud
Trick 2022 [153]Receiving human reinforcement training inputs for a manipulator arm action planner in a laboratory settingMicrophoneMFCC feature extraction and CNN keyword classification (Honk)Recognized keyword commandsBayesian independent opinion pool
RGB-D camera (Intel RealSense D435)Skeletal pose tracking (OpenPose) and SVM gesture classification (Scikit-learn)Recognized gestures
Tsiami 2018 [156]Simultaneous indoor speaker localization, speech recognition and gesture recognition for children in an indoor play setting3x 4-microphone array (Microsoft Kinect)SRP-PHAT SSL; GMM-HMM speech recognitionSpeaker location; recognized speechHighest average probability of gestures; majority voting for speech recognition; NN to fuse audio and visual speaker locations
3x RGB-D camera (Microsoft Kinect)Bag-of-words and SVM gesture classifierRecognized gestures
1x RGB-D camera (Microsoft Kinect)Skeleton trackingHuman location
Whitney 2016 [167]Recognizing human object references on a Baxter robot in a laboratory setting4-Microphone array (Microsoft Kinect)Unigram speech modelRecognized wordBayesian estimator using uni-, bi-, and tri-gram state models to estimate the object the human is referring to
RGB-D camera (Microsoft Kinect)Skeletal pose recognition (OpenNI) and elbow-wrist vector computationObject being pointed to
Yan 2018 [176]Multi-human tracking to train a 3D LiDAR human detector in indoor public areaRGB-D camera (ASUS Xtion Pro Live)Shoulder and head detectionTorso locationsMulti-hypothesis tracker to fuse leg and torso locations
2D LiDAR (Hokuyo UTM-30LX)Leg trackerLeg locations
Table 6. Survey of MMP Capabilities Using Late Fusion
GMM-HMM, Gaussian mixture model-hidden Markov model; HARK, Honda Research Institute Japan Audition for Robots with Kyoto University; MOBOT, mobility robot; PCA, principal component analysis; ROS4HRI, Robot Operating System for Human-Robot Interaction; SDK, software development kit; SRP-PHAT, steered response power with phase transform.
Table 7.
Application | Sensor Data \(\rightarrow\) | Data Processing \(\rightarrow\) | Output \(\rightarrow\) | Fusion Technique
D’Arca 2016 [36]Indoor speaker detection, tracking and identification in a laboratory environmentRGB cameraOptical flow feature extraction; object detectionSpeaker location and heightCanonical correlation analysis to fuse optical and audio features (Intermediate); Kalman Filter to fuse audio and visual location (late); NN to fuse speaker identity with location (late)
8-Microphone arrayGeneralized cross correlation with phase transformSpeaker location
MFCC feature extraction and classificationSpeaker identity
Efthymiou 2022 [38]Child location, activity, and speech recognition in ChildBot intelligent playspace4x RGB-D camera (Microsoft Kinect)Skeletal pose trackingChild locationsVarious combinations of visual feature concatenation/classification (intermediate); NN audiovisual data association (late); dialog flow manager to sequence interaction (late)
Dense trajectory, HoF, and motion boundary histogram feature extraction of RGB data; encoding via bag-of-visual words and VLAD; SVM classificationRecognized actions
4x 4-microphone array (Microsoft Kinect)Delay-and-sum beamforming; MFCC extraction and GMM-HMM triphone speech classificationSpeaker location and recognized speech
Filntisis 2019 [42]Child emotion recognition in intelligent playspace; recording of BabyRobot Emotional Database datasetRGB cameraFacial detection (OpenFace 2), feature extraction (ResNet 50), temporal max pooling, FC classifierFacial emotionConcatenation of body and face features (intermediate); classification and concatenation of whole body, body, and head emotion recognition scores (late)
Skeletal pose tracking (OpenPose), DNN feature extraction, temporal average pooling, FC classifierBody emotion
L. Lu 2021 [103] and Reily 2022 [131]Recognizing group activity in a human-robot search and rescue datasetRGB image (CAD and MUSRT datasets)HoG computationRGB HoG features for each human in sceneBag-of-words computation to fuse HoG features for each human (intermediate); weighted sum of human features to estimate overall group activity (late)
Depth image (CAD and MUSRT datasets)HoG computationDepth HoG features for each human in scene
Thermal image (CAD and MUSRT datasets)HoG computationThermal HoG features for each human in scene
3D LiDAR (MUSRT dataset)HoG computationLiDAR HoG features for each human in scene
Nigam 2015 [117]Classifying social context on a mobile robot platform in a university settingRGB-D camera (Microsoft Kinect)Grayscale pixel value computation; PCA and concatenationFused, reduced audiovisual feature vectorConcatenation and reduction of audiovisual features (intermediate); SVM, naïve Bayes, and decision tree social context classifiers (late)
Microphone (voice recorder)Amplitude computation; PCA and concatenation
Yumak 2014 [181, 182] and 2016 [180]Virtual agent + humanoid robot testbed for multiparty interaction research in indoor setting2x RGB-D camera (Microsoft Kinect)Skeleton trackingHuman locationLBP and HSV concatenation and k-NN classifier to determine identity (intermediate); NN fusion to assign speech and ID to human location (late); NN fusion to determine addressee from gaze (late)
LBP and HSV extraction; NN classificationHuman identity
ML deformable model fittingHead pose
8-Microphone arraySpeech harmonicity SSLSpeaker direction-of-arrival
Close-talk microphoneSpeech recognitionSpeech content and confidence
Table 7. Survey of MMP Capabilities Using Hybrid Fusion
Fig. 1. Survey results, organized by fusion method.

3.3 MMP Applications in HRI

The survey analyzed perception systems of robots and intelligent systems operating in a variety of domains and HSEs. Service robots, elderly care robots, educational robots, and social robots were among the most common domains surveyed. This coincides with the broader trend of aging populations in many developed countries and the potential for robots to address critical skill shortages for service tasks.
Surveyed service robotics systems include museum robots such as R3D3, Robotinho, and KeJia [99, 102, 116, 152], which can guide patrons to appropriate museum exhibits; Sacarino, a hotel concierge robot capable of guiding guests through the hotel and providing information and services [122]; the Social Situation-Aware Perception and Action for Cognitive Robots (SPENCER) airport robot, which can guide groups of passengers to their gate [80, 154]; the MultiModal Mall Entertainment Robot (MuMMER) [45, 46] and KeJia [32], mall robots designed to provide guidance and information to mall patrons; Roboceptionist, which provides information about office locations and building amenities [86]; and bartending robots as part of the Joint Action for Multimodal Embodied Social Systems (JAMES)/Joint Action Science and Technology (JAST) [44, 47, 120] and BRILLO [137] projects.
Surveyed elderly care robots include Mini, a tabletop robot designed to provide safety, entertainment, personal assistance, and stimulation [138]; Hobbit, a domestic care robot which can clear the floor, learn to bring objects, detect falls, and initiate video calls from its screen [43]; and the SocialRobot project, in which a mobile robot platform is developed to monitor routines, assist with memory and reminders, and initiate calls to maintain social engagement [123]. The I-Support system uses a non-contact multimodal interface to direct a bathing robot to perform specific assistive bathing actions, like “wash back” [186].
Education, socialization, or play is another common application for robots and intelligent systems operating in HSEs. Providing continued interaction is crucial to the social development of children with autism spectrum disorder (ASD); this need is addressed by the Kiwi robot, which played games and provided expressive feedback to children with ASD during a month-long, in-home study [76]. Other surveyed systems played games jointly with children [41] or educated them on how to play board games [121, 126]. Maggie is a tabletop robot that reads to a human companion based on their gesture and speech inputs [129]. Additionally, Efthymiou et al. [38] and Komatsubara et al. [87] deployed sensor arrays in children’s recreational and classroom settings, respectively, to perceive children’s status and inform more tailored play and education experiences.
Our survey also covered MMP systems employed in assistant tasks, such as robotic scrub nurses that hand instruments in a notional surgical task [75] and a pantry pick-and-place robot that hands food ingredients to a human collaborator in a notional household setting [167]. Multimodal interfaces can also be used to detect when a human collaborator needs assistance in human–robot collaboration tasks [168].

3.4 Information Provided by MMP HRI Systems

In this section, we present the results of the surveyed MMP systems in terms of the information they provide (in contrast to Section 3.3 which presented them in terms of application). Each of the surveyed systems utilized information about humans in HSEs—and in some cases, information about the environment itself—to more effectively interact with humans. For the purposes of this survey, we categorize information as attributes, status, communication, or context. Attributes refer to permanent or long-term characteristics of the person being observed which are unlikely to change during the course of the interaction. Status or state variables refer to quantities expected to vary over time, such as position, affect, or activity. Communication refers to observable quantities that explicitly convey information, like gesture or speech. Lastly, context refers to environmental information that is relevant to the interaction.
Example attributes that can be inferred are human identity, affiliation/role, demographic information, and soft biometrics. Facial recognition is the most commonly surveyed means of identifying humans [35, 52, 71, 72, 87, 105, 106, 110, 149]. D’Arca et al. [36] and Churamani et al. [35] perform voice identification by extracting and classifying audio features to infer the speaker’s identity. Multiple surveyed works [99, 102, 105, 110, 152] used visual data to estimate demographic information such as age and gender expression. To infer affiliation (e.g., employee or guest), Bohus et al. [20] visually analyze a human’s clothing. Likewise, Martinson et al. [108] use clothing recognition and height as soft biometrics to identify humans. Irfan et al. [71, 72] use a combination of age, gender, and height soft biometrics fused with facial recognition to infer identity. Shen et al. [143] estimate big five personality traits (Extroversion, Openness, Emotional Stability, Conscientiousness, and Agreeableness) with the intent of providing a more tailored HRI experience.
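As an illustration of fusing soft biometrics with facial recognition for identification (in the spirit of, but not reproducing, [71, 72, 108]), the following Python sketch scores gallery identities with a weighted sum of similarity terms; the gallery entries, weights, and height-similarity model are hypothetical.

import numpy as np

# Hypothetical gallery of known people; face and clothing similarity scores are
# assumed to come from upstream recognizers.
gallery = {
    "alice": {"height_m": 1.62, "face_sim": 0.71, "clothing_sim": 0.55},
    "bob":   {"height_m": 1.84, "face_sim": 0.42, "clothing_sim": 0.80},
}

def identity_score(observed_height, person, w_face=0.6, w_height=0.25, w_cloth=0.15):
    # Height similarity decays with the difference from the stored template.
    height_sim = np.exp(-abs(observed_height - person["height_m"]) / 0.1)
    return w_face * person["face_sim"] + w_height * height_sim + w_cloth * person["clothing_sim"]

scores = {name: identity_score(1.65, p) for name, p in gallery.items()}
print(max(scores, key=scores.get))  # best-matching identity ("alice")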
Common human status observations in the literature include tracking human location and movement [38, 45, 46, 123] and pose detection. Human localization can be accomplished through visual detection of the face or head [10, 11, 12, 14, 17, 18, 35, 42, 44, 45, 46, 47, 54, 59, 76, 102, 105, 106, 108, 110, 116, 114, 120, 122, 123, 149, 164], body or torso detection through vision or depth [36, 44, 47, 49, 80, 87, 97, 98, 101, 109, 120, 124, 154, 176], skeletal pose tracking [11, 18, 27, 28, 29, 34, 38, 42, 45, 46, 74, 100, 110, 150, 153, 156, 164], sound source (speech) localization [10, 12, 14, 18, 28, 38, 44, 45, 46, 47, 49, 54, 59, 73, 114, 120, 124, 149, 164], or by detecting legs or torso with a Light Detection and Ranging (LiDAR) sensor [18, 52, 80, 97, 116, 124, 154, 164, 176]. Multiple humans can be tracked simultaneously, as in [10, 28, 52, 80, 97, 98, 109, 116, 149, 154, 176], and multiple modes can be combined to increase the efficacy of the human tracking system. For example, LiDAR [97, 98, 124] and sound source localization (SSL) systems [10, 12] generally have a much larger spatial measurement range than cameras but lower measurement precision; however, a fused multimodal tracking system can leverage the best properties of each individual mode, providing increases in precision or range compared to unimodal systems.
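A simple way to see the precision benefit of fusing localization modes is inverse-variance (Kalman-style) fusion of independent position estimates, sketched below in Python; the example values for a LiDAR leg tracker and a camera torso detector are illustrative.

import numpy as np

def fuse_positions(estimates):
    """Inverse-variance fusion of independent 2D position estimates.
    estimates: list of (mean_xy, variance) pairs, one per modality."""
    weights = np.array([1.0 / var for _, var in estimates])
    means = np.array([mean for mean, _ in estimates])
    fused_mean = (weights[:, None] * means).sum(axis=0) / weights.sum()
    fused_var = 1.0 / weights.sum()  # always smaller than any single-mode variance
    return fused_mean, fused_var

lidar_est = (np.array([2.10, 0.95]), 0.04)   # long range, coarser
camera_est = (np.array([2.02, 1.01]), 0.01)  # limited field of view, finer
print(fuse_positions([lidar_est, camera_est]))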
Human emotion, intentions, and activity are other status variables that can be inferred through exteroceptive modalities. A human’s emotional state can be estimated by classifying facial expressions [99, 102, 110, 126, 152], speech audio features [123], or both [95, 104, 125]. Similarly, body pose and facial movement features can be fused to classify the emotional states of adults [126] and children [42]. One surveyed project used thermal imaging to estimate the peripheral neuro-vegetative activity of children and infer their emotional state [41]. Several surveyed systems were capable of estimating the intentions and engagement level of nearby humans, for example, to determine if the human is interested in initiating or ending an interaction with a robot. Pose, gaze, proxemics, speech activity, and gesture activity can all serve as cues to indicate intentions or engagement [11, 17, 18, 21, 27, 34, 44, 45, 46, 47, 105, 106, 110, 124, 164], or if a human teammate needs assistance in a collaborative task [168]. Activity recognition was performed by classifying visual, pose, or gesture cues for individuals [38, 74, 83, 100, 110, 134, 150, 186, 187] and groups [103, 131], and similar cues can be used to predict the motion of a group of individuals [178].
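The engagement-estimation pattern described above (pooling asynchronous cues into a fixed-length feature vector and classifying it) can be sketched as follows; the cue names, window length, and toy training data are hypothetical, and the random forest stands in for any of the classifiers used in surveyed works.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW = 2.0  # seconds; pool the most recent cue values within this window

def pool_features(buffers, now):
    """Pool asynchronous unimodal cues into one vector by taking each modality's
    most recent value inside the window (0.0 if no recent observation)."""
    feats = []
    for stream in ("gaze_on_robot", "distance_m", "speech_active"):
        recent = [value for t, value in buffers[stream] if now - t <= WINDOW]
        feats.append(recent[-1] if recent else 0.0)
    return np.array(feats)

# Toy training data: pooled feature vectors with engagement labels (1 = engaged).
X = np.array([[1.0, 0.8, 1.0], [0.0, 3.5, 0.0], [1.0, 1.2, 0.0], [0.0, 2.8, 1.0]])
y = np.array([1, 0, 1, 0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

buffers = {"gaze_on_robot": [(9.8, 1.0)], "distance_m": [(9.5, 1.1)], "speech_active": [(8.2, 1.0)]}
print(clf.predict([pool_features(buffers, now=10.0)]))  # e.g., array([1])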
Another category of information that robots and intelligent systems can obtain from humans is communication, which includes information exchange via commands or declarations. Automatic speech recognition (ASR) was the most commonly used communication modality, and a variety of ASR options exist depending on the application (as discussed in Section 3.6) [45, 46, 75, 99, 102, 110, 123, 126, 149, 152]. Gestures were the second most common communication modality [38, 75, 110, 126], sometimes combined with speech [1, 2, 101, 139]. Touch screens are another communication modality suitable for HSEs, since humans can exchange information intuitively using the robot’s onboard hardware as in [37, 43, 45, 46, 122, 123]. In surveyed works, communication was primarily used to issue commands to the robot (e.g., to perform a specific action) [1, 2, 75, 101, 139, 153] or to identify objects in the environment (also called referring expressions) [167].
Lastly, context is another type of information relevant to HRI in HSEs, since humans are often situated among other agents and objects. Examples include Nigam et al., who classify the overall social context in a university environment (studying, eating, or waiting) to determine if it is a good time to initiate an interaction [117]. The SPENCER robot estimates human group affiliations in an airport, which enables the robot to more accurately predict individual movements within the group since they travel together [80, 154]. The Robotinho museum guide robot considers the distances and poses of all humans in the tour group to determine who the robot should address and only tracks a single individual within the group instead of the entire group [116]. Object detection also provides contextual information. In collaborative human–robot tasks in HSEs, monitoring the status of task-specific objects gives insight into the human teammate’s performance. For example, Pereira et al. [121] monitor the status of game pieces in a joint human–robot game. Object detection can also provide contextual clues about human activities [27, 35, 131]. As the authors note, a sitting individual could be either eating or drinking, and the detection of food or drink, respectively, helps differentiate the two activities.
In practice, most human interaction systems simultaneously observe a combination of information types. The situated nature of human interaction means that multiple quantities must often be observed as a prerequisite for the application. Both [156] and [49] fuse simultaneous location and spoken command data, while [36] combines identity (an attribute), location (a status), and spoken communication. The virtual assistant/trivia game host in [20, 21] serves as a testbed for situated multiparty human dialog research and incorporates identity, role, attention/focus, goals, and spoken communication to properly interact with the human. Similarly, Yumak’s virtual assistant/robot system [180, 181, 182] is a testbed for multiparty HRI research and observes human identity (an attribute), location and gaze direction (statuses), and spoken communication to study complex HRI phenomena like turn-taking, focus of attention, and spatial formations.
Fusing multiple categories of information about a human can yield improved perceptual accuracy, since humans can convey mutual information implicitly through their status and explicitly through spoken words or gestures. An example is [101], where observing the human’s body motion (a status) improved command recognition accuracy (communication).
Sections 3.5–3.9 provide more technical detail about how data from each perception modality is acquired, processed, and fused to generate information.

3.5 Visual Data Acquisition and Processing

Vision was the most commonly used modality in the surveyed literature and provides a wealth of information relevant to HRI in HSEs. This section considers the vision components of multimodal systems in the survey; for a more comprehensive treatment of robotic vision systems in HRI alone, see the survey and review by Robinson et al. [135].

3.5.1 Vision Sensors.

Red–green–blue (RGB) [1, 2, 20, 21, 28, 36, 42, 52] and RGB-depth (RGB-D) cameras [29, 44, 47, 80, 109, 120, 121, 150, 154] were the most commonly used visual sensors in surveyed works. Universal Serial Bus webcams (e.g., the Logitech HD Pro C920 [149]) can provide RGB data [76], while other commercial RGB cameras used include the Microsoft LifeCam Studio HD [123], the Point Grey FIREFLY [108], and an Omron camera [87]. Additionally, many of the social robot platforms surveyed have embedded RGB cameras, such as Pepper [17, 71, 72, 105, 106, 143], Nao [71, 72], Nico [35], and Robotinho [116]. The Microsoft Kinect RGB-D camera was the most commonly used sensor in the entire survey [11, 12, 18, 27, 38, 43, 54, 75, 86, 87, 98, 117, 124, 126, 156, 164, 180, 181, 182]. Other RGB-D sensors in the surveyed literature include the ASUS Xtion Pro and its variants [34, 97, 123, 176], the Intel RealSense D435 [45, 46, 153], the Intel RealSense SR300 [126], and the Orbbec Astra Embedded S. Less common visual sensors include stereo cameras [44, 47, 120] (one of which is embedded in the iCub robot [14, 57]), time-of-flight depth cameras [114], the FLIR Lepton long-wave infrared camera [41], and the LeapMotion LEAP controller, a stereo infrared camera system designed specifically for hand tracking, used in [101, 127].

3.5.2 Visual Data Processing.

Surveyed works used various commercial, open-source, and custom techniques to process visual data and compute relevant HRI information.
Body, head, and hand pose estimation were common functions in surveyed works. Skeletal pose tracking is offered natively through the Kinect API as in [18, 27, 38, 86, 110, 127, 156, 164, 180, 181, 182] and provides spatial locations of selected joints (torso, shoulders, head). OpenPose [24] is another means of estimating human poses, gestures, and gaze via keypoint extraction. OpenPose is open-source, platform-agnostic and accepts any RGB or RGB-D image as input, as used in [28, 42, 153]. The Pepper robot’s software development kit (SDK) supports human localization and motion estimation and was used for body and head motion estimation in [143]. The OpenNI framework is another open-source pose recognition option and provides skeletal pose tracking and hand point tracking as used in [34, 43, 110, 167]; however, its last official update was in 2012. Banerjee et al. [11] and Foster et al. [45, 46] implement pose recognition using convolutional pose machines (CPMs). Head pose can be computed with the open-source OpenFace framework as in [17, 76, 110] or OpenHeadPose as in [45, 46]. Foster et al. use RGB least squares matching to estimate head position [44, 47, 120]. Ragel et al. use the MediaPipe tracker to get hand pose keypoints from RGB-D data in [127]. Lastly, the LEAP provides hand pose tracking natively as in [101, 127].
Pose information can be used to locate humans, recognize gestures, and infer their focus of attention. Chau et al. [28] use OpenPose to compute spatial pose keypoints of nearby humans from a mono RGB camera input; they compute the human’s 2-dimensional (2D) location by projecting the 3-dimensional (3D) keypoints into a 2D plane around the sensor platform. Likewise, Terreran et al. [150] compute skeletal pose keypoints of an RGB-D image with OpenPose, then use 2D-to-3D lifting and projection to project 2D keypoints onto the depth channel, providing a 3D pose estimate. Notably, OpenPose also provides a confidence measure which can be used to weight the visual pose detection in multimodal fusion applications as in [28]. The Hobbit elderly care robot computes a 3D bounding box around skeletal pose keypoints, then uses this spatial information to detect if the human has fallen [43]. An even simpler use of skeletal pose tracking is to estimate the number of humans in the scene by counting the detected skeletal poses, as Jain et al. did [76]. Simple gesture recognition can be accomplished via geometric modeling. For example, Kollar et al. [86] recognize the “hand raised” gesture if the wrist is above the shoulder, and Whitney et al. [167] compute the shoulder-to-wrist vector to recognize pointing gestures. OpenNI includes basic gesture recognition (swipe left/right, pointing) [43]. Supervised learning techniques can also be used for gesture recognition; Trick et al. use an SVM to classify OpenPose joint trajectories as gestures [153]. Paez et al. use skeletal pose recognition to classify “emotional gestures” such as stretching arms, hands on chin, or rubbing hands together [126]. Chao et al. use a geometric model to compute the angle between a human’s head and shoulders relative to hips, then use this to infer if the human is facing the robot or not [27]. Features extracted from pose data can also infer emotional state; Filntisis et al. use a DNN to extract affective features from children’s skeletal pose data, which is fused with facial features to classify their emotional state [42].
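To make the geometric approach concrete, the following minimal sketch shows how a “hand raised” gesture and a pointing target might be computed from 3D skeletal keypoints, in the spirit of the rule-based methods above; the joint names, coordinate convention (z up), and object list are illustrative assumptions rather than details from the surveyed works.

```python
import numpy as np

# Minimal sketch of geometric gesture recognition from 3D skeletal keypoints.
# Joint names, coordinate conventions, and example values are illustrative.

def hand_raised(joints: dict) -> bool:
    """Detect a 'hand raised' gesture: wrist above the shoulder (z is up)."""
    return joints["right_wrist"][2] > joints["right_shoulder"][2]

def pointing_direction(joints: dict) -> np.ndarray:
    """Approximate a pointing ray as the unit shoulder-to-wrist vector."""
    v = np.asarray(joints["right_wrist"]) - np.asarray(joints["right_shoulder"])
    return v / np.linalg.norm(v)

def pointed_object(joints: dict, objects: dict) -> str:
    """Return the object whose direction best aligns with the pointing ray."""
    ray = pointing_direction(joints)
    origin = np.asarray(joints["right_shoulder"])
    best, best_cos = None, -1.0
    for name, pos in objects.items():
        to_obj = np.asarray(pos) - origin
        cos = np.dot(ray, to_obj) / np.linalg.norm(to_obj)
        if cos > best_cos:
            best, best_cos = name, cos
    return best

# Example usage with made-up keypoints (meters, z up):
joints = {"right_shoulder": [0.2, 0.0, 1.4], "right_wrist": [0.6, 0.1, 1.2]}
objects = {"cup": [1.5, 0.3, 0.8], "book": [0.0, -1.0, 0.8]}
print(hand_raised(joints), pointed_object(joints, objects))
```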
Detection of human centroids is another visual processing technique found in the literature [10, 12, 20, 21, 36, 44, 47, 49, 80, 87, 97, 98, 120, 124, 154, 176] and typically involves supervised learning of labeled datasets while leveraging both depth and color from RGB-D data. Models such as DarkNet53 can fuse RGB and Depth features, then YOLOv3 can be used to detect 3D human centroids or 2D bounding boxes from the fused RGB-D features as in [98]. Fung et al. [48] use a multimodal YOLOv4 (MYOLOv4) implementation to extract and fuse RGB and Depth features to robustly detect 3D human body part positions in offline datasets. Similarly, Martinson et al. use AlexNet to localize humans in RGB, plus a custom depth-based segmentation, geometric feature extraction, and Gaussian mixture model (GMM) classifier to localize humans from the Depth channel [109]. They fuse RGB and Depth detections using a weighted sum and compare the fused performance with individual modes. Segmenting the depth channel into foreground and background, then using a depth-template or histogram of oriented gradients (HoG) upper-body detector on candidate regions was done in [80, 97, 176, 154] for near-field torso detections; far-field detections used a HOG detector on RGB data since depth data is unreliable beyond approximately 5 m [80, 97, 176, 154]. Komatsubara et al. [87] use a custom head-and-shoulders shape detection from RGB-D data; likewise, the JAMES bartending robot [47, 44, 120] uses 2D RGB silhouettes and 3D depth information to estimate 3D shoulder locations.
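As a rough illustration of the near-field/far-field split described above, the sketch below runs OpenCV's default HOG pedestrian detector on the RGB channel and uses the registered depth channel only to gate detections to an assumed reliable range; the full-body detector (in place of an upper-body template) and the 5 m cutoff are simplifications, not the exact pipelines of the cited systems.

```python
import cv2
import numpy as np

# Minimal sketch: HOG-based person detection on RGB, gated by depth validity.
# The full-body HOG detector stands in for the upper-body detectors cited
# above, and the 5 m cutoff mirrors the reported reliable depth range; both
# are illustrative simplifications.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(rgb: np.ndarray, depth_m: np.ndarray, max_depth: float = 5.0):
    """Return (x, y, w, h, median_depth) for detections with usable depth."""
    boxes, _ = hog.detectMultiScale(rgb, winStride=(8, 8))
    detections = []
    for (x, y, w, h) in boxes:
        patch = depth_m[y:y + h, x:x + w]
        valid = patch[np.isfinite(patch) & (patch > 0)]
        if valid.size == 0:
            continue  # no depth; a far-field RGB-only branch could handle this
        d = float(np.median(valid))
        if d <= max_depth:
            detections.append((x, y, w, h, d))
    return detections
```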
Nakamura et al. [114] localize humans using thermal and depth data in a Thermal-Distance integrated localization model which uses separate modalities to reduce false positives. First, a binary mask is applied to the thermal points to separate hot (above temperature threshold and presumably human) from cold pixels. Hot pixel regions are then mapped onto the depth image, and the background depth is subtracted. The resultant clusters of 3D points are considered human.
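A simplified version of this threshold-and-mask idea can be written in a few lines; the temperature threshold, depth margin, and assumption of pixel-registered thermal and depth images below are illustrative choices, not values from [114].

```python
import numpy as np

# Minimal sketch of thermal/depth gating for human detection, loosely in the
# spirit of the Thermal-Distance idea described above. Assumes the thermal and
# depth images are already registered pixel-to-pixel; thresholds are assumed.
def human_candidate_mask(thermal_c: np.ndarray, depth_m: np.ndarray,
                         background_depth_m: np.ndarray,
                         temp_thresh: float = 28.0,
                         depth_margin: float = 0.2) -> np.ndarray:
    hot = thermal_c > temp_thresh                                # warm pixels
    foreground = depth_m < (background_depth_m - depth_margin)   # background subtraction
    valid = np.isfinite(depth_m) & (depth_m > 0)
    return hot & foreground & valid        # clusters of True pixels ~ human candidates
```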
Once detected, body centroid locations can be used to localize and track humans or estimate the person’s focus of attention. For example, Linder et al. [97] and Yan et al. [176] project 3D torso detections onto a 2D plane surrounding a mobile robot, then fuse them with LiDAR leg detections in a multihypothesis tracker to localize and track multiple humans around the robot. The JAMES bartending robot [44, 47, 120] computes the angle between a human’s torso and the robot to infer if the human is facing the robot.
Facial detection on RGB images is another common function in surveyed literature [54, 122]. Face detection provides a 2D estimate of face locations in the RGB frame and is supported by open-source packages such as OpenCV as in [18, 149, 164], OpenFace as in [42, 76, 102, 110], and face_recognition as in [127], as well as by the Pittsburgh Pattern Recognition (PittPatt) SDK as in [108]. The most common facial detection algorithm was the Haar feature-based cascade classifier, also called the Viola-Jones algorithm [12, 35, 96, 116, 123, 165]. Haar-like features are simple geometric classifier patterns which, when cascaded into stages of increasing complexity, can rapidly search an image and detect complex visual patterns like faces. The algorithm is popular for real-time usage on robots and embedded systems because of its speed and simplicity. Other facial detection techniques include deep networks [14, 59] (which can also infer gaze direction as in [11]) or custom blob tracking of facial regions [44, 47, 120]. Both Martinson et al. [108] and Portugal et al. [123] project the 2D facial detection onto a 3D depth image to estimate the 3D location of the detected face. The Pepper robot’s SDK supports facial detection natively [105, 106].
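The Viola-Jones detector is available out of the box in OpenCV. The sketch below detects faces in an RGB frame and then, in the spirit of [108, 123], back-projects each 2D detection into 3D using a registered depth image and assumed pinhole intrinsics; the intrinsic values are placeholders.

```python
import cv2
import numpy as np

# Minimal sketch: Viola-Jones face detection on RGB, then back-projection of
# each detection into 3D using a registered depth image and assumed pinhole
# camera intrinsics (fx, fy, cx, cy are placeholder values).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces_3d(rgb, depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        u, v = x + w // 2, y + h // 2            # face center in pixel coordinates
        z = float(depth_m[v, u])                 # depth at the face center (meters)
        if z <= 0 or not np.isfinite(z):
            continue
        X = (u - cx) * z / fx                    # pinhole back-projection
        Y = (v - cy) * z / fy
        results.append((X, Y, z))
    return results
```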
Detected faces can be further processed to infer relevant HRI information such as identity, emotion, gaze, or soft biometrics. Commercial options for identification via face recognition include SoftBank Robotics’ NAOqi recognition module in [71, 72], and OKAO Vision, which computes Gabor wavelet transform features of each face and classifies them using an SVM [52, 87]. Dlib is an offline, open-source facial recognition option [110]. Other surveyed works implemented facial recognition by classification of local phase quantization [149] or local binary pattern (LBP) features [35, 118]. Commercial options for facial emotion recognition include FaceReader from Vicar Vision [99, 152], Microsoft Cognitive Services’ Emotion Recognition API [102], the Affdex SDK [126], and the eMax face analysis toolbox [95], which require a cloud connection. Supervised classifiers can also estimate emotion; Prado et al. [125] use principal component analysis (PCA) to extract features from facial keypoints (eyebrow, cheek, chin, mouth, etc.) and classify emotion via a dynamic Bayesian network (DBN), and Paez et al. use facial gestures (turn head, flexion, close left/right/both eyes) as features to indicate emotional state. Cloud-based age and gender demographic estimation are available through FaceReader, Microsoft Cognitive Services’ Face API [127], Pepper’s SDK [105, 106], and SoftBank Robotics’ NAOqi framework recognition module. OpenVINO has pre-trained models to estimate emotion and age/gender demographics without a cloud connection, as described in [110]. Gaze estimation is available offline via OpenFace [17, 76, 110, 168] and GazeSense [121]. Banerjee et al. use a cascaded deep network to estimate gaze and face location [11]; the SPENCER robot [80, 154] also uses a supervised classifier—a SVM that classifies difference of oriented Gaussians features in an RGB image—to estimate approximate gaze direction (left, right, back, front). The virtual assistant of Bohus et al. [20, 21] uses a gaze model trained on labeled images to infer if nearby humans are looking at or away from the system. Face detections can also be used to infer soft biometrics such as a human’s approximate height as in [71, 72, 108], although the location and orientation of the camera relative to ground plane must be known. Martinson et al. also use color histograms of the face as soft biometrics [108].
Miscellaneous facial image processing in the surveyed literature includes facial action unit computation, which can be done by OpenFace [17, 45, 46, 76] and used to classify emotion. Filntisis et al. use ResNet 50 to extract facial emotion/affective features, which they fuse with DNN-extracted features from skeletal pose data and classify via a fully connected neural network (FCNN) layer to estimate child emotional state [42]. Gomez et al. compute mouth movement activity to estimate if a human is speaking [54]. In a notable application of early fusion, Filippini et al. [41] used a fused RGB-Thermal image to locate a face using an RGB HOG detector and a regression tree ensemble to locate RGB facial keypoints; a FIR filter extracted the associated thermal features of the keypoints, which were then classified using a multi-layer perceptron (MLP) to estimate children’s emotional states.
A less common visual processing technique was hand detection and hand gesture recognition. Abioye et al. use an OpenCV Haar cascade to locate hands visually, then a convex hull to count fingers [1, 2], where each gesture indicates a specific robot command. Jacob et al. use a custom fingertip detector; the trajectory of each fingertip is smoothed using a Kalman Filter, and the trajectory’s speed and curvature are classified using a hidden Markov model (HMM) classifier [75], where each gesture corresponds to a specific command to a robotic assistant. Foster et al. use a custom blob tracker on the JAMES bartending robot to locate nearby humans’ hands, which are used as features to estimate the human’s overall status. Chao et al. compute relative motion of tracked hands and use this to estimate gesture activity (but not specific gestures). Gesture activity is then used to estimate overall human activity and engagement [27].
Aside from the capabilities listed above, visual data features can be classified via supervised learning techniques to recognize gestures and activity or fused with features from other modalities for multimodal classification. Feature extraction techniques in the surveyed works consisted of scene flow [139], space-time interest points [91, 139], or visual embedded vectors [70]. Processed visual data can be classified as a specific gesture using HMM and SVMs [139], LSTM networks, or transfer-learning trained networks [101]. Similarly, multiple surveyed works compute dense trajectory features, using histogram of optical flow and motion boundary histogram descriptors, which are then encoded via a bag-of-visual words or vector of locally aggregated descriptors framework and classified via an SVM [38, 83, 156, 187, 186] to detect gestures for children and the elderly. Liu et al. use attention layers to extract RGB spatial features and LSTM layers to extract skeletal pose features, which are concatenated and classified with an FCNN to estimate human activity [100]. Chu et al. use motion cues as features; motion cues are a simple geometric computation between successive joint positions and angles and are used to detect activity [34]. Islam et al. use ResNet 50 to extract spatial features from RGB and Depth images and LSTMs to embed the temporal features. The features are fused in a custom multimodal self-attention mechanism which classifies the activity seen in the RGB-D image [74]. Similarly, Robinson et al. [134] use a transformer network to extract, fuse, and classify activities of daily living (ADL) of elderly persons in RGB-D video. Reily and Lu compute HoG features for RGB, depth, and thermal images, then fuse them in a bag-of-words representation for each human. The individual and group activities are then inferred based on a weighted linear combination of these features [103, 131]. The mobility robot (MOBOT) uses OpenCV to compute optical flow variables in an RGB image, which is used to detect regions that contain activity [83, 187, 186]. Chen et al. [29] extract local RGB features using an HRNet backbone and global features using a CNN; these features are concatenated with a LiDAR feature vector to robustly estimate human pose in visually obscured conditions.
Supervised classification of visual features can also infer engagement or group affiliation. Vaufreydaz [164] and Benkaouar [18] use skeletal pose data to compute the distance from human to robot and Schegloff metrics—pose features based on computing the relative positions between hips, torso, and shoulders keypoints—then fuse these features with other modalities to classify if the human intends to engage with the robot. Banerjee et al. compute body position and orientation from tracked pose keypoints, then fuse these with head orientation, gaze direction, facial bounding box features, and audio cues. These fused features are input to multiple supervised classifiers that estimate a human’s interruptability, or if they are willing to engage with the robot [11]. Nigam et al. simply use grayscale values from RGB pixels and fuse them with audio features to estimate the overall social context and if humans are willing to engage with the robot [117]. Bohus et al. analyze the variance in the RGB pixel values of each human’s clothing to infer group affiliation; business attire (e.g., suits) resulted in higher RGB variance, and the person was assumed to be a visitor since members of their organization dressed casually [20, 21]. Likewise, RGB color histograms of clothing can be used as soft biometrics to help discern identity as in [108].
Most of the surveyed systems employed multiple visual processing techniques simultaneously. For example, the virtual agent/humanoid robot system of Yumak et al. [180, 181, 182] uses several visual processing techniques to localize humans, identify them, and estimate their gaze direction. For localization, they use Kinect’s skeletal tracking which locates joint keypoints. Similar to Bohus et al. [20, 21], Yumak’s system uses feature extraction and classification of a set of visual pixels to identify humans. LBP and hue saturation value (HSV) color features [166] are extracted for a patch of pixels around each skeletal joint. The LBP and HSV features are concatenated into a single vector (feature-level or intermediate fusion) for each person; a k-nearest neighbor (k-NN) classifier matches the visual features to an identity. To estimate gaze direction, they use a regularized maximum likelihood deformable head model fitting algorithm [23]. The rotation matrix and the translation vector of the head model are iteratively optimized using a method similar to iterative closest point, and then used to estimate the pose of the head. Head pose is then used to infer focus of attention as described in Section 3.9.
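As a simplified illustration of this feature-level identification pipeline, the following sketch concatenates an LBP texture histogram and an HSV color histogram for an image patch and matches the fused vector to enrolled identities with a k-NN classifier; the patch selection, histogram sizes, and neighbor count are assumptions rather than the parameters used in [180, 181, 182].

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.neighbors import KNeighborsClassifier

# Minimal sketch of feature-level (intermediate) fusion for re-identification:
# concatenate LBP texture and HSV color histograms of an image patch, then
# match against enrolled identities with k-NN. Parameters are illustrative.
def patch_features(patch_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hue_hist, _ = np.histogram(hsv[..., 0], bins=16, range=(0, 180), density=True)
    return np.concatenate([lbp_hist, hue_hist])     # fused feature vector

# Enrolment and matching (enrolled_patches, labels, and query_patch are
# assumed to exist in the calling code):
# knn = KNeighborsClassifier(n_neighbors=3)
# knn.fit([patch_features(p) for p in enrolled_patches], labels)
# identity = knn.predict([patch_features(query_patch)])[0]
```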

3.6 Audio Data Acquisition and Processing

Audio perception (or audition) is another common modality in MMP HRI applications and includes functions such as SSL, sound source separation (SSS), ASR, and voice activity detection (VAD). Despite being a relatively mature field, designing computer audition systems that can handle noise, reverberation, and multiple simultaneous sound sources remains a challenge in applied robotics research [61, 63, 64, 112]. This section discusses the audition systems present in fused multimodal systems; for more information on unimodal audition systems, refer to Rascon and Meza’s SSL survey [130] or to the review by Zhang et al. of nonverbal sound in HRI [183].

3.6.1 Audio Sensors.

In surveyed MMP HRI applications, a computer audition hardware setup typically comprises an array of multiple microphones when SSL is performed. This includes commercial products like the Microsoft Kinect’s 4-channel microphone array, used in [12, 18, 38, 44, 47, 124, 156, 164], the 2-channel array in the Asus Xtion Pro Live [123], the 7-channel Microcone [12], or the 8-channel TAMAGO-03 [28]. Other surveyed works used custom linear arrays of 4 microphones [20, 21], custom 6-channel arrays [127], 6–8 microphones organized into pairs [10, 36, 49], circular arrays [180, 181, 182], or 8-channel linear arrays [83, 187].
Some surveyed social robot platforms contain embedded arrays in support of audition functions. Pepper has a 4-channel array embedded in the head [17, 45, 46, 143], and Curi has two microphones embedded in the body [34], while Honda Research Institute’s Hearbo has a 16-channel array [54, 114]. Binaural microphone arrays are another common configuration and are usually found on humanoid or anthropomorphic robots such as the SIG2, NAO, or iCub systems [14, 57, 148, 179]. Eight to sixteen total audio channels may be desirable if a high degree of noise cancellation or SSL accuracy is required [67, 69, 140, 159]. The largest array in surveyed works was comprised of two 8-channel arrays and two 16-channel arrays, used for precise SSL in an indoor setting [73].
Mono-channel microphone systems were used in surveyed works where SSL was not required, since sound sources cannot be localized using a single audio channel. This configuration was used in applications where the robot interacts with a single person, such as: the UAV control interface of Abioye et al. [1, 2]; the robot of Churamani et al., which uses a single audio channel for voice identification and speech recognition; the Curi robot, which uses audio cues from a single human to determine engagement [27]; robots which play joint games with a single human, as in [76, 121, 168]; or Nigam et al., who analyze monochannel features to classify the audio scene [117]. The virtual agent/humanoid robot system of Yumak et al. [180, 181, 182] also used a single standalone microphone. A unique monochannel audio implementation is the Roboceptionist robot [86], which used an Android tablet’s microphone for speech recognition.

3.6.2 Audio Data Processing.

SSL was a common computer audition function in surveyed works, and various SSL algorithms exist for practical use on robots and intelligent systems. Several surveyed works [12, 28, 54, 73, 114, 124, 127] used variants of multiple signal classification (MUSIC) techniques (such as generalized eigenvalue decomposition-MUSIC or generalized singular value decomposition-MUSIC) for SSL in their applications. All were implemented using HARK middleware (Honda Research Institute Japan Audition for Robots with Kyoto University) [113]. HARK includes ambient noise filtering via extended minimum variance distortionless response, as in [127]. Note that MUSIC requires an estimate of the transfer function between the audio source and microphones, which becomes increasingly complicated for multidimensional SSL. Hence, several works [20, 28, 114, 124] computed the sound source direction of arrival (i.e., 1-dimensional (1D) SSL) in 5–10° increments. Notable modifications include Nakamura et al. [114], who include sound source identification via a hierarchical Gaussian mixture model to ignore non-speech sources, and Chau et al. [28], who extract features via DNN-based spectral mask estimation to increase noise robustness.
Other SSL techniques include the one used by D’Arca et al. [36], who compute the inter-channel time differences of arrival using generalized cross-correlation with phase transform (GCC-PHAT) and then use beamforming techniques to localize speakers indoors. Beamforming techniques were used for other indoor speech localization applications, such as the child–robot interaction systems in Tsiami et al. [156] and Efthymiou et al. [38] that used a steered response power with phase transform (SRP-PHAT) beamformer, and the MOBOT’s perception platform [83, 187]. Likewise, Yumak et al. [180, 181, 182] use a speech harmonicity beamforming algorithm [170] to compute speech direction-of-arrival from an 8-channel circular microphone array. Beamforming SSL and SSS can be implemented using the Open Embedded Audition System (ODAS), another free, open-source computer audition package [62]. Gebru et al. [49] use a supervised learning approach; they extract the auto- and cross-spectral density features between two binaural channels, then train a localization model using a labeled multimodal dataset. Likewise, Foster et al. incorporate SSL on a Pepper robot by using a supervised neural network [45, 46]. Belgiovine and Gonzalez-Billandon et al. also use a DNN for coarse 1D SSL, classifying the direction of arrival (DoA, or azimuth) into one of three zones in front of the robot [14, 58]. Lastly, the Microsoft Kinect can compute the sound source DoA through its API, as done in [18, 44, 47, 164], or when used within a networked array of multiple sensors as in [186]. For additional details on the categories, algorithms, and implementation of SSL, we recommend Rascon and Meza’s survey [130].
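To illustrate the cross-correlation family of SSL methods mentioned above, the following sketch estimates a GCC-PHAT time difference of arrival between two microphone channels and maps it to a coarse azimuth for an assumed microphone spacing; it omits the beamforming, multi-source handling, and noise filtering of the surveyed systems.

```python
import numpy as np

# Minimal sketch of GCC-PHAT between two microphone channels, converted to a
# coarse azimuth estimate. Assumes a single far-field source and a known
# microphone spacing; real systems add beamforming and multi-source handling.
def gcc_phat_tdoa(sig: np.ndarray, ref: np.ndarray, fs: int, max_tau: float) -> float:
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                         # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)

def azimuth_deg(tdoa: float, mic_distance: float, c: float = 343.0) -> float:
    """Map a TDOA to a direction of arrival for a two-microphone pair."""
    return float(np.degrees(np.arcsin(np.clip(tdoa * c / mic_distance, -1.0, 1.0))))

# Example usage for a 0.1 m pair spacing (ch_left / ch_right are 1D arrays):
# tau = gcc_phat_tdoa(ch_left, ch_right, fs=16000, max_tau=0.1 / 343.0)
# print(azimuth_deg(tau, mic_distance=0.1))
```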
ASR is the second most common audio processing function in the surveyed literature, with a range of techniques in use. For dialog-heavy, cloud-connected applications, commercial options include Microsoft Cloud speech recognition [20, 21, 121], Loquendo (now Dragon from Nuance) [122, 143, 116], iFlyTek Speech API [102], and Google Cloud’s speech-to-text API [35, 45, 46, 95, 105, 106, 127] (which has a Python software wrapper [149]). Offline, open-source ASR options are available in the Kaldi deep learning model [99, 152], DeepSpeech [168], or the Carnegie Mellon University (CMU) Sphinx framework [1, 2, 75, 123]. Julius is another open-source ASR framework which can be used offline and, along with Kaldi, features HARK integration. Audio feature extraction and supervised classification can also be used to generate custom acoustic and language models for ASR. These types of ASR models are typically used on embedded systems where large vocabulary recognition is not required and a smaller set of words or commands is sufficient. The most commonly used audio features are mel-frequency cepstral coefficients (MFCCs), which approximate the nonlinearities of human hearing in the frequency domain. MFCC computation is a well-documented process, included natively by HARK and other open-source audio processing software like Torchaudio. A 2D data tensor with MFCCs on one axis and time on the other can be visualized as a spectrogram and classified using supervised learning methods to infer the phonemes spoken. Sanchez-Riera et al. [139] extract MFCCs from speech, then classify the MFCC vector as a discrete robot command using HMMs and SVMs. Likewise, Liu et al. [101] classify MFCCs as a discrete set of spoken commands using a CNN. Similarly, Trick et al. compute the MFCCs of speech audio, then use the PyTorch-based Honk framework, a supervised CNN classifier, for small vocabulary speech recognition [153]. Tsiami et al. [156] use extracted MFCCs and their first temporal derivative (MFCC+\(\Delta\)) features as inputs to a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) to recognize phonemes in a speech recognition system. Kardaris and Zlatintsi [83, 186, 187] use a similar approach, extracting MFCC+\(\Delta\) features and classifying them with an N-best grammar-based language model using the HTK toolkit.
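The small-vocabulary pattern described above can be prototyped with standard audio tooling. The sketch below extracts MFCC+\(\Delta\) features (using librosa here rather than HARK or Torchaudio) and classifies mean-pooled features with an SVM, which stands in for the HMM and CNN classifiers of the surveyed works; file names, labels, and parameters are illustrative.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

# Minimal sketch of small-vocabulary spoken-command recognition: MFCC + delta
# features, mean-pooled over time, classified with an SVM. librosa is used
# here for brevity; the surveyed works use HARK, Torchaudio, HMMs, or CNNs.
def mfcc_delta_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)              # first temporal derivative
    feats = np.vstack([mfcc, delta])                 # shape: (2 * n_mfcc, frames)
    return feats.mean(axis=1)                        # crude temporal pooling

# Training and prediction (training_files and training_labels are assumed):
# X = np.stack([mfcc_delta_features(f) for f in training_files])
# clf = SVC(kernel="rbf").fit(X, training_labels)    # e.g., "stop", "follow"
# command = clf.predict([mfcc_delta_features("query.wav")])[0]
```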
Audio feature extraction and analysis can also be used to infer emotion, speech activity, personality traits, or identity. Open Emotion and Affect Recognition toolkit provides an online means of classifying emotional content in audio [123]. Offline options include Ma et al. [104], who extract and classify MFCCs to detect emotional content of speech using a trained supervised classifier model. The Praat toolbox can extract signal information like pitch, frequency, intensity/volume, harmonicity, and duration; Prado et al. use Praat-extracted features as input to a DBN which classifies the emotional content of the speech. The openSMILE toolbox can also extract audio features for emotional classification as in [95, 110], or for engagement detection as in [17]. Jain et al. use Praat-extracted audio features to classify engagement level. Sound cues—the difference in amplitude between successive samples—are a relatively simple audio feature which can also be used to infer human engagement as in [34]. Chao et al. use a Pure Data module to compute pitch [27]; Maniscalco et al. perform root mean square analysis of the audio signal [106]; both then use these features to detect speech activity. The Pepper robot used in the MuMMER project uses a supervised neural network to detect speech activity [45, 46]. Shen et al.’s Pepper implementation extracts MFCCs, voice energy and voice pitch, which are fused with visual features to estimate the human’s “Big Five” personality traits. Speaker identification (also called voice identification) through audio data can be accomplished via use of a CNN classifier as in [35], an implementation of i-vector feature classification in the ALIZE 3.0 framework [149], or by MFCC computation and GMM classification, trained on a 60-second speech sample for each speaker [36]. In [35], the authors note that MFCCs may be unreliable in noisy HSEs, so learned CNN features may be more suitable and also avoid the additional preprocessing step of computing MFCCs.
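The simple energy-based cues mentioned above (amplitude differences, RMS analysis) translate directly into code; the sketch below flags speech activity whenever short-term RMS energy exceeds a margin above an estimated noise floor, with the frame length, percentile, and margin as assumed values.

```python
import numpy as np

# Minimal sketch of energy-based voice activity detection: a frame is marked
# as speech when its short-term RMS exceeds a margin above an estimated noise
# floor. Frame length, percentile, and margin are illustrative values.
def frame_rms(signal: np.ndarray, frame_len: int = 512) -> np.ndarray:
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames.astype(float) ** 2, axis=1))

def voice_activity(signal: np.ndarray, margin: float = 3.0) -> np.ndarray:
    rms = frame_rms(signal)
    noise_floor = np.percentile(rms, 10)   # assume the quietest frames are noise
    return rms > margin * noise_floor      # boolean speech/non-speech per frame
```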
Lesser-used audio processing techniques include social context classification and lexical feature extraction. Nigam et al. use the signal amplitude in the time domain as a feature to estimate social context in HSEs; i.e., if the HSE is used to eat, study, or socialize [117]. Lexical feature extraction can be accomplished with OpenCCG, which provides embeddings of recognized speech as in [44, 47]; embeddings can reduce a large vocabulary of spoken words into the most likely command or words relevant to a service robotics domain.

3.7 Other Data Modality Acquisition and Processing Techniques

Beyond audio and vision, other robot sensor modalities can be used to infer human attributes, states, and communication. Leg detection and tracking through LiDAR scanning rangefinder data was one common perception technique. 2D LiDAR scanning rangefinders were common inclusions on mobile robots or in indoor sensor arrays [18, 73, 52, 164], while specific models used include the SICK LMS 500 [80, 97, 154], the Hokuyo URG-04LX [116], and the Hokuyo UTM-30LX [176]. Raw range data can be clustered, and the clusters classified as legs. Linder et al. use a random forest classifier implemented in OpenCV [97] to detect legs; the SPENCER robot segments LiDAR range data, clusters, extracts cluster features, and classifies legs using a boosted classifier [80, 154]. LiDAR leg detection was used in several other mobile robot and indoor sensor arrays surveyed [18, 52, 124, 164]. Trunk detection can also be accomplished via laser range cluster classification as in [116]. Individual LiDAR detections can be tracked using commercial software like ATRacker [52], by Particle Filtering [73], or by Kalman Filtering [18, 164] which provides the smoothed position and velocity of the detected person. However, a drawback of LiDAR leg detection is that humans must be standing up to be detected, and furniture or other obstacles can be misclassified as humans; adaptive background subtraction [18, 164] can remove non-human scan data which could otherwise be misclassified. For multiple human tracking, LiDAR scan data is typically fused with other modalities in specialized trackers as discussed in Section 3.9.
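A minimal version of this cluster-then-classify pipeline for 2D scans is sketched below; DBSCAN clustering and a simple cluster-width filter stand in for the trained leg classifiers used in the surveyed systems, and all thresholds are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Minimal sketch of 2D LiDAR leg detection: convert a scan to Cartesian
# points, cluster with DBSCAN, and keep clusters whose width is leg-like.
# The width filter stands in for the trained (random forest / boosted)
# classifiers cited above; thresholds are illustrative.
def leg_candidates(ranges: np.ndarray, angles: np.ndarray,
                   min_width: float = 0.05, max_width: float = 0.25):
    valid = np.isfinite(ranges) & (ranges > 0.1)
    xy = np.stack([ranges[valid] * np.cos(angles[valid]),
                   ranges[valid] * np.sin(angles[valid])], axis=1)
    labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(xy)
    centers = []
    for lbl in set(labels) - {-1}:                  # -1 marks DBSCAN noise points
        cluster = xy[labels == lbl]
        width = np.linalg.norm(cluster.max(axis=0) - cluster.min(axis=0))
        if min_width <= width <= max_width:
            centers.append(cluster.mean(axis=0))    # candidate leg position (x, y)
    return centers
```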
Sonar, or sonic rangefinders, are another sensory modality implemented on the Pepper platform [17, 105, 106]. While sonic rangefinders do not offer the same precision as laser rangefinders, they can still compute distance to humans and infer engagement zone information (e.g., public zone, social distance) through Pepper’s ALEngagementZones toolkit as in [17].
Touch screens are another option that can enable intuitive information transfer in HSEs. Touch screens can work well for applications where a robot has a small set of actions or functions and humans do not carry specialized interface hardware, as with the Sacarino concierge robot [122] and elderly care robots [43, 123]. Additionally, the Pepper platform and Bohus et al.’s virtual assistant [20, 21] both have touch screens to receive tactile inputs from nearby humans. In Fischinger et al.’s Hobbit robot user study, participants preferred the touch screen over gestures for sending discrete commands [43]. Touch screens can also offer privacy for confidential information like airport reservations or hotel bookings and avoid inaccuracies that are inherent to speech or gesture commands [122]. However, touch screens are not ideal in scenarios when the robot is moving excessively [32].
Less commonly used sensors in the surveyed works include thermal cameras, whose features were extracted and fused with other visual features to recognize activity in Lu and Reily’s MMP system [102, 131], and capacitive touch sensors, which were embedded in the Mini tabletop robot to recognize tactile interaction [138]. Chen et al. [29] presented the only surveyed project that used millimeter (mm)-Wave Radar pointcloud data (from the mmBody dataset); PointNet++ extracts local pointcloud features, and a multilayer perceptron extracts global pointcloud features, which are fused with visual features to robustly estimate human pose in visually obscured conditions. Both thermal and radar were primarily motivated by use cases in domains where vision is expected to be degraded, such as search-and-rescue or subterranean operations.

3.8 Multimodal Datasets

Several of the surveyed works [10, 17, 42, 49, 70, 74, 83, 100, 103, 104, 131, 139] used labeled datasets instead of live sensor data. Labeled datasets enable the benchmarking and comparison of different perception algorithms against a common standard and may be especially useful during development of MMP HRI systems and algorithms. Multimodal HRI datasets used by surveyed systems include:
The Audio-Visual Diarization dataset: 6-channel audio and stereo camera recordings of various groups of humans conversing in an indoor setting, along with annotations of their conversations. Used in [10, 49].
BabyRobot Emotion Database: Labeled audiovisual dataset in which children express one of six emotions. Developed by and used in [42].
The Collective Activity Dataset [33]: Labeled RGB video of group activities, such as crossing a street, talking, jogging, and dancing. Used in [131, 103].
CMU Panoptic11: 480X RGB cameras, 10X RGB-D Kinect sensors with skeletal pose annotations and speaker diarizations. Sixty-five sequences (5.5 hours) of various social interactions recorded from various angles in a panospheric dome. Used in [178].
Electronics and Telecommunications Research Institute (ETRI)-Activity-3D [77]: RGB-D and a skeletal pose data of 55 activities performed by 50 younger and 50 older adult subjects (112,620 total samples). Used in [134].
mmBody [30]: Millimeter-wave annotated 3D pointclouds, RGB-D data, and ground-truth motion capture of various human poses in obscured visual conditions (rain, smoke, poor lighting, occluded). Used in [29].
MOBOT-6a corpus [136]: Gestural and verbal commands (in German) performed by eight elderly subjects, used in [83].
Multi-modal Long-term User Recognition Dataset: Images of 200 users, with name, age, gender, height labels, and interaction times. Developed by and used in [72].
Multisensory Underground Search and Rescue Teamwork Dataset: RGB-D, thermal, and 3D LiDAR data recorded from a Clearpath Husky Unmanned Ground Vehicle (UGV) of a rescue team in an underground mine setting. Group actions are annotated as donning protective equipment, team movement, team stop, patient care, and installing mine shaft supports. Used in [103, 131].
Nanyang Technological University RGB+D [142]: Skeletal pose, depth, infrared, and RGB videos of annotated individual actions. Used in [100] and [178].
Robots with Auditory and Visual AbiLities Corpora [5]: Stereo vision synchronized with 4-channel audio and activity annotations. Used in [139].
Sun Yat-sen University (SYSU) Dataset [68]: RGB-D data of subjects performing actions with objects. Activities and object labels are annotated. Used in [100].
User Engagement in Spontaneous HRIs [16]: Used in [17].
University of Texas Kinect [172]: RGB-D and skeletal pose data of ten indoor activities of daily life (e.g., walking, standing up). Used in [74].
University of Texas at Dallas–Multimodal Human Activities Dataset [31]: RGB-D, skeletal pose, and inertial sensor data with labeled human activities. Used in [74].
Although not used in any of the surveyed works, the following datasets may also be useful resources to develop MMP systems for HRI in HSEs:
Artificial Intelligence for Robots [85]: 3X RGB-D Kinect sensors, depth maps, body indexes/labels, and skeletal pose data. Annotated human–human interaction activities (e.g., handshakes, hugs, entering/leaving area) with elderly subjects.
AudioVisual Emotional Challenge (2014) [160]: RGB facial images and audio data. Annotated with emotional state.
AveroBot [107]: 8X different camera + monochannel microphone combinations spread across three floors inside a building. Human identities are labeled for long-term re-identification using different hardware setups.
Learning to Imitate Social Human–Human Interaction [158]: 3X RGB-D cameras and 1X monochannel microphone to capture scene data, plus motion capture of body joints, eye tracking, and egocentric RGB data for human participants in the scene. Comprised of 32 dyadic interactions, with 5 sessions per dyad.
Multimodal Human–Human–Robot Interaction [26]: 2X static RGB cameras, 2X dynamic RGB cameras, and two biosensors recording information during dyadic human–human interaction and triadic human–human–robot interaction. Engagement and personality are annotated.
Dataset of Children Storytelling and Listening in Peer-to-Peer Interactions (P2PSTORY) [145]: 3X RGB cameras and monochannel audio of children reading and listening to one another. Interactive behaviors (e.g., nods, smiles, frowns) are annotated, and socio-demographic information is recorded for each subject.
Play Interaction for Social Robotics (PInSoRo) [92]: 2X RGB-D and monochannel audio recordings (Intel RealSense SR300) of two children playing, 1X RGB-D recording (Microsoft Kinect One) of the environment. Annotated with task engagement, social engagement, and social attitude for each child.
Penn Subterranean Thermal 900 (PST900) [144]: 894 fused RGB-thermal images; synchronized and calibrated with per-pixel annotations of four classes from DARPA Subterranean Challenge.
Universidad de Málaga Search and Rescue [111]: Overlapping RGB and thermal infrared monocular data, 3D LiDAR, GPS, and IMU recorded from a UGV during an underground SAR exercise. Includes persons, vehicles, debris, and SAR activity in unstructured terrain.
While multimodal datasets are highly practical for perception model development, models trained on datasets should not be used if the dataset differs significantly from the application data. For example, vision models trained on an indoor camera may not be applicable for use on a mobile robot, since differences in perspectives or hardware may result in warping or image distortion, rendering the model unusable [8]. Likewise, language models and speech recognition systems designed for adults do not transfer well to children [15, 152].

3.9 Data Fusion Techniques

Fusing information from multiple unimodal streams can be accomplished using a variety of techniques, depending on the availability and modalities of data for the specific HRI application. In this section, we discuss surveyed fusion techniques. Recall from Section 3.2 that we use the multimodal fusion taxonomy of Baltrusaitis et al., who categorize fusion techniques as model-based or model-agnostic [9]. Model-based approaches are more recent and less commonly used in robotics and use kernels, graphical models, or neural networks (including Transformers) to combine different data modalities and seek underlying patterns in the data. Model-agnostic methods fuse the raw data, extracted features, or the processed outputs of different modalities without the use of trained models. Model-agnostic methods are more generic and more commonly used in robotics [13].

3.9.1 Model-Based Fusion Techniques.

Model-based fusion techniques are categorized by the type of model used to fuse data streams [9]. Recall from Section 3.2 that this includes Multiple Kernel Fusion, Graphical Model Fusion, and Neural Network Fusion. Despite being derived from well-established ML algorithms, model-based fusion techniques are less common in robotics; they often require large amounts of relevant data and substantial onboard compute to be used in real time on embedded systems. However, advances in onboard computing and ML research have resulted in model-based fusion techniques being employed on robots in recent years.
Graphical Model Fusion can combine asynchronous data from different streams, and both training and inference are faster than neural network model fusion [72]. Additionally, graphical models can continue to provide a state estimate based on prior information, even if updated information is sparse. Graphical models can also model states that change temporally. These are all desirable qualities for onboard robotic sensor fusion. Examples of Graphical Model Fusion include Irfan et al., who use a multimodal incremental Bayesian network to fuse facial identity, height, age, and gender information to estimate a human’s identity incrementally over time [71, 72]. Banerjee et al. compare the performance of graphical models such as HMMs, CRFs, hidden CRFs, and latent-dynamic CRFs to categorize interruptability based on pose, gaze, and audio signals [11]. Lastly, Markov decision process (MDP) graph models can be used as dialog management or task planning engines. Lu et al. use a Partially Observable Markov Decision Process (POMDP)-based state estimator and event reasoner to fuse multimodal inputs [102], while the SPENCER robot uses a Mixed Observability Markov Decision Process (MOMDP)-based collaboration planner to estimate human intentions from asynchronous multimodal inputs [80, 154]. Notably, the MOMDP collaborative planner is just one component of SPENCER’s supervision system, an integrated perception and planning system that affects specific actions based on user inputs and perceptual data.
Neural Network Fusion was also used in surveyed works to align and fuse multimodal features. Neural Networks are often used within an end-to-end fusion and classification/prediction framework, where one layer is used to fuse data or features, and another is used to classify or decode the feature representation. Linder et al. use the DarkNet-53 network to fuse RGB and Depth features; the fused feature vector is then classified with a modified YOLOv3 network to compute human centroid location in the Depth image and 2D bounding box for the RGB image [98]. Fung et al. [48] use a custom YOLOv4 variant—MYOLOv4—to individually extract RGB and Depth features and fuse them to robustly detect people and body parts. Robinson et al. [134] use ResNet to extract RGB-D video features, a graph convolutional network (GCN) + Self-Attention network to extract pose features, and a CSPDarknet + PANet + YOLO network to extract object location features. A Spatial Mid-Fusion Module then embeds each modality’s features into a feature vector, which is classified by a Dense Neural Layer to estimate which ADL is occurring in the RGB-D video. Li et al. use a bidirectional gated recurrent unit (BGRU) to fuse audio, lexical, and facial features before classifying the fused feature vector with a FCNN to infer the human’s emotion [95]. Chen et al. [29] use a Transformer integrated module to fuse global RGB image and mm-Wave Radar pointcloud features into a multimodal feature vector. The feature vector is then classified by a fusion Transformer module, which can selectively weight the most salient features for more robust human detection when vision is occluded. G. Liu et al. [100] compute spatial RGB features using cross-modal and self-attention mechanisms and temporal features from skeletal pose data via bidirectional LSTM networks. The spatial RGB features and temporal pose features are fused via a concatenation layer and classified via a FCNN to estimate the human’s activity. Yasar et al.’s IMPRINT network [178] uses Neural Networks and Transformer models within an end-to-end feature extraction, fusion, and decoding network on RGB-D data. Unimodal feature extractors compute spatio-temporal features for skeletal poses in the scene, and a GRU encoder extracts contextual features from the scene. The pose and contextual features are fused with an Interaction Module that also estimates group affiliations. A multimodal context module fuses the processed group, individual, and context features. Lastly, the fused and processed feature embedding is decoded by a Motion Decoder to estimate where each person in the scene will move. In each case, Neural Network Fusion can synchronize asynchronous data streams and leverages mutual information between individual modalities.
Our survey found no instances of Multiple Kernel Learning models used in multimodal fusion.

3.9.2 Model-Agnostic Fusion Techniques.

As discussed in Section 3.2, model-agnostic fusion techniques can be categorized as Early (Data) Fusion, Intermediate (Feature) Fusion, or Late (Decision) Fusion depending on when the data are fused [25, 128, 180].
Early fusion involves the synchronization and alignment of raw, independent data streams into a single structure. Many robots or embedded systems feature asynchronous, heterogeneous data streams, making early fusion challenging; for example, it is not obvious how a 2D image should be merged with a 1D time-indexed dataset. As noted in [13], data synchronization and alignment is relatively expensive computationally. As such, early fusion is generally not feasible for real-time applied robotics or embedded systems applications, and we found only one project that employed early fusion: Filippini et al. [41] performed extrinsic calibration to find the correspondence between an RGB and a thermal camera. They generate a fused RGB-thermal image by converting thermal pixels into corresponding RGB pixels and linearly interpolating the sampled value if the thermal sample time falls between RGB sample times. Similarly, the PST900 dataset [144] uses an extrinsic calibration technique to align RGB and thermal images.
Intermediate or feature-level fusion is generally used when mutual information is conveyed across multiple heterogeneous modalities simultaneously. Specifically, separate audio and vision sensors may detect mutual information about the same object or event. Hence, a fused audiovisual feature may be more informative than separate unimodal audio and visual features.
Intermediate fusion is typically accomplished through alignment, summation, and/or concatenation of unimodal feature vectors. H. Liu et al. [101] use intermediate fusion to perform multimodal command classification for a collaborative human–robot manufacturing task. They use artificial neural networks (ANNs) to extract features from the human’s speech, gesture, and body movements. The features are fused into a single vector, then classified using a MLP as one of six potential commands. Ma’s emotion recognition network [104] functions similarly; CNNs are used to extract features from audio and visual data, which are then fused into a single tensor via an MLP which is classified using an SVM. Ben-Youssef et al. record asynchronous proxemic, gaze, head, and speech measurements, extract features, and fuse the features via temporal pooling of sliding temporal measurement windows [17]. Benkaouar and Vaufreydaz’ domestic robot [18, 164] fuses various spatial, speech, and skeletal pose features from asynchronous sensors; they use neutral values when one of the features is unavailable, and they use the last known feature value for features that are not computed at the current timestep. The fused feature vector is classified via an ANN and SVM to estimate engagement. Lu and Reily [102, 131] compute HoG features for multiple sensor modalities, then represent the fused features in a bag-of-words for each person in a group. Each human’s fused feature vector is multiplied with a weighted matrix and summed to estimate the activity of the group. Vector concatenation is another intermediate fusion technique, used by Filntisis et al. [42]. They extract facial and pose features using Resnet 50 and DNN, respectively, then concatenate both into a single feature vector that is classified by a FCNN to estimate human emotion. Similarly, Islam et al. [74] extract features from RGB and skeletal pose modalities using a self-attention mechanism; they evaluate intermediate fusion using both concatenation and summation of the feature vectors. The fused vector is then classified by a FCNN to determine the human’s activity. Using PCA and data concatenation, Nigam et al. fuse grayscale values from RGB data and signal intensity from the audio channel; they then use SVMs, decision trees, and naïve Bayes classifiers to infer social context [117]. Shen et al. [143] use linear interpolation and truncation to align variable-sized visual and audio feature vectors temporally, which are then concatenated into a single vector and classified. Wilson et al. [168] employ intermediate fusion to combine gaze and speech features into a single feature vector. Since speech commands are sparse and asynchronous, they use a linear degradation function to progressively reduce speech feature weight over time if speech features are not readily available. The fused feature vector is then classified with a random forest to estimate if the user needs assistance.
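The concatenation pattern that recurs in these works can be written compactly; the sketch below fuses assumed speech, gesture, and body-motion feature vectors into a single vector and classifies it with a small MLP. The feature dimensions and six-command output loosely mirror the setup of [101] but are otherwise illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of intermediate (feature-level) fusion: unimodal feature
# vectors are concatenated and classified by a small MLP. Feature dimensions
# and the six-class output are illustrative.
class ConcatFusionClassifier(nn.Module):
    def __init__(self, speech_dim=26, gesture_dim=34, body_dim=30, n_classes=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(speech_dim + gesture_dim + body_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, speech_feat, gesture_feat, body_feat):
        fused = torch.cat([speech_feat, gesture_feat, body_feat], dim=-1)
        return self.mlp(fused)                       # command logits

# Example usage with random stand-in features for a batch of 4 observations:
model = ConcatFusionClassifier()
logits = model(torch.randn(4, 26), torch.randn(4, 34), torch.randn(4, 30))
print(logits.argmax(dim=-1))                         # predicted command indices
```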
Model-agnostic late fusion techniques were the most common category of fusion in the surveyed works. We found examples of late fusion implemented using simple arithmetic or logical comparison, probabilistic and Bayesian methods, action engines and state machines, data association, and supervised classifiers as described below.
Simple late fusion can be accomplished through mathematical operations on unimodal data streams. Weighted summation was one common late fusion technique, which was used for human identification and command/intent recognition. Tan et al. and Churamani et al. compute human identity through a weighted sum of voice and facial identification confidence values [35, 149]. Similarly, Martinson et al. compute how closely a human’s face, clothing, and height match a database of known values; the weighted sum of these similarity metrics is used to select the most likely human identities [108]. Pourmehr et al. [124] independently track humans using Kalman Filters for visual, audio, and LiDAR input data, generating three separate occupancy grids, each representing the likelihood of humans present. The fused output is simply a weighted average of the three independent modalities’ occupancy grids. Sanchez-Riera et al. [139] also used weighted averages of individual modes to classify human commands. A visual gesture classifier and a verbal speech SVM classifier each estimate which command was issued (“Hello,” “Bye,” “Yes,” “No,” “Stop,” “Turn”); the two modes are weighted and summed to generate a posterior estimate of the command. The MuMMER robot computes an “interest likelihood” from head pose, gaze, and location data, then uses a weighted sum to infer who is likely to interact with the robot [45, 46].
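The weighted-sum pattern is straightforward to implement; the sketch below fuses per-command confidence vectors from two independent classifiers with fixed modality weights. The command set echoes the example above, but the weights and confidence values are illustrative, not taken from the surveyed works.

```python
import numpy as np

# Minimal sketch of late fusion by weighted summation of unimodal confidences.
# Each classifier outputs a confidence per command; the modality weights are
# illustrative and would normally be tuned or learned.
COMMANDS = ["hello", "bye", "yes", "no", "stop", "turn"]

def weighted_sum_fusion(gesture_conf, speech_conf, w_gesture=0.4, w_speech=0.6):
    fused = w_gesture * np.asarray(gesture_conf) + w_speech * np.asarray(speech_conf)
    return COMMANDS[int(np.argmax(fused))], fused

# Example: gesture weakly favors "stop", speech strongly favors "stop".
gesture_conf = [0.1, 0.1, 0.1, 0.1, 0.4, 0.2]
speech_conf = [0.05, 0.05, 0.1, 0.1, 0.6, 0.1]
print(weighted_sum_fusion(gesture_conf, speech_conf))
```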
Logical operations on unimodal data streams are another way of performing simple late data fusion. For example, the speech and gestural interface in [1, 2] will fuse the two command inputs only if they arrive within 0.5 seconds of one another and contain complementary information; redundant commands are discarded. Ishi et al.’s LiDAR and microphone system fuses human location measurements only if they are within a distance threshold of one another [73]. The audiovisual speaker localization system in [54] takes sound source location, face location, and mouth movement as inputs; the speaker location is only returned if mouth activity is detected near the sound source. Lastly, the speech and gestural interface in [83, 187] computes confidence values for recognized speech and gesture commands, then ranks all received commands and selects the command with the highest overall confidence.
Projection and/or lifting is another means of late fusion and can be used to combine processed data from one modality with raw data from another. Terreran et al. detect 2D pose keypoints in an RGB-D image using OpenPose, then use 2D-to-3D lifting and projection to fuse the 2D keypoints with the depth channel of the RGB-D image, yielding a 3D pose estimate. Due to the inherent processing time of OpenPose, this approach is best suited for offline operation, and the authors of [150] apply it to datasets.
Action engines and managers are another means of handling processed outputs of multiple modalities and work well for HRI applications when actions, states, and state transitions can be explicitly defined. The simplest case is to execute actions based on any commanded input as they are received—such as processed touch screen, gesture, or verbal commands—as done by the Hobbit service robot in [43]. More involved dialog, action, and service managers implement execution rules and reason about potential courses of action before execution. Roboceptionist’s dialog manager uses rules to determine when to process verbal and gestural commands and when to respond. It also parses the semantic content of human speech to select the most appropriate response given the current dialog state [86]. Based on visual and speech inputs, R3D3 also uses a dialog manager to manage initiative and conversation flow and determine which museum exhibit a human wants information about [99, 152]. The SocialRobot platform allows developers flexibility by allowing them to implement custom action descriptions and behaviors in an XML file that is managed by the robot’s action engine [123]. Efthymiou et al. use a dialog flow manager, which monitors for specific speech or gesture inputs to select the next robot action in a series of child–robot interaction games and scenarios [38]. Chao et al. use a timed Petri net architecture to asynchronously handle visual and audio stimuli to determine when to interact with a nearby human [27]. Many surveyed works implemented action engines or dialog managers using state machines, where state transitions are affected by specific perceptual inputs. For example, Maniscalco et al. implement a behavior manager on the Pepper robot using SMACH13 state machine software; Pepper will not interact with a human if they are not engaging with the robot and will not respond to a person unless the person has been identified and initiated interaction with the robot [105, 106]. Churamani et al. also implement a dialog manager in SMACH [35], which uses multisensory cues from nearby humans to select its spoken responses. In surveyed works, state machines were used to generate hotel robot service behaviors [122] and robot social gaze [121]; Jacob et al. use a verbal- and gestural-controlled state machine to pick and pass surgical instruments and enable or disable specific sensory modalities [75]. One advantage of state machines is that they can be programmed to better utilize limited onboard compute resources in mobile robots; for example, Chu et al. do not perform some human perception functions if there are no humans in the scene [34].
Ragel et al. [127] use a sequential data fusion pipeline to combine skeletal pose, face, hand, and speech data into a single Person instance. Each mode is processed individually using a variety of techniques, and the Person instance is incrementally updated as new data from each modality becomes available. The system handles asynchronous messages via standardized Robot Operating System (ROS) data structures and middleware.
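The incremental-update pattern can be sketched as follows; the Person fields and update logic are illustrative stand-ins, not the actual ROS message definitions used in [127].

from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    """Aggregate of the latest unimodal results for one perceived person."""
    person_id: int
    skeleton: Optional[list] = None       # latest skeletal pose keypoints
    face_id: Optional[str] = None         # latest face recognition result
    hand_gesture: Optional[str] = None    # latest recognized hand gesture
    utterance: Optional[str] = None       # latest transcribed speech

    def update(self, modality: str, value):
        # Each unimodal module calls update() asynchronously as results arrive.
        setattr(self, modality, value)

p = Person(person_id=1)
p.update("face_id", "known_user_03")
p.update("utterance", "hello robot")
print(p)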
Probabilistic and Bayesian techniques comprise another broad category of late data fusion; in the surveyed literature, they were used for human localization and tracking, as well as command, intent, and emotion recognition. Ban [10] and Gebru [49] represent visual and audio localization results as probability distributions, then fuse the two modes via Bayesian estimators to compute the posterior location estimate. Likewise, Bayram and Ince [12] and Nakamura [114] use particle filtering to fuse visual and audio location measurements into a single posterior estimate. Tan et al. use probabilistic hypothesis testing to fuse sound source and face locations and distinguish them from background sound sources [149]. Chau et al.’s Audio-Visual Simultaneous Localization and Mapping algorithm [28] is designed specifically for simultaneous localization of the robot platform and multiple nearby humans. They use a Gaussian Mixture Probability Hypothesis Density filter, a multitarget tracking filter that models intermittent sensor visibility and human entry/exit from the field of view as random finite sets, to fuse audio and visual location measurements.14 Linder [97], Yan [176], and Nieuwenhuisen [116] use multi-hypothesis tracking—another common multiobject tracking method—to fuse visual and LiDAR human position measurements for multiple human tracking in dense social environments. Bohus et al. [20, 21] use probabilistic models to represent quantities like human engagement, activity, and initiating or terminating contact. Both Trick et al. and Whitney et al. fuse speech and gesture to recognize the object that a human is referring to via referring gestures. A prior estimate of the referent can be recursively updated with incoming speech and pointing gestures as in [167], or a categorical probability distribution over candidate referents can be computed via an independent opinion pool, as in [153]. Prado et al. use DBNs to classify facial and speech emotional content, and then use an additional DBN to fuse the unimodal results [125].
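As a concrete example of the simplest of these techniques, an independent opinion pool in the style of [153] multiplies per-modality categorical distributions over candidate referents and renormalizes; the two modalities and three candidate objects below are invented purely for illustration.

import numpy as np

def independent_opinion_pool(*distributions):
    """Fuse categorical distributions from independent modalities by
    elementwise multiplication followed by renormalization."""
    fused = np.ones_like(distributions[0], dtype=float)
    for dist in distributions:
        fused *= dist
    return fused / fused.sum()

# Speech weakly prefers object 0; the pointing gesture strongly prefers object 1.
p_speech = np.array([0.5, 0.3, 0.2])
p_gesture = np.array([0.1, 0.8, 0.1])
print(independent_opinion_pool(p_speech, p_gesture))   # fused posterior over objects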
Data association techniques can also be employed for late fusion. NN association is suitable for assigning information from an imprecise spatial measurement to a second, more precise spatial measurement. For example, Gebru et al. [49] use NN fusion to attribute coarsely localized speech activity to a precise visual human location; the fused result contains precise location and speech information. Similarly, D’Arca et al. [36] use NN fusion to attribute speaker identity from audio data to a more precise, fused audio-visual human location. Yumak et al. [180, 181, 182] use NN fusion to assign speech to a visually localized human: they compute the speech direction-of-arrival vector, then fuse the associated speech content to the person nearest to the vector. Komatsubara et al. [87], Glas et al. [52], and the Robot Operating System for Human-Robot Interaction (ROS4HRI) fusion module [110] use NN association to fuse facial identity with a detected body. NN and extended NN association can also be used to assign position measurements to humans that are currently tracked by Kalman filtering, as in [97]. The Hungarian algorithm (also called the Munkres algorithm) is another data association technique, used in the literature for associating multiple measurements with multiple existing detections in late fusion. Belgiovine and Gonzalez-Billandon et al. use the Hungarian algorithm to assign multiple face detections to existing faces that are tracked via Kalman filtering [14, 57], while Nieuwenhuisen et al. use it to assign visual and LiDAR position measurements to humans tracked via multi-hypothesis tracking [116].
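The assignment step described above can be reproduced with an off-the-shelf Hungarian solver; the sketch below matches new face detections to tracked face positions by minimizing pairwise distance, with coordinates and a gating threshold invented only for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Rows: currently tracked face positions; columns: new detections (in pixels).
tracks = np.array([[100.0, 200.0], [400.0, 220.0]])
detections = np.array([[405.0, 225.0], [102.0, 198.0], [600.0, 50.0]])

cost = cdist(tracks, detections)                  # Euclidean distance matrix
row_ind, col_ind = linear_sum_assignment(cost)    # Hungarian (Munkres) assignment

max_dist = 50.0                                   # gate out implausible matches
for t, d in zip(row_ind, col_ind):
    if cost[t, d] < max_dist:
        print(f"track {t} <- detection {d} (distance {cost[t, d]:.1f} px)")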
The last category of late fusion found in our survey is supervised classification, which was used for several HRI functions that infer multilabel quantities given features or labels from multiple unimodal streams. (Note that this is strictly different from the model-based fusion techniques discussed in Section 3.9.1, which use trained models to fuse data rather than to classify it.) Engagement and status recognition were common use-cases for supervised classification in the literature, and projects typically evaluated several supervised classification techniques; this is recommended because the results of supervised classifiers can vary significantly based on the application and available data. For the JAMES bartending robot, Foster et al. compute head and hand positions, torso and head poses, and speech activity, then classify them to infer the bar patron’s current state. They evaluate binary regression, logistic regression, multinomial logistic regression with ridge estimator, k-NN, decision tree, naïve Bayes, SVM, and propositional rule classifiers [44, 47]. Ben-Youssef et al. compute gaze, head motion, facial expression, gesture, and speech features from separate perception modules, then evaluate logistic regression, DNNs, LSTMs, and GRUs to classify engagement [17]. Vaufreydaz et al. compute position, velocity, speech, sound source, and pose features via separate perception modules, then classify engagement using an SVM and an ANN [164]. Jain et al. compute facial features and audio signal features while measuring performance in a joint child–robot game; these are used to classify the child’s engagement using naïve Bayes, k-NN, SVM, NN, logistic regression, random forest, and gradient boosted decision tree classifiers [76]. Banerjee et al. compute head orientation, gaze, body position, body orientation, and audio signals, then classify the human’s interruptibility (i.e., willingness to engage) using SVM and random forest classifiers; they also compute engagement using model-based fusion techniques like CRFs, as discussed previously [11]. Paez et al. estimate a human’s emotional state using k-means classification of recognized body pose, facial gestures, and facial emotion [126]. Lastly, Terreran et al. [150] individually classify pose keypoints of the body and hands with the Shift-GCN architecture, then use ensemble averaging to estimate the person’s overall activity.
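A typical late-fusion classifier of this kind simply concatenates the unimodal feature vectors and trains a standard classifier on the result. The sketch below compares an SVM and a random forest on synthetic engagement data; the feature dimensions, labels, and data are invented for illustration and will not produce meaningful accuracy.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
gaze_feats = rng.normal(size=(n, 4))      # e.g., head orientation and gaze angles
speech_feats = rng.normal(size=(n, 6))    # e.g., energy, pitch, speaking rate
pose_feats = rng.normal(size=(n, 8))      # e.g., body position and velocity
X = np.hstack([gaze_feats, speech_feats, pose_feats])   # fused feature vector
y = rng.integers(0, 2, size=n)            # engaged / not-engaged labels

for name, clf in [("SVM", SVC()), ("Random forest", RandomForestClassifier())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.2f}")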
More complex HRI systems employed combinations of intermediate and late fusion, known as hybrid fusion. Yumak et al.’s system [180, 181, 182] used intermediate fusion to identify humans by combining and classifying LBP and HSV visual features, and NN late fusion to assign speech content to localized humans and infer their focus of attention. Their fusion module also employs a world model and a rule-based dynamic user entrance/leave mechanism to infer when humans enter or exit the field of regard. The final, fused result contains each nearby human’s identity, location, and speech content, as well as when they entered or left the area.
D’Arca et al. [36] used a hybrid of feature-level and decision-level fusion methods along with state estimation. Kalman and particle filters fuse audio and visual localization results to track speakers in the field of regard, while speaker detection is accomplished by canonical correlation analysis, a feature-level fusion technique that combines audio MFCCs and visual optical flow features to determine who is speaking. The final, fused result contains speaker location and identity and is robust to visual occlusions. The resulting system is comparatively complex—it comprises an overall fusion network and subnetworks for tracking, speaker recognition, and audiovisual feature extraction—and was used in a static, indoor setting rather than on a mobile platform.

4 Survey Summary, Analysis, and Future Research

We reviewed a range of MMP applications and their supporting data acquisition, processing, and fusion methods, seeking techniques that can be used by robots and intelligent agents to perceive humans in HSEs. Here, we identify trends and challenges in the surveyed research, provide guidelines and resources for the design and use of MMP systems for HRI in HSEs, and suggest future research areas.

4.1 Trends in Current Multimodal HRI Perception Systems

Surveyed systems spanned a wide range of applications and use-cases, which we summarize here. In each case, MMP in HSEs can detect nuanced human communication or provide more robust perception by leveraging mutual information across multiple data streams. We categorize patterns within the surveyed research applications in Table 8, noting characteristics, advantages, disadvantages, and representative examples.
Table 8.
MMP Application | Typical Design Characteristics | Advantages | Disadvantages | Examples
HRI research testbed (mobile) | Mobile base; uses graph or late fusion; operates in dynamic public environment, possibly for long duration | Captures relevant interaction phenomena in representative environment | May not detect subtle interaction cues for all participants | Detecting interruptibility, engagement, or social context [11, 16, 18, 117, 164] and long-term identification [72]
HRI research testbed (static) | Fixed or static indoor base; can use cloud resources and networked sensors; supervised classification of multimodal features | Precise human interaction data and cues; enables controlled development of HRI perception systems | May not capture relevant interaction phenomena in representative environment | Activity, emotion, engagement, identity, initiative, or personality trait estimation via games with tabletop or standing robots [14, 20, 21, 27, 34, 35, 41, 76, 126, 127, 143, 168]
Human-aware navigation capability development | Mobile base; uses graph or late fusion; operates in dynamic public environment | Enables human-aware planning and navigation for robots in HSEs | May not detect subtle interaction cues for all participants | Human localization, tracking, and following [12, 28, 97, 176]
Multimodal human cue recognition | Uses labeled datasets; neural network fusion or supervised classification | Captures subtle multimodal interaction cues and stimuli by finding mutual information | Primarily demonstrated on offline datasets | Activity, emotion, gesture, or intent recognition [74, 95, 134, 150, 178]
Multimodal command interface for robot or intelligent system | Multiple synchronized sensors in controlled workspace; can use cloud resources; late fusion | Enables more intuitive or convenient control of complex systems | Interaction limited to task-specific actions | Hands-free or intuitive control of assistive robots, collaborative robots, or UAVs [1, 2, 75, 83, 101, 186, 187]
Robust MMP | Single sensor with multiple data streams, or multiple synchronized sensors; neural network fusion or supervised classification | Can be robust to occlusions and single-modality limitations | Primarily demonstrated on offline datasets | Person detection or pose recognition using fused RGB-D or RGB-radar systems [29, 48, 98]
Service robot (mobile) | Mobile base; can be augmented with environmental sensors; uses graph or late fusion (state machine or action manager) | Could alleviate labor shortfalls; can create more engaging interaction experience | May not detect subtle interaction cues for all participants; interaction limited to task-specific actions | Airport, campus, elderly care, mall, and museum robots [45, 46, 51, 52, 80, 102, 105, 106, 123, 154]
Service robot (static) | Fixed or static indoor base; can use cloud resources and networked sensors; uses graph or late fusion (state machine or action manager) | Could alleviate labor shortfalls; can create more engaging interaction experience; precise human interaction data and cues | Interaction limited to task-specific actions; integration can be complex | Bartending, concierge, or receptionist robots [20, 21, 44, 47, 86, 99, 120, 152, 180–182]
Smart spaces | Multiple networked sensors in controlled space, optionally with robots; uses cloud resources; hybrid fusion | Precise interaction information for humans in the space | Integration can be complex | Smart classrooms and playspaces for improved child education or child–robot interaction [36, 38, 42, 87, 156]
Table 8. Summary of MMP System Trends in HSEs
Service robots were heavily represented in the surveyed works; these projects are largely motivated by demographic shifts in their respective nations of origin (such as aging populations), shortages of specialized workers, and the associated economic impacts. Elderly care robots [43, 83, 123, 186, 187], educational or engagement robots and smart spaces for children (including those with ASD) [38, 42, 87], reception, concierge, and airport/museum guide robots [45, 46, 52, 80, 86, 99, 116, 152, 154], and manual service tasks like bartending [44, 47] were common use-cases. In general, these projects had a few specialized tasks and needed to infer the needs or status of humans in HSEs, often leveraging cloud resources to perform specific interaction and perception functions.
Robust perception was also common in surveyed works. These projects developed and evaluated critical functions necessary for natural, long-term HRI in HSEs by leveraging cues from different modalities. This includes longstanding research areas like human detection and localization using visual, audio, and laser range modalities [54, 73, 109, 114], with specific projects using multiple modalities to localize and track multiple humans in dense HSEs [97, 98, 176]. Multimodal interruptibility, interest, and engagement detection was another common research focus, with consistent contributions throughout the timeframe of the survey [11, 17, 18, 27, 34, 76, 102, 105, 106, 121, 124, 164]. Less common—and generally more recent—MMP research has addressed activity detection [74, 100, 103, 131], emotion recognition [95, 125, 126], and identification/personalization for long-term or recurring interactions [14, 35, 52, 57, 72, 108]. More recent contributions leveraged trained ML models to fuse separate modalities and generate a fused estimate that is robust to occlusions and the limitations of individual modes [29, 48].
Lastly, another trend in surveyed MMP HRI applications involved the use of MMP interfaces to provide intuitive human inputs to a robotic system. This includes multimodal speech-gestural command of UAVs as in [1, 2], commanding a robotic scrub nurse [75], commanding a collaborative robot [101], and identifying specific objects to train a robot’s object detection system [153, 167]. This demonstrates that MMP interfaces can be used to intuitively command robotic systems in human–robot collaborative tasks, to provide flexibility when one communication modality is unavailable (e.g., a surgeon can command verbally while performing a manual task that prevents gesturing), or to intuitively train a robot’s reinforcement learning system, thereby leveraging the insights of a human teammate to improve robot performance.

4.2 Challenges and Limitations for Multimodal HRI Perception

HRI has many grand challenges, and recounting them all would be beyond the scope of this survey. However, our survey found specific challenges for MMP researchers and developers in HSEs, which we highlight here. Resources for addressing these issues are presented in Section 4.3.

4.2.1 HRI Applications Are Highly Specialized.

While HRI in HSEs has many common characteristics, specific applications and environments can vary significantly from one another. A robot’s tasks and perceptual requirements are often highly application-specific, and an algorithm or component that is ideal for one setting may not work well in another. This has significant implications for the design, evaluation, and usage of MMP systems.
Ideally, HRI researchers should evaluate several candidate components for use in HSEs. Many surveyed works [11, 17, 18, 44, 47, 97, 120] evaluated various fusion techniques or ML classifiers, for example, and selected the best overall method for their application. Likewise, a ready-made solution may not exist for a specific perception or fusion function, meaning that researchers may need to implement their own custom solutions. Some surveyed works developed custom SSL [49], hand pose tracking [75], head pose estimation [80, 154], or facial recognition algorithms [12], for example, despite the open-source availability of versions of these algorithms, and Yumak et al. developed a custom Integrated Interaction Platform [180, 181, 182]. This is especially evident in earlier surveyed works that pre-date open-source middleware and modern ML frameworks.
Because HRI applications in HSEs vary so widely, it is difficult to provide a solid set of community guidelines or standards for MMP system design; indeed, the HRI community does not appear to have coalesced around a specific set of tools or a common framework for HRI development and integration. These factors culminate in increased development time and effort for multimodal systems compared to unimodal perception systems.

4.2.2 Delay between Unimodal Perception Development and Integration in Multimodal Systems.

Related to Section 4.2.1, the added overhead of multimodal system design, evaluation, and integration means that state-of-the-art ML and unimodal perception capabilities may take several years before they are used in an MMP system in an HSE. This posed a challenge when conducting this survey: many surveyed fusion and perception functions (especially in the ML domain) could now be accomplished more effectively with more modern techniques. On the other hand, unimodal ML and perception techniques are advancing rapidly, and many methods have not yet been published in a peer-reviewed journal, let alone incorporated into a multimodal system. As we discuss in Sections 4.3 and 4.4, readers are encouraged to evaluate novel perception and fusion capabilities—perhaps ones that are not yet published—rather than reproduce systems used in past works.

4.3 Design Guidelines and Developer Resources for MMP in HSEs

Here, we provide recommended design guidelines based on trends in the surveyed research. We also provide a summary of available hardware and software resources to aid MMP system developers and HRI researchers.

4.3.1 Design Guidelines.

As mentioned in Section 4.2, it is difficult to define a single set of design guidelines for MMP systems in HSEs. However, Table 8 can provide a rough starting point for a system design based on the application. Additionally, we encourage developers to consider the following factors when designing and implementing an MMP system for use in HSEs:
Modularity: The ability to implement and evaluate multiple components and algorithms allows developers to quickly find a suitable system design, as in [11, 17, 18, 44, 47, 97, 120].
Internet connectivity: The ability to use cloud-based services can potentially reduce development time for interaction functions like speech recognition, natural language processing, soft biometric estimation, or emotion recognition, as in [20, 21, 35, 45, 46, 95, 99, 102, 105, 106, 116, 121, 122, 126, 127, 143, 152].
External sensors: Certain HSEs like classrooms, hospitals, shops, homes, service environments, and collaborative workspaces can be outfitted with sensors. These sensors can be networked with the robot to provide additional data and increase overall sensor coverage, as in [36, 44, 47, 51, 83, 156, 187].
Safety, robustness, and real-time considerations: Mobile robots that interact with humans in the wild mostly used established late fusion techniques and comparatively simple detectors. Novel detectors and fusion frameworks were primarily used on datasets and static robots.
Evaluation method: In surveyed works, developers evaluated their perception systems using datasets, longitudinal studies, grand challenges, user studies, and demonstrations. Each method differs in the resources required, study reproducibility, fidelity, and relevance to the HRI application; developers should consider these factors when selecting an evaluation method.
Existing resources: The use of existing hardware platforms and software datasets, frameworks, and SDKs can reduce development time and generally increases the commonality and reproducibility of the project [65]. Example resources are listed in Section 4.3.2.

4.3.2 MMP System Development Resources.

Here, we summarize software and hardware resources discussed in the survey and introduce additional resources for MMP system development.
State-of-the-art HRI perception techniques are continually advancing, and many were outside the scope of this survey because they were in pre-print or unpublished status, or because they have not yet been incorporated into an MMP system. In addition to the datasets discussed in Section 3.8, the following websites provide resources for training novel HRI perception models:
PapersWithCode15 provides labeled datasets, pre-trained ML models, and associated publications for many relevant HRI functions. State-of-the-art models are continually published for activity recognition, speech recognition, face recognition, pose estimation, person re-identification, emotion recognition, gesture recognition, gaze estimation, speech emotion classification, facial expression classification, age estimation, and hand pose estimation.
HuggingFace16 features recent tutorials and models for multimodal feature extraction, VAD, and ASR using Transformer models in the PyTorch framework.
Integration of MMP systems poses another challenge, and the following software frameworks may be relevant for some HRI applications:
Platform for Situated Intelligence (PSI)17: Debuted by Microsoft Research in 2019 [6, 7] and written in the C# programming language, PSI includes specialized data processing techniques to align, reshape, and synchronize heterogeneous data streams and to track latency for operation on real-time embedded systems. Developers compose components—device drivers and wrappers for algorithms and ML models—into a runtime instance. PSI includes tutorials for VAD, audio energy computation, and skeletal pose recognition.
ROS4HRI18 aims to create a set of common HRI functions and data types in ROS. ROS4HRI defines a ROS standard for HRI (ROS Enhancement Proposal 15519), implements HRI-specific message types,20 and provides baseline HRI functions like skeletal pose tracking, facial recognition, data fusion to match faces to bodies, and human reidentification in ROS. REP-155 was completed in early 2022; at the time of writing, ROS4HRI’s impact and community adoption are not yet discernible, and it is still under active development.
Human And Robot Modular and OpeN Interactions (HARMONI)21 [147] is another modular, ROS-based HRI framework. At the time of writing, HARMONI supports speech recognition via DeepSpeech and face detection using dlib. Users can compose custom interaction flows consisting of event-driven loops and sequences.
Several surveyed works used social robot platforms for research and development. Additionally, our initial literature search found many publications that—while they did not meet final inclusion criteria for the survey—used common robot hardware platforms for HRI research, and therefore may be of interest for MMP HRI system development. Some are commercially available development platforms with APIs, while others offer case studies in robot system development or were used to conduct HRI studies.
Child–robot interaction: Kaspar [169]; Luka reading robot22 [184]; Mio Amico [41]
Elderly/home care: GARMI mobile manipulator [155]; Hobbit mobile manipulator [43]
Service robots: BRILLO bartending robot [137]; Sacarino concierge robot [122]; TOOMAS shopping assistant robot [37]
Social robots: iSocioBot [149], IVO [90], Kejia [32, 102], Quori [146], and Pepper23 [17, 46, 72, 105, 106] mobile social robot platforms; iCub [14, 57] and NAO24 [38, 72, 133] humanoid robots
Conversational robots: Furhat tabletop robot25 [3, 4, 39, 40, 50, 79, 121]; Haru tabletop robot [55, 56, 81, 115, 127, 162, 163]; Mini tabletop elderly care and engagement robot [129, 138]; Nadine humanoid robot [151].

4.4 Future Research Areas

Multimodal HRI in HSEs has been an active area of robotics research for over two decades [60], yet it is still a rapidly evolving field. In light of the trends and challenges discussed above, here we discuss promising research areas in multimodal HRI.

4.4.1 Integration and Deployment of Novel Perception Techniques.

A robot’s ability to sense and interact with humans is a subset of its overall perception capability, and robot perception has advanced considerably in the past several years. These advances are largely driven by new software frameworks, ML architectures, and improvements in commercially available hardware. For example, the Transformer architecture [161] has seen widespread usage in ML tasks since its inception in 2017. Specifically, Transformer components like encoders, decoders, and attention mechanisms can be used to perform time-series fusion and classification tasks that are crucial to multimodal HRI perception. Many unimodal datasets and pre-trained Transformer models exist for baseline HRI functions like gesture recognition, speech recognition, emotion recognition, and VAD.
Although we found several examples of Transformers used for multimodal HRI within the past 3 years [74, 95, 100], this research has primarily been performed on labeled datasets. As such, we believe there are many more opportunities to use Transformer-based approaches to accomplish the multimodal HRI functions mentioned in this survey, specifically deictic/referring gesture fusion, multimodal emotion classification, intent recognition, and asynchronous/sparse data fusion. Additionally, deploying these models within multimodal embedded systems in HSEs appears to be a relatively unexplored research area. An evaluation of Transformer perception and fusion model efficacy in HSEs could further guide multimodal HRI research.
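As a minimal sketch of what such a Transformer-based fusion component might look like, the PyTorch module below lets audio tokens attend to visual tokens via cross-modal attention and classifies the pooled result; the dimensions, number of classes, and random inputs are assumptions for illustration, not a surveyed architecture.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal cross-modal attention block: audio tokens attend to visual
    tokens, and the pooled result feeds a small classification head."""
    def __init__(self, dim=64, heads=4, num_classes=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_tokens, visual_tokens):
        fused, _ = self.attn(query=audio_tokens,
                             key=visual_tokens,
                             value=visual_tokens)
        return self.head(fused.mean(dim=1))    # pool over time, then classify

# Toy batch: 2 sequences with 10 audio frames and 16 visual frames of dim 64.
audio = torch.randn(2, 10, 64)
visual = torch.randn(2, 16, 64)
print(CrossModalFusion()(audio, visual).shape)  # torch.Size([2, 5])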

4.4.2 Flexible Interaction and Perception Architectures.

In the same way that interaction policies can be personalized to improve HRI outcomes [14, 17, 40, 52, 57, 123], we believe that HRI perception systems can be similarly adjusted to fit the current interaction needs. This is partially motivated by surveyed works that used state machine architectures to trigger actions based on perceptual stimuli [34, 35, 75, 105, 106, 121, 122] but could be extended to other types of robots operating in HSEs. In particular, a state machine or behavior tree architecture could use cues about the social context, human engagement, and human emotion to select which perception resources and modalities to use, and when. In the same way that a human does not deeply examine the facial expressions of every person they pass on the street, a robot could selectively choose when and how to perceive humans by employing methods found in this survey. For example, if no humans are present, a robot could simply perform human detection using vision and sound event detection. While operating in crowded, public HSEs, human tracking and social navigation using vision and LiDAR would likely be sufficient. If the robot detects that a person wishes to engage, resource-intensive HRI functions like facial recognition, emotion recognition, and speech recognition could be activated as appropriate. This has the added benefit of computational resource management [34], since many perception methods—especially those that use large models for real-time inference—may consume a disproportionate share of onboard processing capability on mobile robots and embedded systems.
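One lightweight way to realize this kind of context-dependent gating is a simple lookup from the inferred interaction state to the set of active perception modules; the states and module names below are hypothetical and chosen only to illustrate the idea.

# Hypothetical mapping from social context to the perception modules worth running.
PERCEPTION_PROFILES = {
    "no_humans": {"person_detection", "sound_event_detection"},
    "crowded_public_space": {"person_detection", "lidar_tracking", "social_navigation"},
    "engaged": {"face_recognition", "emotion_recognition",
                "speech_recognition", "gaze_estimation"},
}

def select_modules(context: str) -> set:
    """Return the perception modules to activate for the current social context,
    falling back to basic person detection when the context is unknown."""
    return PERCEPTION_PROFILES.get(context, {"person_detection"})

print(select_modules("crowded_public_space"))
print(select_modules("engaged"))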

4.4.3 Perception Systems for Long-Term Interaction.

Many surveyed works employed MMP systems in support of long-term, personalized interaction [14, 52, 57, 72]. This typically involved perceiving soft biometrics and performing face and voice identification for open-set identification, then fusing these observations with Bayesian or graph fusion techniques. One study [52] also recorded metadata about each person, such as when, how long, and how often they visited a particular location. These systems required specialized data storage, learning, and inference/retrieval mechanisms, which we did not cover in this survey. However, understanding what type of HRI data to record, and how to store, process, and retrieve it, is crucial for long-term interaction, and we believe this is an area worthy of further investigation. The references included in this survey can provide a starting point for further long-term HRI perception system development. While there are surveys of long-term learning for neural networks [119], robots [94, 141], and robot perception and manipulation [84], we found no surveys of perception systems or cognitive architectures specific to long-term HRI.

4.4.4 Integrated HRI Perception and Planning.

In HSEs, interaction is situated within a broader environmental context, and the ideal course of action is tightly coupled with the current environmental state [22]. Architectures that can fuse multimodal stimuli to simultaneously perceive the status of nearby humans and plan a suitable action may be especially desirable for HRI in HSEs; however, we found only two examples in the survey: the POMDP dialog manager of Lu et al. for the Kejia robot and the POMDP collaborative planner used in the SPENCER project [80, 154]. Both employ POMDP graphs on mobile robots in public environments to fuse multimodal inputs and select a suitable mode of interaction.
Additionally, the Workshop on Integrated Perception, Planning, and Control for Physically and Contextually Aware Robot Autonomy26 addresses this topic more generally. Contributions in 2023 address multimodal semantic segmentation and robust perception, techniques that could be employed for MMP in HRI.

4.4.5 New Sensory Modalities.

The surveyed research in perceptual interfaces is largely dominated by audiovisual systems [10, 12, 20, 21, 28, 36, 49, 139, 156, 180, 181, 182], but introducing new sensor modalities could increase the accuracy or robustness of perceptual interfaces while offering new semantic information.
Although 3D LiDAR detection has seen widespread use in autonomous vehicle perception, it was relatively uncommon for HRI perception; Yan et al.’s 3D LiDAR human tracker was the only example in this survey [176]. 3D LiDAR detection [174] could provide valuable complementary information to MMP HRI systems beyond existing 2D LiDAR leg detectors. Namely, a 3D system could detect humans even when they are not standing and would reduce false positives from furniture legs [175]. Several pre-trained 3D LiDAR human detectors are available, as mentioned in Section 4.3, and could be implemented on a robot platform with 3D LiDAR.
Thermal cameras and mmWave radar also offer promising sensor modalities to complement HRI but were relatively uncommon in surveyed works. Mainly, these modes were used in applications where vision may be degraded [29, 103, 131], but they offer many functional benefits: a human’s thermal signature can help segment human and non-human clusters [114] and indicate emotional state [41], while mmWave radar can penetrate visual obscurants like fog, smoke, and precipitation and operate in low-light conditions. Using thermal or radar data in a multimodal human perception system could be especially useful for segmenting humans in cluttered social scenes, potentially improving robustness or accuracy even when vision is not degraded.

4.4.6 Adopting Best Practices and Community Standards.

Finally, adopting community standards could reduce duplication of effort and lower the barrier to entry for HRI researchers. HRI in HSEs can often be unstructured and difficult to control, so reproducibility can also be a concern for HRI experiments and systems. Gunes et al. [65] provide useful guidelines for conducting HRI experiments and for HRI system design. They recommend that HRI researchers be explicit about the SDKs, middleware, and datasets used, and note that the use of common frameworks aids reproducibility, commonality, and development for HRI systems and experiments.
At the time of writing, ROS4HRI appears to be the most concentrated effort toward creating an HRI-specific community developer standard and includes a protocol and ROS message types. Implementation tutorials are under active development. Developers are invited to integrate additional hardware and software components into ROS4HRI in accordance with the REP-155 standard.27 Currently, ROS4HRI features facial detection, skeletal pose recognition, and face-to-body fusion via assignment, which is a subset of the many techniques discussed in this survey. Incorporating more perceptual and fusion techniques into the ROS4HRI framework would enable more rapid development and evaluation of HRI perception systems. Similarly, PSI’s website28 allows external collaborators to develop components for the PSI framework in accordance with the software’s API.

4.5 Concluding Remarks

As we explored in this article, MMP systems have the potential to offer more robust, flexible, and complete information for HRI compared to unimodal perception systems. The sensory acquisition, processing, and fusion techniques surveyed here are potentially relevant to a broad range of robotics and autonomous systems—including, but not limited to, surgical robots, teleoperated systems, collaborative robots, and autonomous vehicles. However, we primarily focused on MMP methods that could be employed on robot platforms in dynamic HSEs. Although HRI in HSEs remains a substantial challenge, novel tools, techniques, and standards offer a promising path forward.

Acknowledgments

The authors would like to acknowledge Christian Claudel, Joydeep Biswas, Ann Majewicz-Fey, and James Sulzer of The University of Texas at Austin for their guidance of this line of research. Additionally, John Duncan would like to thank Emmanuel Akita, Christina Petlowany, Christopher Suarez, and Srinath Tankasala of the Nuclear and Applied Robotics Group for their assistance in preparing this survey.

Footnotes

12
For a survey of model-based fusion techniques outside of robotics, we recommend Tadas Baltrusaitis et al.’s “Multimodal Machine Learning: A Survey and Taxonomy.”
14
Note that the authors use the term “Early Fusion,” but the audio and visual sensor modalities are independently processed before unimodal localization results are fused.

References

[1]
Ayodeji Opeyemi Abioye, Stephen D. Prior, Peter Saddington, and Sarvapali D. Ramchurn. 2022. The performance and cognitive workload analysis of a multimodal speech and visual gesture (mSVG) UAV control interface. Robotics and Autonomous Systems 147 (Jan. 2022), 103915. DOI:
[2]
Ayodeji O. Abioye, Stephen D. Prior, Glyn T. Thomas, Peter Saddington, and Sarvapali D. Ramchurn. 2018. The multimodal speech and visual gesture (mSVG) control model for a practical patrol, search, and rescue aerobot. In Towards Autonomous Robotic Systems. Manuel Giuliani, Tareq Assaf, and Maria Elena Giannaccini (Eds.), Lecture Notes in Computer Science, Vol. 10965, Springer International Publishing, Cham, 423–437. DOI:
[3]
Samer Al Moubayed, Jonas Beskow, and Gabriel Skantze. 2014. Spontaneous spoken dialogues with the furhat human-like robot head. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 326–326. DOI:
[4]
Samer Al Moubayed, Jonas Beskow, Gabriel Skantze, and Björn Granström. 2012. Furhat: A back-projected human-like robot head for multiparty human-machine interaction. In Proceedings of the Cognitive Behavioural Systems: COST 2102 International Training School, Revised Selected Papers. Springer, Berlin, Heidelberg, 114–130.
[5]
Xavier Alameda-Pineda, Jordi Sanchez-Riera, Johannes Wienke, Vojtech Franc, Jan Cech, Kaustubh Kulkarni, Antoine Deleforge, and Radu Horaud. 2012. RAVEL: An annotated corpus for training robots with audiovisual abilities. Journal on Multimodal User Interfaces 7 (Mar. 2012), 79–91. DOI:
[6]
Sean Andrist and Dan Bohus. 2020. Accelerating the development of multimodal, integrative-AI systems with platform for situated intelligence. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence for Human-Robot Interaction: Trust & Explainability in Artificial Intelligence for Human-Robot Interaction. Retrieved from https://www.microsoft.com/en-us/research/publication/accelerating-the-development-of-multimodal-integrative-ai-systems-with-platform-for-situated-intelligence/
[7]
Sean Andrist, Dan Bohus, and Ashley Feniello. 2019. Demonstrating a framework for rapid development of physically situated interactive systems. In Proceedings of the 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 668–668. DOI:
[8]
Pablo Azagra, Florian Golemo, Yoan Mollard, Manuel Lopes, Javier Civera, and Ana C. Murillo. 2017. A multimodal dataset for object model learning from natural human-robot interaction. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 6134–6141. DOI:
[9]
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (Feb. 2019), 423–443. DOI:
[10]
Yutong Ban, Xiaofei Li, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. 2018. Accounting for room acoustics in audio-visual multispeaker tracking. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ, 6553–6557. DOI:
[11]
Siddhartha Banerjee, Andrew Silva, and Sonia Chernova. 2018. Robot classification of human interruptibility and a study of its effects. ACM Transactions on Human-Robot Interaction 7, 2 (Jul. 2018), 1–35. DOI:
[12]
Baris Bayram and Gökhan Ince. 2015. Audio-visual multi-person tracking for active robot perception. In Proceedings of the 2015 IEEE/SICE International Symposium on System Integration (SII). IEEE, Piscataway, NJ, 575–580. DOI:
[13]
Michal Bednarek, Piotr Kicki, and Krzysztof Walas. 2020. On robustness of multi-modal fusion—Robotics perspective. Electronics 9, 7 (Jul. 2020), 1152. DOI:
[14]
Giulia Belgiovine, Jonas Gonzlez-Billandon, Alessandra Sciutti, Giulio Sandini, and Francesco Rea. 2022. HRI framework for continual learning in face recognition. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 8226–8233. DOI:
[15]
Tony Belpaeme, Paul E. Baxter, Robin Read, Rachel Wood, Heriberto Cuayáhuitl, Bernd Kiefer, Stefania Racioppa, Ivana Kruijff-Korbayová, Georgios Athanasopoulos, Valentin Enescu, Rosemarijn Looije, Mark Neerincx, Yiannis Demiris, Raquel Ros-Espinoza, Aryel Beck, Lola Cañamero, Antione Hiolle, Matthew Lewis, Ilaria Baroni, Marco Nalin, Piero Cosi, Giulio Paci, Fabio Tesser, Giacomo Sommavilla, and Remi Humbert. 2013. Multimodal child-robot interaction: Building social bonds. Journal of Human-Robot Interaction 1, 2 (Jan. 2013), 33–53. DOI:
[16]
Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. 2017. UE-HRI: A new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, New York, NY, 464–472. DOI:
[17]
Atef Ben-Youssef, Giovanna Varni, Slim Essid, and Chloé Clavel. 2019. On-the-fly detection of user engagement decrease in spontaneous human–robot interaction using recurrent and deep neural networks. International Journal of Social Robotics 11, 5 (Dec. 2019), 815–828. DOI:
[18]
Wafa Benkaouar and Dominique Vaufreydaz. 2012. Multi-sensors engagement detection with a robot companion in a home environment. In Proceedings of the Workshop on Assistance and Service Robotics in a Human Environment at IEEE International Conference on Intelligent Robots and Systems (IROS ’12), 45–52.
[19]
Chiara Bodei, Linda Brodo, and Roberto Bruni. 2013. Open multiparty interaction. In Recent Trends in Algebraic Development Techniques. Narciso Martí-Oliet and Miguel Palomino (Eds.). Springer, Berlin, 1–23.
[20]
Dan Bohus and Eric Horvitz. 2009. Dialog in the open world: Platform and applications. In Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI ’09). ACM, New York, NY, 31. DOI:
[21]
Dan Bohus and Eric Horvitz. 2010. Facilitating multiparty dialog with gaze, gesture, and speech. In Proceedings of the International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction (ICMI-MLMI ’10). ACM, New York, NY, 1. DOI:
[22]
Dan Bohus, Ece Kamar, and Eric Horvitz. 2012. Towards situated collaboration. In Proceedings of the NAACL Workshop on Future Directions and Challenges in Spoken Dialog Systems: Tools and Data. Retrieved from https://www.microsoft.com/en-us/research/publication/towards-situated-collaboration/
[23]
Qin Cai, David Gallup, Cha Zhang, and Zhengyou Zhang. 2010. 3D deformable face tracking with a commodity depth camera. In Proceedings of the Computer Vision – ECCV 2010. Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer, Berlin, 229–242.
[24]
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 1 (2019), 172–186.
[25]
Federico Castanedo. 2013. A review of data fusion techniques. The Scientific World Journal 2013 (2013), 1–19. DOI:
[26]
Oya Celiktutan, Efstratios Skordos, and Hatice Gunes. 2019. Multimodal human-human-robot interactions (MHHRI) dataset for studying personality and engagement. IEEE Transactions on Affective Computing 10, 4 (2019), 484–497. DOI:
[27]
Crystal Chao and Andrea Thomaz. 2013. Controlling social dynamics with a parametrized model of floor regulation. Journal of Human-Robot Interaction 2, 1 (Mar. 2013), 4–29. DOI:
[28]
Aaron Chau, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii, and Kotaro Funakoshi. 2019. Audio-visual SLAM towards human tracking and human-robot interaction in indoor environments. In Proceedings of the 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, Piscataway, NJ, 1–8. DOI:
[29]
Anjun Chen, Xiangyu Wang, Kun Shi, Shaohao Zhu, Bin Fang, Yingfeng Chen, Jiming Chen, Yuchi Huo, and Qi Ye. 2023. ImmFusion: Robust mmWave-RGB fusion for 3D human body reconstruction in all weather conditions. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 2752–2758. DOI:
[30]
Anjun Chen, Xiangyu Wang, Shaohao Zhu, Yanxu Li, Jiming Chen, and Qi Ye. 2022. mmBody benchmark: 3D body reconstruction dataset and analysis for millimeter wave radar. In Proceedings of the 30th ACM International Conference on Multimedia. ACM, New York, NY, 3501–3510. DOI:
[31]
Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. 2015a. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), 168–172. DOI:
[32]
Yingfeng Chen, Feng Wu, Wei Shuai, Ningyang Wang, Rongya Chen, and Xiaoping Chen. 2015b. KeJia robot–An attractive shopping mall guider. In Social Robotics. Adriana Tapus, Elisabeth André, Jean-Claude Martin, Francois Ferland, and Mehdi Ammi (Eds.), Lecture Notes in Computer Science, Vol. 9388. Springer International Publishing, Cham, 145–154. DOI:
[33]
Wongun Choi, Khuram Shahid, and Silvio Savarese. 2009. What are they doing? Collective activity classification using spatio-temporal relationship among people. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, 1282–1289. DOI:
[34]
Vivian Chu, Kalesha Bullard, and Andrea L. Thomaz. 2014. Multimodal real-time contingency detection for HRI. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Piscataway, NJ, 3327–3332. DOI:
[35]
Nikhil Churamani, Paul Anton, Marc Brügger, Erik Fließwasser, Thomas Hummel, Julius Mayer, Waleed Mustafa, Hwei Geok Ng, Thi Linh Chi Nguyen, Quan Nguyen, Marcus Soll, Sebastian Springenberg, Sascha Griffiths, Stefan Heinrich, Nicolás Navarro-Guerrero, Erik Strahl, Johannes Twiefel, Cornelius Weber, and Stefan Wermter. 2017. The impact of personalization on human-robot interaction in learning scenarios. In Proceedings of the 5th International Conference on Human Agent Interaction. ACM, New York, NY, 171–180. DOI:
[36]
Eleonora D’Arca, Neil M. Robertson, and James R. Hopgood. 2016. Robust indoor speaker recognition in a network of audio and video sensors. Signal Processing 129 (Dec. 2016), 137–149. DOI:
[37]
Nicola Doering, Sandra Poeschl, Horst-Michael Gross, Andreas Bley, Christian Martin, and Hans-Joachim Boehme. 2015. User-centered design and evaluation of a mobile shopping robot. International Journal of Social Robotics 7, 2 (Apr. 2015), 203–225. DOI:
[38]
Niki Efthymiou, Panagiotis P. Filntisis, Petros Koutras, Antigoni Tsiami, Jack Hadfield, Gerasimos Potamianos, and Petros Maragos. 2022. ChildBot: Multi-robot perception and interaction with children. Robotics and Autonomous Systems 150 (Apr. 2022), 103975. DOI:
[39]
Olov Engwall, Ronald Cumbal, José Lopes, Mikael Ljung, and Linnea Maansson. 2022. Identification of low-engaged learners in robot-led second language conversations with adults. ACM Transactions on Human-Robot Interaction 11, 2 (Jun. 2022), 1–33. DOI:
[40]
Olov Engwall, José Lopes, and Anna Åhlund. 2021. Robot interaction styles for conversation practice in second language learning. International Journal of Social Robotics 13, 2 (Apr. 2021), 251–276. DOI:
[41]
Chiara Filippini, Edoardo Spadolini, Daniela Cardone, Domenico Bianchi, Maurizio Preziuso, Christian Sciarretta, Valentina Del Cimmuto, Davide Lisciani, and Arcangelo Merla. 2021. Facilitating the child–robot interaction by endowing the robot with the capability of understanding the child engagement: The case of Mio Amico Robot. International Journal of Social Robotics 13, 4 (Jul. 2021), 677–689. DOI:
[42]
Panagiotis Paraskevas Filntisis, Niki Efthymiou, Petros Koutras, Gerasimos Potamianos, and Petros Maragos. 2019. Fusing body posture with facial expressions for joint recognition of affect in child–robot interaction. IEEE Robotics and Automation Letters 4, 4 (Oct. 2019), 4011–4018. DOI:
[43]
David Fischinger, Peter Einramhof, Konstantinos Papoutsakis, Walter Wohlkinger, Peter Mayer, Paul Panek, Stefan Hofmann, Tobias Koertner, Astrid Weiss, Antonis Argyros, and Markus Vincze. 2016. Hobbit, a care robot supporting independent living at home: First prototype and lessons learned. Robotics and Autonomous Systems 75 (Jan. 2016), 60–78. DOI:
[44]
Mary Ellen Foster. 2014. Validating attention classifiers for multi-party human-robot interaction. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction: Workshop on Attention Models in Robotics. ACM, New York, NY.
[45]
Mary Ellen Foster, Rachid Alami, Olli Gestranius, Oliver Lemon, Marketta Niemelä, Jean-Marc Odobez, and Amit Kumar Pandey. 2016. The MuMMER project: Engaging human-robot interaction in real-world public spaces. In Social Robotics. Arvin Agah, John-John Cabibihan, Ayanna M. Howard, Miguel A. Salichs, and Hongsheng He (Eds.), Lecture Notes in Computer Science, Vol. 9979. Springer International Publishing, Cham, 753–763. DOI:
[46]
Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet, Yuanzhouhan Cao, Weipeng He, Angel Martínez-González, Petr Motlicek, Rémy Siegfried, Rachid Alami, Kathleen Belhassein, Guilhem Buisan, Aurélie Clodic, Amandine Mayima, Yoan Sallami, Guillaume Sarthou, Phani-Teja Singamaneni, Jules Waldhart, Alexandre Mazel, Maxime Caniot, Marketta Niemelä, Päivi Heikkilä, Hanna Lammi, Antti Tammela. 2019. Mummer: Socially intelligent human-robot interaction in public spaces. arXiv:1909.06749. Retrieved from https://arxiv.org/pdf/1909.06749
[47]
Mary Ellen Foster, Andre Gaschler, and Manuel Giuliani. 2017. Automatically classifying user engagement for dynamic multi-party human–robot interaction. International Journal of Social Robotics 9, 5 (Nov. 2017), 659–674. DOI:
[48]
Angus Fung, Beno Benhabib, and Goldie Nejat. 2023. Robots autonomously detecting people: A multimodal deep contrastive learning method robust to intraclass variations. IEEE Robotics and Automation Letters 8, 6 (Jun. 2023), 3550–3557. DOI:
[49]
Israel D. Gebru, Sileye Ba, Xiaofei Li, and Radu Horaud. 2018. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 5 (May 2018), 1086–1099. DOI:
[50]
Sarah Gillet, Ronald Cumbal, André Pereira, José Lopes, Olov Engwall, and Iolanda Leite. 2021. Robot gaze can mediate participation imbalance in groups with different skill levels. In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. ACM, Boulder CO, 303–311. DOI:
[51]
Dylan F. Glas, Satoru Satake, Florent Ferreri, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2013. The network robot system: Enabling social human-robot interaction in public spaces. Journal of Human-Robot Interaction 1, 2 (Jan. 2013), 5–32. DOI:
[52]
Dylan F. Glas, Kanae Wada, Masahiro Shiomi, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2017. Personal greetings: Personalizing robot utterances based on novelty of observed behavior. International Journal of Social Robotics 9, 2 (Apr. 2017), 181–198. DOI:
[53]
Matthew Gombolay, Anna Bair, Cindy Huang, and Julie Shah. 2017. Computational design of mixed-initiative human–robot teaming that considers human factors: Situational awareness, workload, and workflow preferences. The International Journal of Robotics Research 36, 5–7 (Jun. 2017), 597–617. DOI:
[54]
Randy Gomez, Levko Ivanchuk, Keisuke Nakamura, Takeshi Mizumoto, and Kazuhiro Nakadai. 2015. Utilizing visual cues in robot audition for sound source discrimination in speech-based human-robot communication. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 4216–4222. DOI:
[55]
Randy Gomez, Alvaro Paez, Yu Fang, Serge Thill, Luis Merino, Eric Nichols, Keisuke Nakamura, and Heike Brock. 2022. Developing the bottom-up attentional system of a social robot. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 7402–7408. DOI:
[56]
Randy Gomez, Deborah Szapiro, Kerl Galindo, and Keisuke Nakamura. 2018. Haru: Hardware design of an experimental tabletop robot assistant. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 233–240. DOI:
[57]
Jonas Gonzalez, Giulia Belgiovine, Alessandra Sciutti, Giulio Sandini, and Francesco Rea. 2021. Towards a cognitive framework for multimodal person recognition in multiparty HRI. In Proceedings of the 9th International Conference on Human-Agent Interaction. ACM, New York, NY, 412–416. DOI:
[58]
Jonas Gonzalez-Billandon, Giulia Belgiovine, Matthew Tata, Alessandra Sciutti, Giulio Sandini, and Francesco Rea. 2021. Self-supervised learning framework for speaker localisation with a humanoid robot. In Proceedings of the 2021 IEEE International Conference on Development and Learning (ICDL). IEEE, Piscataway, NJ, 1–7. DOI:
[59]
Jonas Gonzalez-Billandon, Alessandra Sciutti, Matthew Tata, Giulio Sandini, and Francesco Rea. 2020. Audiovisual cognitive architecture for autonomous learning of face localisation by a Humanoid Robot. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 5979–5985.
[60]
Michael A. Goodrich and Alan C. Schultz. 2007. Human-robot interaction: A survey. Foundations and Trends® in Human-Computer Interaction 1, 3 (2007), 203–275. DOI:
[61]
Francois Grondin and James Glass. 2019. Fast and robust 3-D sound source localization with DSVD-PHAT. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 5352–5357. DOI:
[62]
Francois Grondin, Dominic Létourneau, Cédric Godin, Jean-Samuel Lauzon, Jonathan Vincent, Simon Michaud, Samuel Faucher, and Francois Michaud. 2021. ODAS: Open embedded audition system. (Mar. 2021). Retrieved from https://www.frontiersin.org/articles/10.3389/frobt.2022.854444/full
[63]
Francois Grondin and Francois Michaud. 2016. Noise mask for TDOA sound source localization of speech on mobile robots in noisy environments. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 4530–4535. DOI:
[64]
Francois Grondin and Francois Michaud. 2018. Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations. (Nov. 2018). DOI:
[65]
Hatice Gunes, Frank Broz, Chris S. Crawford, Astrid Rosenthal-von Der Pütten, Megan Strait, and Laurel Riek. 2022. Reproducibility in human-robot interaction: Furthering the science of HRI. Current Robotics Reports 3, 4 (Oct. 2022), 281–292. DOI:
[66]
Raoul Harel, Zerrin Yumak, and Frank Dignum. 2018. Towards a generic framework for multi-party dialogue with virtual humans. In Proceedings of the 31st International Conference on Computer Animation and Social Agents (CASA ’18). ACM, New York, NY, 1–6. DOI:
[67]
Kotaro Hoshiba, Osamu Sugiyama, Akihide Nagamine, Ryosuke Kojima, Makoto Kumon, Kazuhiro Nakadai. 2017. Design and assessment of sound source localization system with a UAV-embedded microphone array. Journal of Robotics and Mechatronics 29, 1 (Feb. 2017), 154–167. DOI:
[68]
Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, and Jianguo Zhang. 2015. Jointly learning heterogeneous features for RGB-D activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5344–5352.
[69]
Jwu-Sheng Hu, Chen-Yu Chan, Cheng-Kang Wang, Ming-Tang Lee, and Ching-Yi Kuo. 2011. Simultaneous localization of a mobile robot and multiple sound sources using a microphone array. Advanced Robotics 25, 1–2 (Jan. 2011), 135–152. DOI:
[70]
Ruihan Hu, Songbing Zhou, Zhi Ri Tang, Sheng Chang, Qijun Huang, Yisen Liu, Wei Han, and Edmond Q. Wu. 2021. DMMAN: A two-stage audio–visual fusion framework for sound separation and event localization. Neural Networks 133 (Jan. 2021), 229–239. DOI:
[71]
Bahar Irfan, Natalia Lyubova, Michael Garcia Ortiz, and Tony Belpaeme. 2018. Multi-modal open-set person identification in HRI. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction Social Robots in the Wild Workshop. ACM. Retrieved from http://socialrobotsinthewild.org/wp-content/uploads/2018/02/HRI-SRW_2018_paper_6.pdf
[72]
Bahar Irfan, Michael Garcia Ortiz, Natalia Lyubova, and Tony Belpaeme. 2022. Multi-modal open world user identification. ACM Transactions on Human-Robot Interaction 11, 1 (Mar. 2022), 1–50. DOI:
[73]
Carlos T. Ishi, Jani Even, and Norihiro Hagita. 2015. Speech activity detection and face orientation estimation using multiple microphone arrays and human position information. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 5574–5579. DOI:
[74]
Md Mofijul Islam and Tariq Iqbal. 2020. HAMLET: A hierarchical multimodal attention-based human activity recognition algorithm. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 10285–10292. DOI:
[75]
Mithun G. Jacob, Yu-Ting Li, and Juan P. Wachs. 2013. Surgical instrument handling and retrieval in the operating room with a multimodal robotic assistant. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation. IEEE, Piscataway, NJ, 2140–2145. DOI:
[76]
Shomik Jain, Balasubramanian Thiagarajan, Zhonghao Shi, Caitlyn Clabaugh, and Maja J. Matarić. 2020. Modeling engagement in long-term, in-home socially assistive robot interventions for children with autism spectrum disorders. Science Robotics 5, 39 (Feb. 2020), eaaz3791. DOI:
[77]
Jinhyeok Jang, Dohyung Kim, Cheonshu Park, Minsu Jang, Jaeyeon Lee, and Jaehong Kim. 2020. ETRI-activity3D: A large-scale RGB-D dataset for robots to recognize daily activities of the elderly. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 10990–10997. DOI:
[78]
Shu Jiang and Ronald C. Arkin. 2015. Mixed-initiative human-robot interaction: Definition, taxonomy, and survey. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, Piscataway, NJ, 954–961. DOI:
[79]
Martin Johansson, Gabriel Skantze, and Joakim Gustafson. 2013. Head pose patterns in multiparty human-robot team-building interactions. In Social Robotics. David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Guido Herrmann, Martin J. Pearson, Alexander Lenz, Paul Bremner, Adam Spiers, and Ute Leonards (Eds.), Lecture Notes in Computer Science, Vol. 8239. Springer International Publishing, Cham, 351–360. DOI:
[80]
Michiel Joosse and Vanessa Evers. 2017. A guide robot at the airport: First impressions. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 149–150. DOI:
[81]
Swapna Joshi, Sawyer Collins, Waki Kamino, Randy Gomez, and Selma Šabanović. 2020. Social robots for socio-physical distancing. In Social Robotics. Alan R. Wagner, David Feil-Seifer, Kerstin S. Haring, Silvia Rossi, Thomas Williams, Hongsheng He, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 12483. Springer International Publishing, Cham, 440–452. DOI:
[82]
Malte Jung and Pamela Hinds. 2018. Robots in the wild: A time for more robust theories of human-robot interaction. ACM Transactions on Human-Robot Interaction 7 (May 2018), 1–5. DOI:
[83]
Nikolaos Kardaris, Isidoros Rodomagoulakis, Vassilis Pitsikalis, Antonis Arvanitakis, and Petros Maragos. 2016. A platform for building new human-computer interface systems that support online automatic recognition of audio-gestural commands. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, New York, NY, 1169–1173. DOI:
[84]
S. Hamidreza Kasaei, Jorik Melsen, Floris van Beers, Christiaan Steenkist, and Klemen Voncina. 2021. The state of lifelong learning in service robots: Current bottlenecks in object perception and manipulation. Journal of Intelligent & Robotic Systems 103 (2021), 1–31.
[85]
Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, and Jaehong Kim. 2021. AIR-Act2Act: Human–human interaction dataset for teaching non-verbal social behaviors to robots. The International Journal of Robotics Research 40, 4–5 (2021), 691–697.
[86]
Thomas Kollar, Anu Vedantham, Corey Sobel, Cory Chang, Vittorio Perera, and Manuela Veloso. 2012. A multi-modal approach for natural human-robot interaction. In Social Robotics. David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Shuzhi Sam Ge, Oussama Khatib, John-John Cabibihan, Reid Simmons, and Mary-Anne Williams (Eds.), Lecture Notes in Computer Science, Vol. 7621. Springer, Berlin, 458–467. DOI:
[87]
Tsuyoshi Komatsubara, Masahiro Shiomi, Thomas Kaczmarek, Takayuki Kanda, and Hiroshi Ishiguro. 2019. Estimating children’s social status through their interaction activities in classrooms with a social robot. International Journal of Social Robotics 11, 1 (Jan. 2019), 35–48. DOI:
[88]
David Kortenkamp, R. Peter Bonasso, Dan Ryan, and Debbie Schreckenghost. 1997. Traded control with autonomous robots as mixed initiative interaction. In Proceedings of the AAAI Symposium on Mixed Initiative Interaction, Vol. 97, 89–94.
[89]
Arkadiusz Kwasigroch, Agnieszka Mikolajczyk, and Michal Grochowski. 2017. Deep neural networks approach to skin lesions classification—A comparative analysis. In Proceedings of the 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR). IEEE, Piscataway, NJ, 1069–1074. DOI:
[90]
Javier Laplaza, Nicolas Rodriguez, J. E. Dominguez-Vidal, Fernando Herrero, Sergi Hernandez, Alejandro Lopez, Alberto Sanfeliu, and Anais Garrell. 2022. IVO robot: A new social robot for human-robot collaboration. In Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, Piscataway, NJ, 860–864. DOI:
[91]
Ivan Laptev. 2005. On space-time interest points. International Journal of Computer Vision 64, 2 (Sept. 2005), 107–123. DOI:
[92]
Séverin Lemaignan, Charlotte E. R. Edmunds, Emmanuel Senft, and Tony Belpaeme. 2018. The PInSoRo dataset: Supporting the data-driven study of child-child and child-robot social dynamics. PLOS ONE 13, 10 (Oct. 2018), 1–19. DOI:
[93]
Séverin Lemaignan, Mathieu Warnier, E. Akin Sisbot, Aurélie Clodic, and Rachid Alami. 2017. Artificial cognition for social human–robot interaction: An implementation. Artificial Intelligence 247 (Jun. 2017), 45–69. DOI:
[94]
Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia Díaz-Rodríguez. 2020. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion 58 (2020), 52–68.
[95]
Yuanchao Li, Tianyu Zhao, and Xun Shen. 2020. Attention-based multimodal fusion for estimating human emotion in real-world HRI. In Proceedings of the Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 340–342. DOI:
[96]
Rainer Lienhart, Alexander Kuranov, and Vadim Pisarevsky. 2003. Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In Pattern Recognition. Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, Bernd Michaelis, and Gerald Krell (Eds.), Lecture Notes in Computer Science, Vol. 2781. Springer, Berlin, 297–304. DOI:
[97]
Timm Linder, Stefan Breuers, Bastian Leibe, and Kai O. Arras. 2016. On multi-modal people tracking from mobile platforms in very crowded and dynamic environments. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 5512–5519. DOI:
[98]
Timm Linder, Kilian Y. Pfeiffer, Narunas Vaskevicius, Robert Schirmer, and Kai O. Arras. 2020. Accurate detection and 3D localization of humans using a novel YOLO-based RGB-D fusion approach and synthetic training data. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 1000–1006. DOI:
[99]
Jeroen Linssen and Mariët Theune. 2017. R3D3: The rolling receptionist robot with double Dutch dialogue. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 189–190. DOI:
[100]
Guiyu Liu, Jiuchao Qian, Fei Wen, Xiaoguang Zhu, Rendong Ying, and Peilin Liu. 2019. Action recognition based on 3D skeleton and RGB frame fusion. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 258–264. DOI:
[101]
Hongyi Liu, Tongtong Fang, Tianyu Zhou, and Lihui Wang. 2018. Towards robust human-robot collaborative manufacturing: Multimodal fusion. IEEE Access 6 (2018), 74762–74771. DOI:
[102]
Dongcai Lu, Shiqi Zhang, Peter Stone, and Xiaoping Chen. 2017. Leveraging commonsense reasoning and multimodal perception for robot spoken dialog systems. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 6582–6588. DOI:
[103]
Lyujian Lu, Hua Wang, Brian Reily, and Hao Zhang. 2021. Robust real-time group activity recognition of robot teams. IEEE Robotics and Automation Letters 6, 2 (Apr. 2021), 2052–2059. DOI:
[104]
Yaxiong Ma, Yixue Hao, Min Chen, Jincai Chen, Ping Lu, and Andrej Kosir. 2019. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Information Fusion 46 (Mar. 2019), 184–192. DOI:
[105]
Umberto Maniscalco, Aniello Minutolo, Pietro Storniolo, and Massimo Esposito. 2024. Towards a more anthropomorphic interaction with robots in museum settings: An experimental study. Robotics and Autonomous Systems 171 (Jan. 2024), 104561. DOI:
[106]
Umberto Maniscalco, Pietro Storniolo, and Antonio Messina. 2022. Bidirectional multi-modal signs of checking human-robot engagement and interaction. International Journal of Social Robotics 14, 5 (Jul. 2022), 1295–1309. DOI:
[107]
Mirko Marras, Pedro A. Marín-Reyes, José Javier Lorenzo Navarro, Modesto Fernando Castrillón Santana, and Gianni Fenu. 2019. AveRobot: An audio-visual dataset for people re-identification and verification in human-robot interaction. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM ’19). 255–265. DOI:
[108]
Eric Martinson, Wallace Lawson, and J. Gregory Trafton. 2013. Identifying people with soft-biometrics at fleet week. In Proceedings of the 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, Piscataway, NJ, 49–56. DOI:
[109]
E. Martinson and V. Yalla. 2016. Augmenting deep convolutional neural networks with depth-based layered detection for human detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 1073–1078. DOI:
[110]
Youssef Mohamed and Severin Lemaignan. 2021. ROS for human-robot interaction. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 3020–3027. DOI:
[111]
Jesús Morales, Ricardo Vázquez-Martín, Anthony Mandow, David Morilla-Cabello, and Alfonso García-Cerezo. 2021. The UMA-SAR Dataset: Multimodal data collection from a ground vehicle during outdoor disaster response training exercises. The International Journal of Robotics Research 40, 6–7 (Jun. 2021), 835–847. DOI:
[112]
Kazuhiro Nakadai, Gökhan Ince, Keisuke Nakamura, and Hirofumi Nakajima. 2012. Robot audition for dynamic environments. In Proceedings of the 2012 IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC ’12). IEEE, Piscataway, NJ, 125–130. DOI:
[113]
Kazuhiro Nakadai, Hiroshi G. Okuno, Hirofumi Nakajima, Yuji Hasegawa, and Hiroshi Tsujino. 2008. An open source software system for robot audition HARK and its evaluation. In Proceedings of the Humanoids 2008 - 8th IEEE-RAS International Conference on Humanoid Robots. IEEE, Piscataway, NJ, 561–566. DOI:
[114]
Keisuke Nakamura, Kazuhiro Nakadai, Futoshi Asano, and Gökhan Ince. 2011. Intelligent sound source localization and its application to multimodal human tracking. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Piscataway, NJ, 143–148. DOI:
[115]
Eric Nichols, Sarah Rose Siskind, Waki Kamino, Selma Šabanović, and Randy Gomez. 2021. Iterative design of an emotive voice for the tabletop robot Haru. In Social Robotics. Haizhou Li, Shuzhi Sam Ge, Yan Wu, Agnieszka Wykowska, Hongsheng He, Xiaorui Liu, Dongyu Li, and Jairo Perez-Osorio (Eds.), Lecture Notes in Computer Science, Vol. 13086. Springer International Publishing, Cham, 362–374. DOI:
[116]
Matthias Nieuwenhuisen and Sven Behnke. 2013. Human-like interaction skills for the mobile communication robot Robotinho. International Journal of Social Robotics 5, 4 (Nov. 2013), 549–561. DOI:
[117]
Aastha Nigam and Laurel D. Riek. 2015. Social context perception for mobile robots. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 3621–3627. DOI:
[118]
Timo Ojala, Matti Pietikäinen, and David Harwood. 1996. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29, 1 (1996), 51–59. DOI:
[119]
German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113 (2019), 54–71.
[120]
Maria Pateraki, Markos Sigalas, Georgios Chliveros, and Panos Trahanias. 2013. Visual human-robot communication in social settings. In Proceedings of ICRA Workshop on Semantics, Identification and Control of Robot-Human-Environment Interaction.
[121]
Andre Pereira, Catharine Oertel, Leonor Fermoselle, Joe Mendelson, and Joakim Gustafson. 2019. Responsive joint attention in human-robot interaction. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 1080–1087. DOI:
[122]
Roberto Pinillos, Samuel Marcos, Raul Feliz, Eduardo Zalama, and Jaime Gómez-García-Bermejo. 2016. Long-term assessment of a service robot in a hotel environment. Robotics and Autonomous Systems 79 (May 2016), 40–57. DOI:
[123]
David Portugal, Paulo Alvito, Eleni Christodoulou, George Samaras, and Jorge Dias. 2019. A study on the deployment of a service robot in an elderly care center. International Journal of Social Robotics 11, 2 (Apr. 2019), 317–341. DOI:
[124]
Shokoofeh Pourmehr, Jack Thomas, Jake Bruce, Jens Wawerla, and Richard Vaughan. 2017. Robust sensor fusion for finding HRI partners in a crowd. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3272–3278. DOI:
[125]
José Augusto Prado, Carlos Simplício, Nicolás F. Lori, and Jorge Dias. 2012. Visuo-auditory multimodal emotional structure to improve human-robot-interaction. International Journal of Social Robotics 4, 1 (Jan. 2012), 29–51. DOI:
[126]
John Páez and Enrique González. 2022. Human-robot scaffolding: An architecture to foster problem-solving skills. ACM Transactions on Human-Robot Interaction 11, 3 (Sept. 2022), 1–17. DOI:
[127]
Ricardo Ragel, Rafael Rey, Álvaro Páez, Javier Ponce, Keisuke Nakamura, Fernando Caballero, Luis Merino, and Randy Gómez. 2022. Multi-modal data fusion for people perception in the social robot Haru. In Social Robotics. Filippo Cavallo, John-John Cabibihan, Laura Fiorini, Alessandra Sorrentino, Hongsheng He, Xiaorui Liu, Yoshio Matsumoto, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 13817. Springer Nature Switzerland, Cham, 174–187. DOI:
[128]
Dhanesh Ramachandram and Graham W. Taylor. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34, 6 (Nov. 2017), 96–108. DOI:
[129]
Arnaud Ramey, Javier F. Gorostiza, and Miguel A. Salichs. 2012. A social robot as an aloud reader: Putting together recognition and synthesis of voice and gestures for HRI experimentation. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction, 213–214.
[130]
Caleb Rascon and Ivan Meza. 2017. Localization of sound sources in robotics: A review. Robotics and Autonomous Systems 96 (Oct. 2017), 184–210. DOI:
[131]
Brian Reily, Peng Gao, Fei Han, Hua Wang, and Hao Zhang. 2022. Real-time recognition of team behaviors by multisensory graph-embedded robot learning. The International Journal of Robotics Research 41, 8 (Jul. 2022), 798–811. DOI:
[132]
Laurel D. Riek. 2013. The social co-robotics problem space: Six key challenges. In Proceedings of the Robotics: Science, and Systems (RSS), Robotics Challenges and Visions. 13–16.
[133]
Adam Robaczewski, Julie Bouchard, Kevin Bouchard, and Sébastien Gaboury. 2021. Socially assistive robots: The specific case of the NAO. International Journal of Social Robotics 13, 4 (Jul. 2021), 795–831. DOI:
[134]
Fraser Robinson and Goldie Nejat. 2023. A deep learning human activity recognition framework for socially assistive robots to support reablement of older adults. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 6160–6167. DOI:
[135]
Nicole Robinson, Brendan Tidd, Dylan Campbell, Dana Kulić, and Peter Corke. 2023. Robotic vision for human-robot interaction and collaboration: A survey and systematic review. ACM Transactions on Human-Robot Interaction 12, 1 (Mar. 2023), 1–66. DOI:
[136]
Isidoros Rodomagoulakis, Nikolaos Kardaris, Vassilis Pitsikalis, Effrosyni Mavroudi, Athanasios Katsamanis, Antigoni Tsiami, and Petros Maragos. 2016. Multimodal human action recognition in assistive human-robot interaction. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ, 2702–2706.
[137]
Alessandra Rossi, Mariacarla Staffa, Antonio Origlia, Maria di Maro, and Silvia Rossi. 2021. BRILLO: A robotic architecture for personalised long-lasting interactions in a bartending domain. In Proceedings of the Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 426–429. DOI:
[138]
Miguel A. Salichs, Álvaro Castro-González, Esther Salichs, Enrique Fernández-Rodicio, Marcos Maroto-Gómez, Juan José Gamboa-Montero, Sara Marques-Villarroya, José Carlos Castillo, Fernando Alonso-Martín, and Maria Malfaz. 2020. Mini: A new social robot for the elderly. International Journal of Social Robotics 12, 6 (Dec. 2020), 1231–1249. DOI:
[139]
Jordi Sanchez-Riera, Xavier Alameda-Pineda, and Radu Horaud. 2012. Audio-visual robot command recognition: D-META’12 grand challenge. In Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI ’12). ACM, New York, NY, 371. DOI:
[140]
Yoko Sasaki, Ryo Tanabe, and Hiroshi Takemura. 2018. Online spatial sound perception using microphone array on mobile robot. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 2478–2484. DOI:
[141]
Khadija Shaheen, Muhammad Abdullah Hanif, Osman Hasan, and Muhammad Shafique. 2022. Continual learning for real-world autonomous systems: Algorithms, challenges and frameworks. Journal of Intelligent & Robotic Systems 105, 1 (2022), 9.
[142]
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. 2016. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1010–1019.
[143]
Zhihao Shen, Armagan Elibol, and Nak Young Chong. 2021. Multi-modal feature fusion for better understanding of human personality traits in social human–robot interaction. Robotics and Autonomous Systems 146 (Dec. 2021), 103874. DOI:
[144]
Shreyas S. Shivakumar, Neil Rodrigues, Alex Zhou, Ian D. Miller, Vijay Kumar, and Camillo J. Taylor. 2020. PST900: RGB-thermal calibration, dataset and segmentation network. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 9441–9447. DOI:
[145]
Nikhita Singh, Jin Joo Lee, Ishaan Grover, and Cynthia Breazeal. 2018. P2PSTORY: Dataset of children as storytellers and listeners in peer-to-peer interactions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI’18). ACM, New York, NY, 1–11. DOI:
[146]
Andrew Specian, Ross Mead, Simon Kim, Maja Mataric, and Mark Yim. 2022. Quori: A community-informed design of a socially interactive humanoid robot. IEEE Transactions on Robotics 38, 3 (Jun. 2022), 1755–1772. DOI:
[147]
Micol Spitale, Chris Birmingham, R. Michael Swan, and Maja J. Mataric. 2021. Composing HARMONI: An open-source tool for human and robot modular OpeN interaction. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3322–3329. DOI:
[148]
Ryu Takeda. 2017. Noise-robust MUSIC-based sound source localization using steering vector transformation for small humanoids. Journal of Robotics and Mechatronics 29, 1 (Feb. 2017), 26–36. DOI:
[149]
Zheng-Hua Tan, Nicolai Bæk Thomsen, Xiaodong Duan, Evgenios Vlachos, Sven Ewan Shepstone, Morten Hojfeldt Rasmussen, and Jesper Lisby Højvang. 2018. iSocioBot: A multimodal interactive social robot. International Journal of Social Robotics 10, 1 (Jan. 2018), 5–19. DOI:
[150]
Matteo Terreran, Leonardo Barcellona, and Stefano Ghidoni. 2023. A general skeleton-based action and gesture recognition framework for human–robot collaboration. Robotics and Autonomous Systems 170 (Dec. 2023), 104523. DOI:
[151]
Nadia Magnenat Thalmann, Nidhi Mishra, and Gauri Tulsulkar. 2021. Nadine the social robot: Three case studies in everyday life. In Social Robotics. Haizhou Li, Shuzhi Sam Ge, Yan Wu, Agnieszka Wykowska, Hongsheng He, Xiaorui Liu, Dongyu Li, and Jairo Perez-Osorio (Eds.), Lecture Notes in Computer Science, Vol. 13086. Springer International Publishing, Cham, 107–116. DOI:
[152]
Mariët Theune, Daan Wiltenburg, Max Bode, and Jeroen Linssen. 2017. R3D3 in the wild: Using a robot for turn management in multi-party interaction with a virtual human. In Proceedings of the IVA Workshop on Interaction with Agents and Robots: Different Embodiments, Common Challenges.
[153]
Susanne Trick, Franziska Herbert, Constantin A. Rothkopf, and Dorothea Koert. 2022. Interactive reinforcement learning with Bayesian fusion of multimodal advice. IEEE Robotics and Automation Letters 7, 3 (Jul. 2022), 7558–7565. DOI:
[154]
Rudolph Triebel, Kai Arras, Rachid Alami, Lucas Beyer, Stefan Breuers, Raja Chatila, Mohamed Chetouani, Daniel Cremers, Vanessa Evers, Michelangelo Fiore, Hayley Hung, Omar A. Islas Ramírez, Michiel Joosse, Harmish Khambhaita, Tomasz Kucner, Bastian Leibe, Achim J. Lilienthal, Timm Linder, Manja Lohse, Martin Magnusson, Billy Okal, Luigi Palmieri, Umer Rafi, Marieke van Rooij, and Lu Zhang. 2016. SPENCER: A socially aware service robot for passenger guidance and help in busy airports. In Field and Service Robotics. David S. Wettergreen and Timothy D. Barfoot (Eds.), Lecture Notes in Computer Science, Vol. 113. Springer International Publishing, Cham, 607–622. DOI:
[155]
Mario Trobinger, Christoph Jahne, Zheng Qu, Jean Elsner, Anton Reindl, Sebastian Getz, Thore Goll, Benjamin Loinger, Tamara Loibl, Christoph Kugler, Carles Calafell, Mohamadreza Sabaghian, Tobias Ende, Daniel Wahrmann, Sven Parusel, Simon Haddadin, and Sami Haddadin. 2021. Introducing GARMI - A service robotics platform to support the elderly at home: Design philosophy, system overview and first results. IEEE Robotics and Automation Letters 6, 3 (Jul. 2021), 5857–5864. DOI:
[156]
Antigoni Tsiami, Panagiotis Paraskevas Filntisis, Niki Efthymiou, Petros Koutras, Gerasimos Potamianos, and Petros Maragos. 2018. Far-field audio-visual scene perception of multi-party human-robot interaction for children and adults. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Piscataway, NJ, 6568–6572. DOI:
[157]
Matthew Turk. 2001. Perceptual user interfaces. In Frontiers of Human-Centered Computing, Online Communities and Virtual Environments. Rae A. Earnshaw, Richard A. Guedj, Andries van Dam, and John A. Vince (Eds.). Springer, London, 39–51. DOI:
[158]
Nguyen Tan Viet Tuyen, Alexandra L. Georgescu, Irene Di Giulio, and Oya Celiktutan. 2023. A multimodal dataset for robot learning to imitate social human-human interaction. In Proceedings of the Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 238–242. DOI:
[159]
Jean-Marc Valin, Shun’ichi Yamamoto, Jean Rouat, Francois Michaud, Kazuhiro Nakadai, and Hiroshi G. Okuno. 2007. Robust recognition of simultaneous speech by a mobile robot. IEEE Transactions on Robotics 23, 4 (Aug. 2007), 742–752. DOI:
[160]
Michel Valstar, Björn W. Schuller, Jarek Krajewski, Roddy Cowie, and Maja Pantic. 2014. AVEC 2014: The 4th international audio/visual emotion challenge and workshop. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, New York, NY, 1243–1244. DOI:
[161]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc., Red Hook, NY. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[162]
Yurii Vasylkiv, Heike Brock, Yu Fang, Eric Nichols, Keisuke Nakamura, Serge Thill, and Randy Gomez. 2020. An exploration of simple reactive responses for conveying aliveness using the Haru robot. In Social Robotics. Alan R. Wagner, David Feil-Seifer, Kerstin S. Haring, Silvia Rossi, Thomas Williams, Hongsheng He, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 12483. Springer International Publishing, Cham, 108–119. DOI:
[163]
Yurii Vasylkiv, Ricardo Ragel, Javier Ponce-Chulani, Luis Merino, Eleanor Sandry, Heike Brock, Keisuke Nakamura, Pourang Irani, and Randy Gomez. 2021. Design and development of a teleoperation system for affective tabletop robot Haru. In Social Robotics. Haizhou Li, Shuzhi Sam Ge, Yan Wu, Agnieszka Wykowska, Hongsheng He, Xiaorui Liu, Dongyu Li, and Jairo Perez-Osorio (Eds.), Lecture Notes in Computer Science, Vol. 13086. Springer International Publishing, Cham, 564–573. DOI:
[164]
Dominique Vaufreydaz, Wafa Johal, and Claudine Combe. 2016. Starting engagement detection towards a companion robot using multimodal features. Robotics and Autonomous Systems 75 (Jan. 2016), 4–16. DOI:
[165]
Paul Viola and Michael J. Jones. 2004. Robust real-time face detection. International Journal of Computer Vision 57, 2 (May 2004), 137–154. DOI:
[166]
Xiang-Yang Wang, Jun-Feng Wu, and Hong-Ying Yang. 2010. Robust image retrieval based on color histogram of local feature regions. Multimedia Tools and Applications 49, 2 (Aug. 2010), 323–345. DOI:
[167]
David Whitney, Miles Eldon, John Oberlin, and Stefanie Tellex. 2016. Interpreting multimodal referring expressions in real time. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3331–3338. DOI:
[168]
Jason R. Wilson, Phyo Thuta Aung, and Isabelle Boucher. 2022. When to help? A multimodal architecture for recognizing when a user needs help from a social robot. In Social Robotics. Filippo Cavallo, John-John Cabibihan, Laura Fiorini, Alessandra Sorrentino, Hongsheng He, Xiaorui Liu, Yoshio Matsumoto, and Shuzhi Sam Ge (Eds.), Lecture Notes in Computer Science, Vol. 13817. Springer Nature Switzerland, Cham, 253–266. DOI:
[169]
Luke J. Wood, Abolfazl Zaraki, Ben Robins, and Kerstin Dautenhahn. 2021. Developing Kaspar: A humanoid robot for children with autism. International Journal of Social Robotics 13, 3 (Jun. 2021), 491–508. DOI:
[170]
Kai Wu, Shu Ting Goh, and Andy W. H. Khong. 2013. Speaker localization and tracking in the presence of sound interference by exploiting speech harmonicity. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 365–369. DOI:
[171]
Jorge Wuth, Pedro Correa, Tomás Núñez, Matías Saavedra, and Néstor Becerra Yoma. 2021. The role of speech technology in user perception and context acquisition in HRI. International Journal of Social Robotics 13, 5 (Aug. 2021), 949–968. DOI:
[172]
Lu Xia, Chia-Chih Chen, and Jake K. Aggarwal. 2012. View invariant human action recognition using histograms of 3D joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 20–27. DOI:
[173]
Haibin Yan, Marcelo H. Ang, and Aun Neow Poo. 2014. A survey on perception methods for human–robot interaction in social robots. International Journal of Social Robotics 6, 1 (Jan. 2014), 85–119. DOI:
[174]
Zhi Yan, Tom Duckett, and Nicola Bellotto. 2017. Online learning for human classification in 3D LiDAR-based tracking. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 864–871. DOI:
[175]
Zhi Yan, Tom Duckett, and Nicola Bellotto. 2020. Online learning for 3D LiDAR-based human detection: experimental analysis of point cloud clustering and classification methods. Autonomous Robots 44, 2 (Jan. 2020), 147–164. DOI:
[176]
Zhi Yan, Li Sun, Tom Duckett, and Nicola Bellotto. 2018. Multisensor online transfer learning for 3D LiDAR-based human detection with a mobile robot. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Piscataway, NJ, 7635–7640. DOI:
[177]
Guang-Zhong Yang, Jim Bellingham, Pierre E. Dupont, Peer Fischer, Luciano Floridi, Robert Full, Neil Jacobstein, Vijay Kumar, Marcia McNutt, Robert Merrifield, Bradley J. Nelson, Brian Scassellati, Mariarosaria Taddeo, Russell Taylor, Manuela Veloso, Zhong Lin Wang, and Robert Wood. 2018. The grand challenges of Science Robotics. Science Robotics 3, 14 (Jan. 2018), eaar7650. DOI:
[178]
Mohammad Samin Yasar, Md Mofijul Islam, and Tariq Iqbal. 2023. IMPRINT: Interactional dynamics-aware motion prediction in teams using multimodal context. ACM Transactions on Human-Robot Interaction (Oct. 2023), 3626954. DOI:
[179]
Karim Youssef, Katsutoshi Itoyama, and Kazuyoshi Yoshii. 2017. Simultaneous identification and localization of still and mobile speakers based on binaural robot audition. Journal of Robotics and Mechatronics 29, 1 (Feb. 2017), 59–71. DOI:
[180]
Zerrin Yumak and Nadia Magnenat-Thalmann. 2016. Multimodal and multi-party social interactions. In Context Aware Human-Robot and Human-Agent Interaction. Nadia Magnenat-Thalmann, Junsong Yuan, Daniel Thalmann, and Bum-Jae You (Eds.), Human–Computer Interaction Series, Springer International Publishing, Cham, 275–298. DOI:
[181]
Zerrin Yumak, Jianfeng Ren, Nadia Magnenat Thalmann, and Junsong Yuan. 2014a. Modelling multi-party interactions among virtual characters, robots, and humans. Presence: Teleoperators and Virtual Environments 23, 2 (Aug. 2014), 172–190. DOI:
[182]
Zerrin Yumak, Jianfeng Ren, Nadia Magnenat Thalmann, and Junsong Yuan. 2014b. Tracking and fusion for multiparty interaction with a virtual character and a social robot. In Proceedings of the SIGGRAPH Asia 2014 Autonomous Virtual Humans and Social Robot for Telepresence. ACM, New York, NY, 1–7. DOI:
[183]
Brian J. Zhang and Naomi T. Fitter. 2023. Nonverbal sound in human-robot interaction: A systematic review. ACM Transactions on Human-Robot Interaction 12, 4 (Feb. 2023), 3583743. DOI:
[184]
Zhao Zhao and Rhonda McEwen. 2022. “Let’s read a book together”: A long-term study on the usage of pre-school children with their home companion robot. In Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, Piscataway, NJ, 24–32. DOI:
[185]
Xiao-Hu Zhou, Xiao-Liang Xie, Zhen-Qiu Feng, Zeng-Guang Hou, Gui-Bin Bian, Rui-Qi Li, Zhen-Liang Ni, Shi-Qi Liu, and Yan-Jie Zhou. 2020. A multilayer-multimodal fusion architecture for pattern recognition of natural manipulations in percutaneous coronary interventions. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, Piscataway, NJ, 3039–3045. DOI:
[186]
Athanasia Zlatintsi, Alex C. Dometios, Nikolaos Kardaris, Isidoros Rodomagoulakis, Petros Koutras, Xanthi Papageorgiou, Petros Maragos, Constantinos S. Tzafestas, Panagiotis Vartholomeos, Klaus Hauer, Christian Werner, Roberto Annicchiarico, Matteo G. Lombardi, Francesco Adriano, Tarek Asfour, Andrea Maria Sabatini, Claudia Laschi, Marco Cianchetti, Aylin Guler, Ioannis Kokkinos, Benjamin Klein, and Rodrigo López. 2020. I-Support: A robotic platform of an assistive bathing robot for the elderly population. Robotics and Autonomous Systems 126 (Apr. 2020), 103451. DOI:
[187]
Athanasia Zlatintsi, Isidoros Rodomagoulakis, Vassilis Pitsikalis, Petros Koutras, Nikolaos Kardaris, Xanthi Papageorgiou, Costas Tzafestas, and Petros Maragos. 2017. Social human-robot interaction for the elderly: Two real-life use cases. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, 335–336. DOI:
[188]
Mateusz Żarkowski. 2019. Multi-party turn-taking in repeated human–robot interactions: An interdisciplinary evaluation. International Journal of Social Robotics 11, 5 (Dec. 2019), 693–707. DOI:

Cited By

  • (2024) Noncontact perception for assessing pilot mental workload during the approach and landing under various weather conditions. Signal, Image and Video Processing 19, 2. DOI: 10.1007/s11760-024-03619-x. Online publication date: 9-Dec-2024.

Published In

ACM Transactions on Human-Robot Interaction, Volume 13, Issue 4
December 2024, 492 pages
EISSN: 2573-9522
DOI: 10.1145/3613735
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 October 2024
Online AM: 29 April 2024
Accepted: 29 February 2024
Revised: 02 November 2023
Received: 16 May 2022
Published in THRI Volume 13, Issue 4

Author Tags

  1. Human–robot interaction
  2. multimodal perception
  3. situated interaction
  4. social robotics
  5. human social environments

Qualifiers

  • Research-article
