Reference | Application | Sensor | Unimodal processing | Unimodal output | Fusion approach
Abioye 2018 [2] and 2022 [1] | Speech and gesture commanding for simulated aerial robot motion | Monochannel microphone | CMUSphinx automatic speech recognition (ASR) | Recognized speech commands | Rule-based approach to determine whether unimodal commands are sequential or synchronous, and whether information is emphatic or complementary
RGB camera | Haar feature-based cascade classifier hand detection and convex hull finger detection (OpenCV) | Hand gestures
Ban 2018 [10] | Multiple speaker tracking using dataset of various indoor acoustic settings | 6-Microphone array (Audio-Visual Diarization, AVDIAR, dataset) | Direct-path relative transfer function | Speaker location | Bayesian estimator
RGB image data (AVDIAR dataset) | Facial detection | Facial location |
Bayram 2015 [12] | Multiple speaker tracking and following on indoor mobile robot | 7-Microphone array (Microcone) | Generalized eigenvalue decomposition multiple signal classification (GEVD-MUSIC) | Speaker location | Particle filter
2x RGB cameras (Microsoft Kinect) | Haar feature-based cascade classifier facial detection | Facial location |
Belgiovine 2022 [14] and Gonzalez-Billandon 2020 and 2021 [57, 58] | Human localization for recording voice and facial identity data from an iCub robot in a laboratory setting | Stereo camera (embedded in iCub) | Deep network for facial localization | Face location | Multi-human tracking via Kalman filtering, with the Hungarian algorithm for data association
Binaural microphones (embedded in iCub) | Deep network for speaker localization | Coarse speaker location (left, right, or in front)
Bohus 2009 [20] and 2010 [21] | Virtual agent personal assistant and trivia game host for multiparty interaction research in an indoor social environment | RGB camera (AXIS 212) | Face detection and tracking | Human location | Scene analysis module to infer human intentions, engagement, and actions; fusion algorithm details unspecified
Facial pose estimation | Focus of attention |
Clothing RGB variance analysis | Organizational affiliation |
4-Microphone linear array | Windows 7 speech recognizer | Recognized speech
Sound source localization (SSL) | Speaker location
Chao 2013 [27] | Detecting human activity and evaluating interruption behaviors in a mixed-initiative laboratory interaction | RGB-D camera (Microsoft Kinect) | Skeletal pose tracking; coarse gaze estimation; gesture activity detection | Estimate of whether human is gazing at robot; gesture activity | CADENCE (Control Architecture for the Dynamics of Embodied Natural Coordination and Engagement) system; uses timed Petri nets to handle asynchronous human inputs and select actions
Monochannel microphone | Pitch computation (Pure Data module) | Speaker pitch |
Chau 2019 [28] | Simultaneous localization of humans and robot platform | 8-Microphone array (TAMAGO-03) | Generalized singular value decomposition MUSIC (GSVD-MUSIC) and SNR estimation | Speaker direction-of-arrival | Gaussian mixture probability hypothesis density filter
RGB camera | OpenPose keypoint extraction | Multitarget human pose estimates |
Chu 2014 [34] | Determining desire to engage with Curi robot platform in a laboratory setting via contingency detection | 2-Microphone array | Sound cue computation | Sound cue features for two audio channels | Finite state machine to determine when to process body and audio features; support vector machine (SVM) contingency classifier to determine whether the human wishes to engage
RGB-D camera (Asus Xtion Pro Live) | Skeletal pose tracking (OpenNI) and body motion cue computation | Motion cues for seven joints |
Churamani 2017 [35] | Identity and speech recognition for personalized interaction with Nico robot in laboratory educational task | 2x RGB cameras | Haar feature-based cascade classifier facial detection; face recognition using local binary pattern (LBP) histograms | Human identity and face location | Weighted sum of voice and facial identities; dialog manager to control flow of interaction based on identity and speech
Monochannel audio | Speech recognition (Google Cloud speech-to-text); convolutional neural network (CNN) human identification | Human identity and recognized speech
Foster 2014 [44] and 2017 [47], and Pateraki 2013 [120] | Estimating human interaction status/intentions with JAMES bartending robot in notional bartending environment | 4-Microphone array (Microsoft Kinect) | SSL; ASR (Microsoft Speech API); lexical feature extraction (OpenCCG) | Speaker azimuth, recognized speech, and lexical features | Comparison of binary regression, logistic regression, multinomial logistic regression, SVM, k-nearest neighbor (k-NN), decision tree, naïve Bayes, and propositional rule learner to estimate bar patron state
RGB-D camera (Microsoft Kinect) | Hand and face blob tracking; 2D silhouette and 3D shoulder fitting; least-squares RGB image matching | Human head and hand location, 3D torso pose and head pose angles |
2x stereo cameras
Foster 2016 [45] and 2019 [46] | Human localization, identification, intent recognition, and speech recognition on Pepper mall robot in a large public mall for the MuMMER project | 4-Microphone array (embedded in Pepper) | Neural network performing simultaneous speech detection and localization; speech recognition (Google Cloud speech-to-text) | Speaker location and recognized speech | Weighted average of gaze and human distance to estimate if human is interested in interaction
RGB-D camera (Intel RealSense D435) | Convolutional pose machine (CPM) pose recognition; head pose estimation (OpenHeadPose); facial feature computation (OpenFace) | Body pose, head pose, and facial features
Gebru 2018 [49] | Multiple speaker localization and diarization using dataset of various indoor acoustic settings | 6-Microphone array (AVDIAR dataset) | Binaural feature extraction; SSL using trained model | Speaker location | Maximum a posteriori (MAP) Bayesian estimator to detect speakers; nearest-neighbor search to attribute speech to speaker
Voice activity detection (VAD) | Speech activity |
RGB image data (AVDIAR dataset) | Visual tracking | Speaker head and torso locations |
Glas 2013 [51] and 2017 [52] | Sensor network plus mobile Robovie robot in shopping mall to detect shopper identities and locations and issue personalized greetings | 8x 2D LiDAR | Human tracking via particle filtering (ATRacker) | 2D human locations | Geometric model to estimate human height; nearest-neighbor data association to fuse identity with human location
2x RGB cameras | Facial recognition (OKAO Vision) | Human identity
Gomez 2015 [54] | Speech-to-speaker association using Hearbo robot in rooms of varying reverberation | RGB-D camera (Microsoft Kinect v2) | Depth-based human tracking; head position and mouth activity detection | Visual azimuth to speaker; mouth activity | Speaker resolution module that associates the acoustic azimuth with a valid, speaking visual azimuth
16-Microphone array | MUSIC SSL (HARK) | Acoustic azimuth to speaker |
Ishi 2015 [73] | Estimating human location and head orientation in laboratory environment | 2x 2D LiDAR (Hokuyo UTM-30LX) | Particle filter | Human location | Sound source location computed from acoustic azimuth vectors and fused with the LiDAR location estimate when within a threshold; fused location used to compute head orientation vector
2x 16-microphone arrays, 2x 8-microphone arrays | MUSIC SSL | Acoustic azimuths
Jacob 2013 [75] | Command recognition for a surgical scrub nurse assistant robot arm (FANUC LR Mate 200iC) in notional surgical task | RGB-D camera (Microsoft Kinect One) | Custom fingertip locator, Kalman filter smoother, feature extraction, and hidden Markov model (HMM) gesture classifier | Recognized gestures | State machine to perform assistive actions (e.g., pick up and pass specific surgical instruments) or enable/disable command modes
Monochannel microphone | Speech recognition (CMUSphinx) | Recognized speech commands |
Jain 2020 [76] | Detecting user engagement during educational game in a month-long, in-home study | RGB camera (USB webcam) | OpenPose | Number of people in scene | Gradient-boosted decision trees to classify engagement/disengagement
OpenFace | Detected faces, eye gaze direction, head position, and facial expression features
Monochannel audio (USB webcam) | Audio feature extraction (Praat) | Audio pitch, frequency, intensity, and harmonicity |
Kardaris 2016 [83] and Zlatintsi 2017 [187] and 2020 [186] | Verbal and gestural command recognition for elderly care robot evaluated on MOBOT multimodal dataset and I-Support assistive bathing robot | RGB-D camera (Microsoft Kinect) | Dense trajectory feature extraction and SVM classification | Recognized gestures | Ranked selection of best available modality |
Optical flow activity detection (OpenCV) | Activity status |
8-Channel MEMS microphone array | Beamforming denoising; mel-frequency cepstral coefficient (MFCC) and delta feature extraction; N-best grammar-based speech recognition (HTK toolkit) | Recognized speech commands
Kollar 2012 [86] | Human tracking, speech and gesture recognition in an indoor environment on the Roboceptionist and CoBot service robots | RGB-D camera (Microsoft Kinect) | Skeletal pose tracking; vector computations of skeletal keypoints | Recognized gestures and proxemics | Rule-based dialog manager
Android tablet | Android speech recognition; probabilistic graph/naïve Bayes language model | Recognized speech commands |
Komatsubara 2019 [87] | Estimating child social status using a sensor network embedded in a classroom | RGB-D camera (Microsoft Kinect) | Head and shoulders detection | Human location | Nearest-neighbor association fuses identity and location; custom social feature extraction module; SVM with RBF kernel classifies social status
6x RGB camera (Omron) | Facial recognition (OKAO Vision) | Human identity |
Linder 2016 [97] | Development of multiple human tracker for dense human environments, tested on real-world and synthetic datasets | 2x 2D LiDAR (SICK LMS 500) | OpenCV random forest leg tracker | Leg locations | Comparison of nearest-neighbor (NN), extended NN, multi-hypothesis, and minimum description length trackers
RGB-D camera (Asus Xtion Pro) | Comparison of depth template-based, monocular histogram of oriented gradients (HOG), and RGB-D HOG upper-body detectors | Torso locations
Linssen 2017 [99] and Theune 2017 [152] | Development of R3D3 receptionist robot and pilot testing in day care center | 4-Microphone array (Microsoft Kinect) | ASR using Kaldi deep network | Recognized speech | Rule-based dialog and action manager |
RGB-D camera (Microsoft Kinect) | FaceReader software (Vicar Vision) | Emotional state and demographics
Maniscalco 2022 [106] and 2024 [105] | Identifying humans, estimating engagement on Pepper robot guide in campus and museum environments | 4-Microphone array | RMS signal computation; speech recognition (Google Cloud speech-to-text) | Recognized speech | Finite state automaton; specific sensory inputs activate state transitions for engagement and interaction (SMACH)
RGB image | People detection, face recognition, gaze estimation, and age/gender estimation (Pepper SDK) | Human identity, location, gaze direction, and age/gender |
Sonar | FIFO queue to stabilize distance measurement | Distance to human |
Martinson 2013 [108] | Human identification using soft biometrics on Octavia social robot platform in a public interaction study | Depth/time-of-flight camera (Mesa SwissRanger SR4000) | Segmentation via connected components analysis; computation of 3D face position; height estimation via geometric model | Human location and height | Computing similarities of soft biometrics with those of known humans; weighted sum of soft biometric similarities to determine identity
RGB camera (Point Grey FIREFLY) | Facial detection (Pittsburgh Pattern Recognition SDK); color histogram computation of face and clothing | Face and clothing color histograms |
Martinson 2016 [109] | Human detection in indoor office setting | RGB-D camera | AlexNet CNN human detection on RGB image (Caffe) | Human detection likelihood | Weighted sum of CNN and layered person detection likelihoods
AlexNet CNN human detection on depth image (Caffe) | Human detection likelihood
Layered person detection: segmentation, geometric feature computation, and Gaussian mixture model (GMM) classification of depth clusters | Human detection likelihood
Mohamed 2021 [110] | Generating whole person model from robot data and providing generic ROS HRI stack via ROS4HRI project | Audio data | Voice activity detection (VAD), feature extraction (openSMILE, HARK); SSL (HARK) | Audio features, voice activity, and speaker location | Body/face matcher and person manager to fuse face, voice, and body information |
RGB image | Face detection and pose estimation (OpenFace); expression and demographic detection (OpenVINO); face recognition (dlib) | Expression, demographics, and identity |
RGB-D image | Skeletal pose tracking (Kinect, OpenNI, OpenVINO) and human description generator | Human skeletal pose description and recognized gestures |
Nakamura 2011 [114] | Human localization using Hearbo robot in indoor setting | 8-Microphone array | GEVD-MUSIC with hierarchical GMM (hGMM) sound source identification | Speaker direction-of-arrival | Particle filter
Thermal camera (Apiste FSV-1100) | Thermal-distance integrated localization model; binary thermal mask applied to clusters of depth points to localize human heads | Human location (3-dimensional)
Time-of-flight distance camera (Mesa SR4000) |
Nieuwenhuisen 2013 [116] | Human localization and importance detection on Robotinho museum tour guide robot | 2D LiDAR (Hokuyo URG-04LX) | Leg and torso detection | Body locations | Face and body locations fused by Hungarian algorithm data association in a multi-hypothesis tracker; human importance in group estimated by relative distance and angle to robot
2x RGB cameras | Viola-Jones face detector | Face locations |
Directional microphone | Small-vocabulary speech recognition (Loquendo) | Recognized commands |
Paez 2022 [126] | Emotion recognition on Baxter robot for improved robot mentorship in an indoor collaborative play setting | Microphone | Speech recognition (unspecified) | Recognized speech | k-Means emotion classifier over fused body/head movement and facial emotion features
RGB-D camera (Microsoft Kinect) | Body gesture and facial movement recognition (unspecified) | Recognition of 22 body gestures and 8 head movements
RGB-D camera (Intel RealSense SR300) | Facial emotion recognition (Affdex SDK) | Recognized emotion |
Pereira 2019 [121] | Observing human speech and gaze in joint human-robot game with Furhat robot | Monochannel microphone | Speech recognition (Microsoft cloud speech recognition) | Recognized speech | State machine to determine robot gaze target; state transitions triggered by prioritized human inputs |
2x RGB-D cameras | Gaze tracking (GazeSense) | Gaze direction
Object tracking (ARToolkit) | Location of game objects |
Portugal 2019 [123] | Development of elderly care robot SocialRobot and deployment in elderly care center | HD RGB camera (Microsoft LifeCam Studio) | Haar feature-based cascade classifier facial detection; PCA/Eigenface facial recognition | Face locations and identities | Customizable XML-based service engine driven by human inputs
RGB-D camera (Asus Xtion Pro Live) | Depth-augmented face localization | Updated face location |
2-Microphone array (Asus Xtion Pro Live) | Speech recognition (PocketSphinx); emotion and affect recognition (openEAR) | Recognized speech; emotional state
Pourmehr 2017 [124] | Mobile robot locating humans interested in interaction; indoor and outdoor | 2D LiDAR | Leg detector | Human location occupancy grid | Weighted sum of three unimodal occupancy grids |
RGB-D camera (Microsoft Kinect) | Torso detector | Human location occupancy grid |
4-Microphone array (Microsoft Kinect) | MUSIC sound source localization | Speech direction-of-arrival occupancy grid |
Prado 2012 [125] | Emotion recognition in indoor setting; used to synthesize emotional robot behaviors | RGB camera | Haar feature-based cascade classifier facial detection (OpenCV); PCA to extract facial action units (FAUs); dynamic Bayesian network (DBN) classifier | Recognized emotion from face | DBN classifier to fuse face and voice emotions
Microphone | Extract pitch, duration, and volume of utterance (Praat); DBN classifier | Recognized emotion from voice |
Ragel 2022 [127] | Interactive play with children in indoor setting using tabletop Haru robot | 2x RGB cameras | Face detection (face_recognition and OpenCV), facial feature extraction (Microsoft Azure Face API), mask detection (TensorFlow network) | Estimated gender, emotion, and whether wearing facemask | Sequential data fusion pipeline using caching, filtering, and fusion to combine asynchronous skeletal pose, hand, face, and speech data; incrementally updated as new data become available
6-Microphone array (embedded in Haru); 8-microphone array (external) | SSL and EMVDR noise filtering (HARK); speech recognition (Google Cloud speech-to-text) | Recognized speech
RGB-D cameras (Orbbec Astra in Haru; Azure Kinect and Kinect v2 external) | Hand detection (MediaPipe); skeletal pose tracking (Kinect) | Hand and skeletal pose keypoints
Sanchez-Riera 2012 [139] | Human command recognition using multimodal dataset for D-META grand challenge | Binaural audio (RAVEL dataset) | MFCC computation; HMM and SVM to classify MFCCs | Recognized speech commands | Weighted sum of speech and gesture command classifications |
Stereo RGB vision (RAVEL dataset) | Scene flow and spatio-temporal interest point (STIP) feature extraction; HMM and SVM to classify scene flow and STIP features | Recognized gestures
Tan 2018 [149] | Development of iSocioBot social robot platform and deployment in four public events | RGB camera (Logitech HD Pro C920) | Local phase quantization (LPQ) facial recognition | Human identity | Probabilistic hypothesis testing to fuse face and sound source locations; weighted sum to fuse speaker and facial identities
Facial detection (OpenCV) | Face location |
4-Microphone array (Microsoft Kinect) | Unspecified SSL | Speaker direction-of-arrival |
Wireless handheld microphone | Speech recognition (Google Cloud speech-to-text) | Recognized speech |
i-vector framework (ALIZE 3.0) | Speaker identity |
Terreran 2023 [150] | Whole-body gesture and activity recognition | RGB-D camera (RGB channel) | 2D skeletal pose tracking (OpenPose) | 2D skeletal pose keypoints | 2D-to-3D projection and lifting fuses 2D keypoints with the 3D point cloud for 3D pose estimation; ensemble classifier predicts activity
RGB-D camera (depth channel) | None | 3D point cloud
Trick 2022 [153] | Receiving human reinforcement training inputs for a manipulator arm action planner in a laboratory setting | Microphone | MFCC feature extraction and CNN keyword classification (Honk) | Recognized keyword commands | Bayesian independent opinion pool |
RGB-D camera (Intel RealSense D435) | Skeletal pose tracking (OpenPose) and SVM gesture classification (Scikit-learn) | Recognized gestures |
Tsiami 2018 [156] | Simultaneous speaker localization, speech recognition, and gesture recognition for children in an indoor play setting | 3x 4-microphone arrays (Microsoft Kinect) | SRP-PHAT SSL; GMM-HMM speech recognition | Speaker location; recognized speech | Highest average probability for gesture recognition; majority voting for speech recognition; nearest-neighbor association to fuse audio and visual speaker locations
3x RGB-D cameras (Microsoft Kinect) | Bag-of-words and SVM gesture classifier | Recognized gestures
1x RGB-D camera (Microsoft Kinect) | Skeleton tracking | Human location |
Whitney 2016 [167] | Recognizing human object references on a Baxter robot in a laboratory setting | 4-Microphone array (Microsoft Kinect) | Unigram speech model | Recognized word | Bayesian estimator using uni-, bi-, and tri-gram state models to estimate the object the human is referring to
RGB-D camera (Microsoft Kinect) | Skeletal pose recognition (OpenNI) and elbow-wrist vector computation | Object being pointed to |
Yan 2018 [176] | Multi-human tracking to train a 3D LiDAR human detector in indoor public area | RGB-D camera (Asus Xtion Pro Live) | Shoulder and head detection | Torso locations | Multi-hypothesis tracker to fuse leg and torso locations
2D LiDAR (Hokuyo UTM-30LX) | Leg tracker | Leg locations |
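The sketches below illustrate, in simplified form, several techniques that recur across the table; all weights, parameters, file names, and data values are illustrative assumptions rather than the surveyed authors' settings.

Weighted-sum late fusion appears repeatedly (Churamani [35] fuses voice and facial identities, Martinson [108] fuses soft-biometric similarities, Martinson [109] fuses detection likelihoods, Pourmehr [124] fuses occupancy grids, Tan [149] fuses speaker and facial identities). A minimal sketch, assuming each modality emits a score vector over a shared label set:

```python
import numpy as np

def weighted_sum_fusion(scores_by_modality, weights):
    """Late fusion: weighted sum of per-modality score vectors.

    scores_by_modality: dict of modality name -> 1-D array of class
        scores over the same label set; weights: modality -> scalar.
    Returns the fused scores and the index of the winning class.
    """
    fused = sum(weights[m] * np.asarray(s, dtype=float)
                for m, s in scores_by_modality.items())
    fused /= sum(weights[m] for m in scores_by_modality)
    return fused, int(np.argmax(fused))

# Hypothetical identity scores from face and voice classifiers.
fused, winner = weighted_sum_fusion(
    {"face": [0.7, 0.2, 0.1], "voice": [0.4, 0.5, 0.1]},
    {"face": 0.6, "voice": 0.4})
print(fused, winner)   # -> [0.58 0.32 0.10] 0
```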
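Trick [153] fuses speech keyword and gesture classifier outputs with a Bayesian independent opinion pool: the per-modality posteriors are multiplied elementwise and renormalized, which assumes the modalities are conditionally independent given the command. A minimal sketch:

```python
import numpy as np

def independent_opinion_pool(posteriors):
    """Fuse per-modality posteriors by elementwise product + renormalization."""
    fused = np.prod(np.vstack(posteriors), axis=0)
    return fused / fused.sum()

# Hypothetical command posteriors from speech and gesture.
print(independent_opinion_pool([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]))
# -> [0.698 0.279 0.023]: agreement on class 0 is reinforced
```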
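Gated nearest-neighbor association is the simplest fusion step in the table: Glas [51, 52] and Komatsubara [87] attach recognized identities to tracked locations this way, Ishi [73] fuses sound source and LiDAR locations under a distance threshold, and Tsiami [156] matches audio and visual speaker locations. A sketch with an assumed gate distance:

```python
import numpy as np

def nn_associate(observation_xy, track_positions, gate=0.5):
    """Return the index of the nearest track within the gate, else None.

    observation_xy: 2-D position of, e.g., a recognized face;
    track_positions: (N, 2) array of tracked human positions (meters).
    """
    dists = np.linalg.norm(track_positions - np.asarray(observation_xy), axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] < gate else None

tracks = np.array([[1.0, 2.0], [4.0, 0.5]])
print(nn_associate([1.1, 2.2], tracks))   # -> 0 (within 0.5 m gate)
print(nn_associate([9.0, 9.0], tracks))   # -> None (no track in gate)
```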
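Belgiovine [14] tracks multiple humans with Kalman filtering and uses the Hungarian algorithm for data association; Nieuwenhuisen [116] fuses face and body locations the same way inside a multi-hypothesis tracker. A compact sketch with a constant-position model; the gate, process noise, and measurement noise are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, gate=1.0):
    """Hungarian-algorithm association on Euclidean distance, with gating."""
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]

def kalman_update(x, p, z, q=0.01, r=0.1):
    """Constant-position Kalman predict + update for one 2-D track.

    x: position estimate; p: scalar variance; z: associated detection.
    """
    p = p + q                 # predict: variance grows by process noise
    k = p / (p + r)           # Kalman gain (same scalar for both axes)
    x = x + k * (z - x)       # correct toward the detection
    p = (1.0 - k) * p
    return x, p

tracks = np.array([[0.0, 0.0], [5.0, 5.0]])    # current track positions
variances = np.array([0.5, 0.5])
detections = np.array([[0.2, -0.1], [4.8, 5.3]])
for t, d in associate(tracks, detections):
    tracks[t], variances[t] = kalman_update(tracks[t], variances[t],
                                            detections[d])
```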
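Haar feature-based cascade classifiers (OpenCV) are the most common visual front end in the table (face detection in Bayram [12], Churamani [35], Portugal [123], and Prado [125]; hand detection in Abioye [1, 2]). A minimal detection loop using OpenCV's bundled frontal-face model; the detectMultiScale parameters are common defaults, not the surveyed configurations:

```python
import cv2

# OpenCV ships pretrained Haar cascades; cv2.data.haarcascades is their path.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.png")                  # any RGB camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # cascades run on grayscale

# Multi-scale sliding-window detection over the image pyramid.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                 minNeighbors=5, minSize=(30, 30))
for (x, y, w, h) in faces:                       # one box per detected face
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```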
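MFCC front ends feed the speech and speaker models of Kardaris [83] (MFCC and delta features), Sanchez-Riera [139], and Trick [153]. A sketch of the feature extraction with librosa; the 13-coefficient choice is a common convention, not necessarily the papers':

```python
import numpy as np
import librosa

# Load an utterance (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("utterance.wav")

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
delta = librosa.feature.delta(mfcc)                 # first-order deltas
features = np.vstack([mfcc, delta])                 # (26, n_frames) matrix
```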
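MUSIC-family sound source localization recurs across the audio rows (plain MUSIC in Ishi [73] and Pourmehr [124], GEVD-MUSIC in Bayram [12] and Nakamura [114], GSVD-MUSIC in Chau [28]). A narrowband sketch for a single frequency bin and a linear array under far-field assumptions; the surveyed systems use broadband, robot-audition-grade variants (e.g., via HARK):

```python
import numpy as np

def music_spectrum(X, n_sources, mic_positions, freq, c=343.0):
    """Narrowband MUSIC pseudospectrum for a linear microphone array.

    X: (n_mics, n_snapshots) complex STFT snapshots at one frequency bin;
    mic_positions: 1-D mic coordinates along the array axis (meters).
    Peaks of the returned spectrum indicate source azimuths (degrees).
    """
    R = X @ X.conj().T / X.shape[1]          # spatial covariance estimate
    _, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
    En = eigvecs[:, :-n_sources]             # noise subspace
    angles = np.linspace(-90.0, 90.0, 181)
    spectrum = []
    for theta in np.deg2rad(angles):
        delays = mic_positions * np.sin(theta) / c   # far-field plane wave
        a = np.exp(-2j * np.pi * freq * delays)      # steering vector
        spectrum.append(1.0 / np.linalg.norm(En.conj().T @ a) ** 2)
    return angles, np.array(spectrum)
```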
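Whitney [167] computes an elbow-wrist vector from skeletal keypoints to estimate which object the human is pointing to (Kollar [86] derives gestures from similar keypoint vectors). A sketch that scores candidate objects by angular alignment with the pointing ray; all positions are hypothetical:

```python
import numpy as np

def pointed_object(elbow, wrist, objects):
    """Return the object name best aligned with the elbow->wrist ray."""
    ray = np.asarray(wrist, float) - np.asarray(elbow, float)
    ray /= np.linalg.norm(ray)
    best, best_cos = None, -1.0
    for name, pos in objects.items():
        to_obj = np.asarray(pos, float) - np.asarray(wrist, float)
        cos = float(ray @ to_obj) / np.linalg.norm(to_obj)
        if cos > best_cos:                 # smaller angle = better aligned
            best, best_cos = name, cos
    return best

print(pointed_object(elbow=[0.0, 0.0, 1.2], wrist=[0.3, 0.0, 1.2],
                     objects={"cup": [1.0, 0.05, 1.2],
                              "book": [0.0, 1.0, 1.2]}))   # -> cup
```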
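Finally, finite state machines gate much of the multimodal processing above: Chu [34] uses one to decide when to process body and audio features, Jacob [75] switches command modes, Pereira [121] selects gaze targets, and Maniscalco [105, 106] drives engagement with a finite state automaton (SMACH). A toy event-driven automaton with illustrative states and events:

```python
# Transition table: (state, event) -> next state; unknown events are ignored.
TRANSITIONS = {
    ("idle", "human_detected"): "engaging",
    ("engaging", "speech_recognized"): "interacting",
    ("engaging", "human_left"): "idle",
    ("interacting", "human_left"): "idle",
}

def step(state, event):
    """Advance the interaction state machine by one perceptual event."""
    return TRANSITIONS.get((state, event), state)

state = "idle"
for event in ("human_detected", "speech_recognized", "human_left"):
    state = step(state, event)
    print(f"{event} -> {state}")
```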