Prior work on human perceptions of robots in video, simulation, and in-person studies has been largely fragmented across research methodologies. To more comprehensively understand how human perceptions vary between these methodologies, we conducted a
\(2\times 2\) between-subjects study with a mobile robot in a laboratory setting. The two independent factors of our study were
Interaction Environment (Real vs. Simulated environment) and the level of
Interactivity of the research methodology (Interactive participation vs. Video observation). Photos of all experimental conditions are shown in
Figure 1. The difference between Real and Simulated interactions is shown in
Figure 2. To the best of our knowledge, our study, which utilized two navigation tasks, is the first to compare human perceptions of robots obtained in real-world interactions with perceptions obtained from interactive simulations in which humans control a virtual avatar. We further compared both sets of perceptions with perceptions of the robot obtained after viewing a video recording of an interaction. Our study protocol was approved by our Institutional Review Board.
3.1 Hypotheses
As shown in
Figure 1, our two independent variables led to four conditions: Real-Interactive, Real-Video, Sim-Interactive, and Sim-Video. We studied whether these conditions had an effect on four aspects of human perceptions of the robot: Competence [
17], Discomfort [
8], Social Presentation, or “the robot’s ability to appear to be a desirable social partner” [
4], and Social Information Processing, which captures social intelligence [
4]. We also studied the effect of interactivity on perceived workload [
19]. These measures are common in the HRI literature [
18,
30,
33,
47,
57].
Our first set of hypotheses focused on the idea that human perceptions of a mobile robot in the Real environment would differ from perceptions of the robot in the Simulated environment. These hypotheses were motivated by prior work suggesting that people's perceptions of a robot can vary between simulation and real-world interactions (e.g., [
38,
65,
69]). In particular, Tsoi et al. [
65] provided evidence that human perceptions of robots collected via video studies could differ from those collected using interactive, online simulations, but did not compare either to perceptions obtained in real-world HRI. More specifically:
H1. Human perceptions of the robot’s competence (H1a), discomfort (H1b), social presentation (H1c), and social information processing (H1d) in the Real environment will differ from the Simulated environment.
Our second set of hypotheses tested whether human perceptions of a mobile robot differ between a participant who interacts with the robot and a participant who views a video of another person interacting with it. These hypotheses are motivated by the common use of videos in HRI studies and the growing use of interactive simulations as a potential replacement [
56,
65,
71]. Prior work suggests that people may perceive a robot more positively when it is physically present [
37] and that people may be influenced by co-present robots (e.g., [
1,
21]).
H2. Human perceptions of the robot’s competence (H2a), discomfort (H2b), social presentation (H2c), and social information processing (H2d) will differ between interactive conditions (Sim-Interactive and Real-Interactive) and video-based conditions (Sim-Video and Real-Video).
Our third set of hypotheses treated data from the Real-Interactive condition as the gold standard for gathering human perceptions of robots. Because video observations lack the interactivity that interactive simulations retain, we suspected that human perceptions collected in the Sim-Video and Real-Video conditions would be less similar to those obtained in the real world than the perceptions obtained in the Sim-Interactive condition.
H3. Human perceptions of the robot's competence (H3a), discomfort (H3b), social presentation (H3c), and social information processing (H3d) in video-based conditions (Sim-Video and Real-Video) will be more similar to the Sim-Interactive condition than to the Real-Interactive condition.
Our fourth and final hypothesis is motivated by prior work that associates embodied and interactive experiences with lower workload. For example, Wang et al. [
70] found that interacting with embodied robotic agents resulted in lower perceived workload than interacting with voice-only agents. Tsoi et al. [
65] found partial support for lower perceived workload when participants completed an HRI survey that gathered perceptions of a robot through interactive simulation, compared to a survey that gathered perceptions from video observations.
H4. The Interactive conditions will lead to a lower perceived workload by participants than the Video conditions.
3.2 Participants
In total, we recruited 213 participants for our study. For the Real-Interactive condition, participants were recruited via flyers and word of mouth. Participants for all other conditions were recruited online using the Prolific crowdsourcing platform.
All the participants were at least 18 years old, had normal or corrected-to-normal vision, and were fluent in English. The participants in the Real-Interactive condition were required to be able to walk comfortably and stand for the duration of the study (20–30 minutes). Participants in the online portion of the study were limited to those on non-mobile devices, such as laptops and desktop computers, to ensure a reasonable screen size on their device and the ability to control the virtual avatar in simulation using a physical keyboard.
We excluded 53 participants from analyses because 35 participants in an Interactive condition had incomplete video recordings due to technical issues or had incomplete surveys, 14 participants had other technical issues or did not follow directions, and 4 accidentally participated in the Sim-Video condition after participating in the Sim-Interactive condition.
Among the final 160 participants (40 per condition), 90 participants identified as male, 66 as female, 2 as non-binary, 1 as genderqueer, and 1 declined to state their gender. Additionally, 32 participants were between ages 25–34, 50 were between ages 35–44, 40 were between ages 45–54, 23 were between ages 55–64, 13 were between ages 65–74, and 2 were between ages 75–84. On average, the participants indicated neutral familiarity with robots on a 7-point scale (
\(M=3.91,SE=0.13\)). The online participants had an average Internet speed of
\(163.46\) Mbps (
\(SE=15.86\)), which was in line with prior use of SEAN-EP [
65].
3.3 Setup
For the Real-Interactive condition, the experiment was conducted in a laboratory room on a university campus in the United States. The room contained physical obstacles consisting of EverBlock construction blocks, as shown in
Figures 1(a) and
2(a). There were also four distinct pieces of artwork on easel stands positioned in the corners of the room. A close-up photo of one of the pieces of artwork in the real laboratory environment is shown in
Figure 2(b).
We designed our study such that a robot, controlled by the ROS Navigation Stack with Social Cost Layers [
39], autonomously navigated near the participant to jointly complete two tasks: the
Follow Task and the
Art Task. The Follow Task was designed to place the participant’s focus on the robot throughout the interaction. Follow tasks are typical for robots that serve as tour guides and have been investigated in the past in social navigation [
7,
43,
45,
53]. Meanwhile, we designed the Art Task to allow participants to observe the robot's movement during a more dynamic and complex navigation task. These tasks are further described in the next section. Importantly, the robot used in the study was a Pioneer 3-DX, on which we affixed a laptop with its screen facing forward so that the robot could communicate with the participant. We also attached a depth sensor and a localization beacon to the robot.
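As a rough illustration of how such a layer shapes the robot's motion (this generic formulation is illustrative and not the exact configuration used in [39]), a proxemic social cost layer typically adds a person-centered Gaussian cost to the navigation costmap:
\[
c(x, y) = A \exp\left(-\frac{(x - x_p)^2}{2\sigma_x^2} - \frac{(y - y_p)^2}{2\sigma_y^2}\right),
\]
where \((x_p, y_p)\) is a detected person's position, \(A\) is the peak cost amplitude, and the variances \(\sigma_x, \sigma_y\) are commonly enlarged along the person's direction of motion so that the planner keeps extra clearance in front of people.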
The participants in the Real-Interactive condition wore a GoPro camera on their chest (as in
Figure 2(a)) to record videos from a first-person perspective while completing study activities. HTC Vive Trackers were used to localize the robot and the participants. Also, the participants used a custom web application on a provided mobile phone to perform task-specific actions, such as pressing a button to begin each task and recording their answers to survey questions. The web application was also used to display text on the robot's laptop.
For the Sim-Interactive condition, we modeled the laboratory room used for the Real-Interactive condition as well as the Pioneer robot using the Unity game engine and SEAN 2.0 [
66].
Figures 1(b),
1(d),
2(c), and
2(d) illustrate the virtual world that we created for the study. In addition, we used SEAN-EP [
65] to embed our simulation in a Qualtrics web survey, which gathered participants’ demographic data and all other relevant measures regarding their experience of virtual human–robot interactions. The participants used their keyboards to control a virtual avatar in the SEAN simulations and to complete the same activities as in the Real-Interactive condition.
For the Real-Video and Sim-Video conditions, we used recordings of participants’ interactions with the robot in the real-world lab and the virtual re-creation, respectively. A GoPro camera worn by participants in the Real-Interactive condition (as in
Figure 2(a)) was used to record the interactions that were observed by participants in the Real-Video condition. For the Sim-Video condition, we used SEAN 2.0 to save video recordings of the human-robot interactions that occurred in the Sim-Interactive condition. The recordings were made from the perspective of the virtual avatar controlled by a human in SEAN. To ensure that participants in the Video conditions could understand what the robot was communicating, we added captions to all videos displaying the same text that was shown on the robot's laptop screen. We did not use audio in the simulation or the videos due to the difficulty of generating realistic audio. An example of the captions is provided in
Figure 1(c) and (d). The videos were then embedded in a Qualtrics survey like the one used for the Real-Interactive condition.
3.4 Procedure
At the beginning of the study, the participant provided demographic information (as in
Section 3.2). Then, the participant continued on to complete the study’s four phases: (1) Introduction, (2) Follow Task, (3) Art Task, and (4) Closing. In each task, the participant was specifically asked to pay attention to how the robot moved.
Phase 1: Introduction. In the Real-Interactive condition, the participant was introduced to the robot by an experimenter, who told them that they would interact with the robot through a series of tasks. Then, the experimenter assisted the participant in putting on the GoPro chest harness used to record their activities during the study. In the Sim-Interactive condition, the participant completed a walk-through tutorial that showed them the virtual Pioneer robot and their randomly assigned avatar. The walk-through then explained how to navigate the simulated lab. In the Real-Video and Sim-Video conditions, the participant was given text instructions indicating that they would watch videos of a person or avatar interacting with a robot. The participant was also shown an image of the robot to familiarize them with the Pioneer 3-DX platform.
Phase 2: Follow Task. In the Real-Interactive condition, the participant was instructed to move to a specific marker on the floor and then press a button on the mobile device to begin the follow task. Then, the participant followed the robot along a pre-defined path, which was composed of four segments.
The path involved navigating around EverBlock construction blocks placed throughout the room, as shown in
Figure 2(a) and (c).
After following the robot along each of the four path segments, the participant answered survey questions about their impression of the robot. In the Sim-Interactive condition, the participant completed the same task but in a SEAN simulation.
For the Real-Video and Sim-Video conditions, we paired each participant with a study session from the Real-Interactive and Sim-Interactive conditions, respectively. Then, the videos of the Follow Task from those Interactive sessions were shown to the participants in the Video conditions. In this manner, participants in the Real-Video and Sim-Video conditions watched recordings of the task and answered the same survey questions about their impression of the robot as participants in the Interactive conditions.
Phase 3: Art Task. In the Real-Interactive condition, the participant was told that there had been an art heist in the lab, and some of the art had been replaced with fakes. The participant and the robot were tasked with collecting information about the four art pieces in the laboratory to help the experimenters figure out which were real and which were fake.
Figure 2(b) displays one of the art pieces in the real world, and
Figure 2(d) shows it in simulation. For each of the four art pieces, a participant performed the following steps:
(1)
The participant was directed to find the robot.
(2)
Once the person found the robot, a text message displayed on the robot's computer screen instructed them to follow it.
(3)
The robot then led the participant to a piece of artwork.
(4)
The participant was instructed via text on the robot’s computer screen to count the number of a given object shown in the art piece.
(5)
After giving this instruction, the robot moved away to a different location and waited for the participant to complete the object counting.
(6)
The participant provided their answer to the counting request using the mobile device and was directed to find the robot again to repeat the process for the next art piece.
The Art Task was designed so that the person and the robot would engage in more dynamic interactions than in the Follow Task. In particular, while the person was counting objects in an art piece, the robot moved far from the participant and waited until they finished counting. Only when the participant started moving away from the picture did the robot start to move back towards the person. Then, both the robot and the participant moved towards each other and soon thereafter engaged in face-to-face or side-by-side spatial formations (e.g., as in [
25,
74]).
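A minimal sketch of this trigger logic, assuming access to tracked positions of the participant and the art piece (the function, threshold, and variable names below are hypothetical and only illustrate the behavior described above):
\begin{verbatim}
import math

ART_RADIUS = 1.5  # hypothetical radius (m) within which the participant
                  # counts as still being "at" the artwork

def robot_should_return(participant_xy, art_xy, prev_dist):
    """The robot starts moving back toward the participant only once the
    participant begins moving away from the art piece."""
    dist = math.dist(participant_xy, art_xy)
    should_return = dist > prev_dist and dist > ART_RADIUS
    return should_return, dist
\end{verbatim}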
In the Real-Video and Sim-Video conditions, the description of the Art Task was provided in text before the participant began the task.
Also, in the Sim-Interactive condition, the participant used an interface that we implemented in the simulation to record their responses to the robot's counting requests. Meanwhile, in the Video conditions, the participant recorded their answers using the Qualtrics web survey. This survey included videos from the Interactive conditions using the same participant-session pairing explained for the Follow Task.
Phase 4: Closing. Finally, the participant provided their impressions of their perceived workload for the tasks in the study.
In-person participants in the Real-Interactive condition were paid \$15.00 USD per hour, rounded to the nearest 10-minute increment.
Participants in all other conditions completed the study online using Prolific. They were paid \$5.00 USD, as we estimated the online study sessions to take 20 minutes.
3.5 Dependent Measures
We measured two aspects of participants' experience during our study using widely adopted survey measures in HRI: human perceptions of the robot and perceived workload.
Human Perceptions of the Robot. We measured four aspects of human perceptions of the robot: (1) Competence, (2) Discomfort, (3) Social Presentation, and (4) Social Information Processing. The first two aspects were measured using the Robot Social Attributes Scale (RoSAS) [
8], which includes robot Competence and Discomfort factors. The items were answered in relation to how the robot moved during the tasks. Ratings for the Competence and Discomfort scales were gathered on a 7-point responding format ranging from 1 (Definitely Not Associated with the robot) to 7 (Definitely Associated), which was the same as the original RoSAS responding format.
Robot Social Presentation and Social Information Processing were measured using the short-form of the
Perceived Social Intelligence (PSI) questionnaire [
4]. The Social Presentation scale had a total of seven items, all of which began with “This robot…” and ended with statements such as “enjoys meeting people,” and “cares about others.” The Social Information Processing scale had a total of 13 items, which started with “This robot…” and ended with statements like “responds appropriately to human emotion” or “can figure out what people think.” Ratings for PSI statements were gathered on a 5-point responding format ranging from 1 (Strongly Disagree) to 5 (Strongly Agree), which was the same as the original PSI responding format.
For each scale, we aggregated responses across items to calculate a composite measure after confirming high internal reliability. The Cronbach’s
\(\alpha\) values were
\(0.90\) for Competence,
\(0.76\) for Discomfort,
\(0.76\) for Social Presentation, and
\(0.94\) for Social Information Processing. The Cronbach’s
\(\alpha\) value for each aspect we measured was within the 0.7 to 0.95 acceptable value range [
60].
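As a sketch of how such composite scores and reliability estimates can be computed (the data layout, column names, and aggregation by mean shown here are illustrative rather than a verbatim description of our analysis):
\begin{verbatim}
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a participants-by-items matrix of one scale."""
    k = items.shape[1]
    item_variance_sum = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variance_sum / total_variance)

# Composite score per participant, e.g., the mean across a scale's items;
# "competence_cols" would list the RoSAS Competence item columns.
# df["competence"] = df[competence_cols].mean(axis=1)
\end{verbatim}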
Perceived Workload. We used items from the NASA Task Load Index (TLX) [
19] to assess the perceived workload for the Follow and Art Tasks. Perceptions of Mental Demand, Physical Demand, Temporal Demand, Effort, and Frustration were gathered on a 7-point responding format from 1 (lowest) to 7 (highest). The 7-point format was used for consistency with the other scales and was chosen over a 5-point format because responding formats with 6 or more categories have been shown to correlate better [
51]. Example survey items included “How mentally demanding were the tasks?” (Mental Demand) and “How insecure, discouraged, irritated, stressed, and annoyed were you?” (Frustration). The Cronbach’s
\(\alpha\) for the NASA TLX survey items was
\(0.75\), which is within the 0.7–0.95 range of acceptable values [
60].