In this section, we first describe how trust was impacted by the last impression of trainee robots (RQ1 and RQ2) and then explain how participants’ perceptions of them changed over time (RQ3). Finally, we present the results related to robot appearance (RQ4).
As explained before, to ensure that the assumptions of this study were not violated, we removed data from participants who rated the severity of an error differently from what was expected (e.g., rating a small error as a big error). Figure 8 shows the final ratings of the robots’ errors. Regarding the small errors, which concerned preparing tea, two-sided independent-sample t-tests did not show any statistically significant difference between replacing items on the right side and the left side of the tea ( \(t(372)=1.013,p=.312\) ). Likewise, neither the difference between replacing either of the two types of milk ( \(t(323)=0.250,p=.803\) ) nor the difference between replacing either of the two types of sweetener ( \(t(323)=-0.025,p=.980\) ) was significant. In cases in which the robot correctly added the selected items (rounds 3, 4, and 5 for all conditions, plus round 6 for condition 1), participants rated the behaviours as no error. Thus, the amount of each added ingredient and other such details were not important to the participants.
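For readers unfamiliar with the statistic reported here, a pooled-variance independent-samples t can be computed as follows. This is a generic sketch on made-up ratings, not the study’s analysis code or data:

```python
import math
from statistics import mean, variance

def independent_t(a, b):
    """Two-sided independent-samples t statistic with pooled variance.

    Returns (t, df); the p-value would come from the t distribution
    with df degrees of freedom (e.g., via scipy.stats.t.sf).
    """
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical severity ratings for two kinds of item-replacement error
t, df = independent_t([1, 2, 3], [2, 3, 4])
```

With real samples of sizes 187 apiece, the same computation yields the \(t(372)\) values reported above.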
6.6.1 Last Impression of a Student Robot Affecting Trust —RQ1, RQ2.
This subsection presents findings from the trust and learning evaluation questionnaires, which every participant answered after teaching each of the two robots (Step 6).
Preferences in cooking and laundry tasks: To investigate participants’ trust and test H1.1 and H1.2, we asked them to specify whether they would allow the robot to cook dinner for them alone or collaboratively, or would prefer to do this task on their own or to buy food from a restaurant. The results, grouped by three different factors (appearance, encounter, and condition), are displayed in Figure 9. This figure also includes the same question with regard to doing laundry, which we used to study the transfer of trust to other tasks. The first two choices, which allowed the robot to take part in some form, are grouped together as a sign of a positive attitude toward using the robot and trusting it. The other two choices, which excluded the robot entirely, are grouped together as indicators of a negative attitude toward using the robot and not trusting it.
As shown in Figure 9 and previously presented in [2], the percentage of participants who did not trust the robot increased as the final errors became more severe from condition 1 to condition 3. In contrast, the number of participants who trusted the robots appeared similar across the levels of the appearance and encounter factors. Using GLMs, we further examined the effects of appearance, encounter, and condition on these two measures of trust while accounting for the confounding factors. The models presented in Table 3 confirmed the significance of the differences observed in Figure 9: for both the cooking and laundry scenarios, of those factors only the last impression of learning (i.e., condition) had a significant effect, in addition to some effects of items in the TIPI and DT questionnaires. In other words, the experimental condition significantly affected participants’ trust regarding both the cooking and the laundry task.
Pairwise comparisons adjusted using the Holm-Bonferroni method showed that small errors of the robots in the sixth practising round (condition 2) significantly decreased trust compared with when the behaviours were correct ( \(se = 0.35, z = -3.052, p \lt .01\) ). Big errors at the end (condition 3) negatively affected trust even more compared with small errors ( \(se = 0.34, z = -3.833, p \lt .001\) ). In terms of confounding factors and related to H3 and H4, we found that participants who had a higher disposition for trusting people’s benevolence were significantly more willing to let the robots participate in cooking dinner for them ( \(se = 0.13, z = 2.350, p \lt .05\) ). The same effect was observed with those who were more open to new experiences ( \(se = 0.11, z = 2.194, p \lt .05\) ). We noticed a trend, approaching significance, suggesting that those with higher dispositions for trusting people’s competencies may rely more on the robots for cooking ( \(se = 0.14, z = 1.809, p = .070\) ). In contrast, participants with higher conscientiousness had a significantly lower tendency to allow the robots to cook for them ( \(se = 0.14, z = -2.194, p \lt .05\) ).
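The Holm-Bonferroni step-down procedure used to adjust these pairwise comparisons can be sketched in a few lines. This is a generic implementation for illustration, not the study’s analysis code:

```python
def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjustment of a list of p-values.

    The i-th smallest p-value is multiplied by (m - i), and the
    adjusted values are then forced to be non-decreasing in rank
    order (and capped at 1).
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

# Three hypothetical raw p-values from pairwise condition comparisons
adj = holm_adjust([0.02, 0.003, 0.04])
```

An adjusted p-value is then compared against the usual .05 threshold, which controls the family-wise error rate across the set of comparisons.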
Regarding the transfer of trust to a laundry scenario, the adjusted pairwise tests revealed that participants in condition 2 had significantly lower faith in the robots helping with that task compared with those in condition 1 ( \(se = 0.38, z = -2.572, p \lt .05\) ). Participants’ trust in robots doing their laundry was even lower when a big error happened at the end of learning the cooking task (condition 3), compared with when a small error was made ( \(se = 0.32, z = -2.560, p \lt .05\) ). Participants with a higher disposition for trusting others’ competencies were more likely to trust robots to do their laundry ( \(se = 0.12, z = 3.357, p \lt .001\) ). Those with greater openness to experience also trusted the robots more in this laundry task ( \(se = 0.10, z = 2.341, p \lt .05\) ). However, participants who were more emotionally stable trusted the robots less in the laundry task ( \(se = 0.10, z = -2.081, p \lt .05\) ).
Participants’ attitudes: The participants also responded to four questions on continuous scales regarding the teaching experience and their opinions of the robots after completing the six rounds with each robot (Step 6). These items were the perceived realism of the teaching scenarios, the perceived improvement of the robots over time, the expected success of the robots in teaching cooking tasks to another robot, and the likelihood of using robots to assist with chores in the future. The average ratings for each robot (i.e., the appearance factor) in the various conditions are plotted in Figure 10. The LMMs presented in Table 4 were employed to further investigate these measures. The cumulative results (both robots together) for the last two measures are also described in [2].
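Models like those in Table 4 are linear mixed models with a per-participant random intercept, which accounts for the repeated measures each participant contributes. A minimal sketch on synthetic data follows; the variable names (`severity`, `rating`) and all numbers are illustrative assumptions, not the study’s data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic repeated-measures data: 20 participants x 6 rounds, with a
# random per-participant intercept and a made-up negative effect of
# error severity on the rating.
rng = np.random.default_rng(42)
rows = []
for pid in range(20):
    u = rng.normal(0, 5)                 # participant-level intercept
    for rnd in range(1, 7):
        severity = max(3 - rnd, 0)       # errors shrink over rounds 1-3
        rating = 70 - 10 * severity + u + rng.normal(0, 3)
        rows.append({"participant": pid, "round": rnd,
                     "severity": severity, "rating": rating})
df = pd.DataFrame(rows)

# LMM: fixed effect of severity, random intercept per participant
model = smf.mixedlm("rating ~ severity", df, groups=df["participant"])
result = model.fit()
print(result.params["severity"])         # recovers a value near the true -10
```

The fixed-effect coefficients and their standard errors from such a fit are what tables of LMM results typically report.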
Regarding how realistic the teaching scenarios looked, although the ratings in condition 3 seem slightly lower than in the other conditions according to Figure 10, we did not observe any significant effect of condition on how realistic the task was rated. Only some items of the DT questionnaire (see Table 2) affected the perceived realism of the task. Participants with a higher disposition for trusting people’s integrity ( \(se = 25.19, t = 2.811, p \lt .01\) ) and competence ( \(se = 15.85, t = 2.593, p \lt .05\) ) rated the scenarios as more realistic, but participants with a higher disposition for trusting people’s benevolence ( \(se = 24.88, t = -2.676, p \lt .01\) ) rated the scenario as less realistic.
While the appearance of the robots did not affect the ratings on any of those scales, the errors (i.e., condition) affected all of the remaining measures described here. Adjusted pairwise comparisons showed that participants perceived significantly less improvement in the robots’ performance after big errors at the end compared with after a small error ( \(se = 34.89, t = -3.263, p \lt .01\) ) or no error ( \(se = 35.71, t = -10.580, p \lt .001\) ). Small errors also decreased this rating compared with no error ( \(se = 34.93, t = -7.556, p \lt .001\) ). For this measure, in contrast to the perceived realism of the teaching scenario, those with a higher disposition for trusting people’s benevolence rated the robots’ improvement higher ( \(se = 10.50, t = 2.295, p \lt .05\) ).
The only factor found to have affected the expected success of the robots in teaching the cooking tasks to another robot was the experimental condition. Small ( \(se = 41.65, t = -5.831, p \lt .001\) ) and big ( \(se = 42.77, t = -7.040, p \lt .001\) ) errors in the last round significantly decreased the ratings for this measure, with no difference detected between the big and small errors ( \(se = 41.89, t = -1.391, p = .167\) ). Finally, the likelihood that participants would use the robots in the future to assist them was affected by the last errors and by their disposition for trusting people’s competencies ( \(se = 17.64, t = 3.087, p \lt .01\) ). When big errors happened in the sixth round, participants were less inclined to use the robots compared with when small errors happened ( \(se = 50.75, t = -2.301, p \lt .05\) ) or the behaviour was fully correct ( \(se = 51.68, t = -6.474, p \lt .001\) ). A small error had the same effect compared with having no final error ( \(se = 50.29, t = -4.331, p \lt .001\) ).
6.6.2 Human Teachers’ Perception Change Over Time —RQ3.
Results presented in this part are based on participants’ evaluations of the robots in the teaching process and behaviour evaluation step (rated after watching every round of a robot practising in Step 5). As the severity of errors decreased from round 1 to round 3, ratings on all of the scales (see Figure 11) improved and remained high until the sixth round, then dropped if/when the errors happened again. These ratings did not appear to be affected by the clothing style of the robots (the statistical analysis is presented later). Paired-sample t-tests did not reveal any significant differences in the rated severity of small ( \(t(186)=-0.905,p=.366\) ) and big ( \(t(181)=0.903,p=.368\) ) errors made by the tidy or untidy robots.
Another factor found to be important was the order in which each participant taught the robots (i.e., the encounter factor). Figure A.1 shows the participants’ ratings of the first and second robots that they observed. As shown later in the statistical models, the encounter factor significantly affected all of the measures except the robots’ perceived calmness. As a post-hoc test, using one-sided paired-sample t-tests, we detected significant differences between the first and second robots in certain rounds and for some of the tested attributes. In all cases except perceived calmness in round 6, the average ratings of the attributes were higher for the first robot (regardless of whether it was tidy or untidy). While the appearance of the robots did not seem to affect the rated severity of errors, the small errors were rated as less severe for the robot that was taught first ( \(t(186)=-2.103,p\lt .05\) ). There was a similar difference in the same direction for the big errors, but it only approached statistical significance ( \(t(181)=-1.639,p=.052\) ).
Participants experienced different behaviours of the robots in the sixth round based on the conditions to which they were assigned. Figure 11 presents the average ratings of the robots by condition (note that each participant rated two robots). In line with the general trends described previously, the sixth-round scores decreased consistently with the severity of the errors (i.e., condition). Since rounds 1 to 5 were identical for all conditions, there was no experimental difference among the three conditions plotted on the left side of the dashed lines in Figure 11.
According to one-sided paired-sample t-tests adjusted using the Holm-Bonferroni method for multiple testing, all measured attributes significantly increased from round 1 to round 2, and from round 2 to round 3. This corresponded to the improvement in the behaviour of the robots, as specified in H3. By also comparing round 3 with round 4, as well as round 4 with round 5, we observed some instances in which the ratings of the robots improved significantly even though the behaviours of the robots were the same. The ratings of confidence ( \(t(137)=3.697, p\lt .001\) ), calmness ( \(t(137)=2.067, p\lt .05\) ), liking the task ( \(t(137)=3.891, p\lt .001\) ), and eagerness to learn ( \(t(137)=1.735, p\lt .05\) ) were higher in round 5 compared with round 4. In addition, participants rated eagerness to learn higher in round 4 compared with round 3 ( \(t(137)=3.496, p\lt .05\) ).
Further tests revealed that within each condition, there were some statistically significant differences in the ratings when the same errors happened in the beginning (either round 1 or 2) compared with at the end (round 6). For participants in condition 2 and regarding the small errors of the robots, ratings of confidence ( \(t(48)=2.140, p\lt .05\) ), calmness ( \(t(48)=2.055, p\lt .05\) ), and being goal driven ( \(t(48)=2.304, p\lt .05\) ) were significantly higher in round 6 compared with round 2. Concerning the big errors for participants in condition 3, the opposite effect was detected; the robots were rated as liking the task less ( \(t(43)=-1.719, p\lt .05\) ), less eager to learn ( \(t(43)=-2.243, p\lt .05\) ), and less goal driven ( \(t(43)=-2.283, p\lt .05\) ) after the big error in round 6, compared with round 1.
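These round-to-round comparisons rest on the paired-sample t statistic, which tests the mean of the per-participant differences against zero. A generic sketch with hypothetical paired ratings (not the study’s data):

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic: mean difference over its standard error.

    Returns (t, df); a one-sided p-value would come from one tail of
    the t distribution with df degrees of freedom.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

# Hypothetical ratings by the same four participants in two rounds
t, df = paired_t([1, 2, 3, 4], [2, 2, 4, 5])
```

Because each participant rates both rounds, the pairing removes between-participant variability that an independent-samples test would leave in the error term.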
Predictive models: To further study the impact on participants’ ratings of the trainee robots, the LMMs predicting the attributes (i.e., confidence, calmness, liking the task, attention to the task, proficiency, eagerness to learn, and being goal driven) are summarized in Table 5. As can be seen, severity and encounter affected every measure, with the single exception that encounter did not affect the robots’ perceived calmness. In contrast, the appearance of the robot and participants’ age and gender did not affect any of the attributes.
All measures (except calmness) were generally rated higher for the first robot each participant taught. All of the ratings were negatively correlated with the severity of errors in each round. Regarding the TIPI scales, we found that more extraverted participants rated the robots as less confident ( \(se=6.07, t=-2.412, p\lt .05\) ), calm ( \(se=6.85, t=-3.222, p\lt .01\) ), attentive ( \(se=5.28, t=-2.249, p\lt .05\) ), and goal driven ( \(se=5.67, t=-2.641, p\lt .01\) ). Participants with higher conscientiousness rated the robot as more attentive ( \(se=7.96, t=3.251, p\lt .01\) ) and goal driven ( \(se=8.55, t=2.053, p\lt .05\) ). The robots tended to be rated as calmer by those with higher emotional stability, although this effect only approached significance ( \(se=7.69, t=1.943, p=.055\) ). Finally, participants who were more open to experiences rated the robots as liking the task more ( \(se=6.95, t=2.420, p\lt .05\) ) and as more proficient ( \(se=5.57, t=2.194, p\lt .05\) ) and eager to learn ( \(se=6.66, t=2.150, p\lt .05\) ).
Regarding the DT questionnaire items (see Table 2), participants with a higher disposition for trusting people’s competencies rated the robot as liking the task more ( \(se=8.56, t=3.121, p\lt .05\) ) and as more eager to learn ( \(se=8.21, t=2.991, p\lt .01\) ). The robots were rated as more confident ( \(se=7.29, t=3.114, p\lt .01\) ), calm ( \(se=7.81, t=3.729, p\lt .001\) ), attentive ( \(se = 6.12, t = 2.612, p \lt .01\) ), proficient ( \(se=5.39, t=2.704, p\lt .01\) ), and goal driven ( \(se=6.57, t=2.038, p\lt .05\) ) by those who scored higher on the trusting stance sub-scale of the DT questionnaire.