In this section, we first describe how trust was impacted by the last impression of trainee robots (RQ1 and RQ2) and then explain how participants’ perceptions of them changed over time (RQ3). Finally, we present the results related to robot appearance (RQ4).
As explained before, to ensure that the assumptions of this study were not violated, we removed data from participants who rated the severity of an error differently from what was expected (e.g., rating a small error as a big error). Figure 8 shows the final ratings of the robots’ errors. Regarding the small errors, which concerned preparing tea, two-sided independent-sample t-tests did not show any statistically significant difference between replacing items on the right side and the left side of the tea ( \(t(372)=1.013,p=.312\) ). Likewise, neither the difference between replacing either of the two types of milk ( \(t(323)=0.250,p=.803\) ) nor the difference between replacing either of the two types of sweetener ( \(t(323)=-0.025,p=.980\) ) was significant. In cases in which the robot correctly added the selected items (rounds 3, 4, and 5 for all conditions, plus round 6 for condition 1), participants rated the behaviours as no error. Thus, the amount of each added ingredient and other such details were not important to the participants.
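For readers unfamiliar with the statistic reported here, a pooled-variance independent-samples t can be computed as follows. This is a generic sketch on made-up ratings, not the study’s analysis code or data:

```python
import math
from statistics import mean, variance

def independent_t(a, b):
    """Two-sided independent-samples t statistic with pooled variance.

    Returns (t, df); the p-value would come from the t distribution
    with df degrees of freedom (e.g., via scipy.stats.t.sf).
    """
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical severity ratings for two kinds of item-replacement error
t, df = independent_t([1, 2, 3], [2, 3, 4])
```

With real samples of sizes 187 apiece, the same computation yields the \(t(372)\) values reported above.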
6.6.1 Last Impression of a Student Robot Affecting Trust —RQ1, RQ2.
This subsection presents findings from the trust and learning evaluation questionnaires, which every participant answered after teaching each of the two robots (Step 6).
Preferences in cooking and laundry tasks: To investigate participants’ trust and test H1.1 and H1.2, we asked them to specify whether they would allow the robot to cook dinner for them alone or collaboratively, or would prefer to do this task on their own or to buy food from a restaurant. The results, grouped by three different factors (appearance, encounter, and condition), are displayed in Figure 9. This figure also includes the same question with regard to doing laundry, which we used to study the transfer of trust to other tasks. The first two choices, which allowed the robot to take part in some form, are grouped together as a sign of a positive attitude toward using the robot and trusting it. The other two choices, which excluded the robot entirely, are grouped together as indicators of a negative attitude toward using the robot and not trusting it.
As shown in Figure 9 and previously presented in [2], the percentage of participants who did not trust the robot increased as the final errors became more severe from condition 1 to condition 3. In contrast, the number of participants who trusted the robots appeared similar across the levels of the appearance and encounter factors. Using GLMs, we further examined the effects of appearance, encounter, and condition on these two measures of trust while accounting for the confounding factors. The models presented in Table 3 confirmed the significance of the differences observed in Figure 9: for both the cooking and laundry scenarios, of those factors only the last impression of learning (i.e., condition) had a significant effect, in addition to some effects of items in the TIPI and DT questionnaires. In other words, the experimental condition significantly affected participants’ trust regarding both the cooking and the laundry task.
Pairwise comparisons adjusted using the Holm-Bonferroni method showed that small errors of the robots in the sixth practising round (condition 2) significantly decreased trust compared with when the behaviours were correct ( \(se = 0.35, z = -3.052, p \lt .01\) ). Big errors at the end (condition 3) negatively affected trust even more compared with small errors ( \(se = 0.34, z = -3.833, p \lt .001\) ). In terms of confounding factors and related to H3 and H4, we found that participants who had a higher disposition for trusting people’s benevolence were significantly more willing to let the robots participate in cooking dinner for them ( \(se = 0.13, z = 2.350, p \lt .05\) ). The same effect was observed with those who were more open to new experiences ( \(se = 0.11, z = 2.194, p \lt .05\) ). We noticed a trend, approaching significance, suggesting that those with higher dispositions for trusting people’s competencies may rely more on the robots for cooking ( \(se = 0.14, z = 1.809, p = .070\) ). In contrast, participants with higher conscientiousness had a significantly lower tendency to allow the robots to cook for them ( \(se = 0.14, z = -2.194, p \lt .05\) ).
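The Holm-Bonferroni step-down procedure used to adjust these pairwise comparisons can be sketched in a few lines. This is a generic implementation for illustration, not the study’s analysis code:

```python
def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjustment of a list of p-values.

    The i-th smallest p-value is multiplied by (m - i), and the
    adjusted values are then forced to be non-decreasing in rank
    order (and capped at 1).
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

# Three hypothetical raw p-values from pairwise condition comparisons
adj = holm_adjust([0.02, 0.003, 0.04])
```

An adjusted p-value is then compared against the usual .05 threshold, which controls the family-wise error rate across the set of comparisons.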
Regarding the transfer of trust to a laundry scenario, the adjusted pairwise tests revealed that participants in condition 2 had significantly lower faith in the robots helping with that task compared with those in condition 1 ( \(se = 0.38, z = -2.572, p \lt .05\) ). Participants’ trust in robots doing their laundry was even lower when a big error happened at the end of learning the cooking task (condition 3), compared with when a small error was made ( \(se = 0.32, z = -2.560, p \lt .05\) ). Participants with a higher disposition for trusting others’ competencies were more likely to trust robots to do their laundry ( \(se = 0.12, z = 3.357, p \lt .001\) ). Those with greater openness to experience also trusted the robots more in this laundry task ( \(se = 0.10, z = 2.341, p \lt .05\) ). However, participants who were more emotionally stable trusted the robots less in the laundry task ( \(se = 0.10, z = -2.081, p \lt .05\) ).
Participants’ attitudes: The participants also responded to four questions on continuous scales regarding the teaching experience and their opinions of the robots after completing the six rounds with each robot (Step 6). These items were the perceived realism of the teaching scenarios, the perceived improvement of the robots over time, the expected success of the robots in teaching cooking tasks to another robot, and the likelihood of using robots to assist with chores in the future. The average ratings for each robot (i.e., the appearance factor) in the various conditions are plotted in Figure 10. The LMMs presented in Table 4 were employed to further investigate these measures. The cumulative results (both robots together) for the last two measures are also described in [2].
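Models like those in Table 4 are linear mixed models with a per-participant random intercept, which accounts for the repeated measures each participant contributes. A minimal sketch on synthetic data follows; the variable names (`severity`, `rating`) and all numbers are illustrative assumptions, not the study’s data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic repeated-measures data: 20 participants x 6 rounds, with a
# random per-participant intercept and a made-up negative effect of
# error severity on the rating.
rng = np.random.default_rng(42)
rows = []
for pid in range(20):
    u = rng.normal(0, 5)                 # participant-level intercept
    for rnd in range(1, 7):
        severity = max(3 - rnd, 0)       # errors shrink over rounds 1-3
        rating = 70 - 10 * severity + u + rng.normal(0, 3)
        rows.append({"participant": pid, "round": rnd,
                     "severity": severity, "rating": rating})
df = pd.DataFrame(rows)

# LMM: fixed effect of severity, random intercept per participant
model = smf.mixedlm("rating ~ severity", df, groups=df["participant"])
result = model.fit()
print(result.params["severity"])         # recovers a value near the true -10
```

The fixed-effect coefficients and their standard errors from such a fit are what tables of LMM results typically report.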
Regarding how realistic the teaching scenarios looked, although the ratings in condition 3 seem slightly lower than in the other conditions according to Figure 10, we did not observe any significant effect of condition on how realistic the task was rated. Only some items of the DT questionnaire (see Table 2) affected the perceived realism of the task. Participants with a higher disposition for trusting people’s integrity ( \(se = 25.19, t = 2.811, p \lt .01\) ) and competence ( \(se = 15.85, t = 2.593, p \lt .05\) ) rated the scenarios as more realistic, but participants with a higher disposition for trusting people’s benevolence ( \(se = 24.88, t = -2.676, p \lt .01\) ) rated the scenario as less realistic.
While the appearance of the robots did not affect the ratings on any of those scales, the errors (i.e., condition) affected all of the remaining measures described here. Adjusted pairwise comparisons showed that participants perceived significantly less improvement in the robots’ performance after big errors at the end compared with after a small error ( \(se = 34.89, t = -3.263, p \lt .01\) ) or no error ( \(se = 35.71, t = -10.580, p \lt .001\) ). Small errors also decreased this rating compared with no error ( \(se = 34.93, t = -7.556, p \lt .001\) ). For this measure, in contrast to the perceived realism of the teaching scenario, those with a higher disposition for trusting people’s benevolence rated the robots’ improvement higher ( \(se = 10.50, t = 2.295, p \lt .05\) ).
The only factor found to have affected the expected success of the robots in teaching the cooking tasks to another robot was the experimental condition. Small ( \(se = 41.65, t = -5.831, p \lt .001\) ) and big ( \(se = 42.77, t = -7.040, p \lt .001\) ) errors in the last round significantly decreased the ratings for this measure, with no difference detected between the big and small errors ( \(se = 41.89, t = -1.391, p = .167\) ). Finally, the likelihood that participants would use the robots in the future to assist them was affected by the last errors and by their disposition for trusting people’s competencies ( \(se = 17.64, t = 3.087, p \lt .01\) ). When big errors happened in the sixth round, participants were less inclined to use the robots compared with when small errors happened ( \(se = 50.75, t = -2.301, p \lt .05\) ) or the behaviour was fully correct ( \(se = 51.68, t = -6.474, p \lt .001\) ). A small error had the same effect compared with having no final error ( \(se = 50.29, t = -4.331, p \lt .001\) ).
6.6.2 Human Teachers’ Perception Change Over Time —RQ3.
Results presented in this part are based on participants’ evaluations of the robots in the teaching process and behaviour evaluation step (rated after watching every round of a robot practising in Step 5). As the severity of errors decreased from round 1 to round 3, ratings on all of the scales (see Figure 11) improved and remained high until the sixth round, then dropped if/when the errors happened again. These ratings did not appear to be affected by the clothing style of the robots (the statistical analysis is presented later). Paired-sample t-tests did not reveal any significant differences in the rated severity of small ( \(t(186)=-0.905,p=.366\) ) and big ( \(t(181)=0.903,p=.368\) ) errors made by the tidy or untidy robots.
Another factor found to be important was the order in which each participant taught the robots (i.e., the encounter factor). Figure A.1 shows the participants’ ratings of the first and second robots that they observed. As shown later in the statistical models, the encounter factor significantly affected all of the measures except the robots’ perceived calmness. As a post-hoc test, using one-sided paired-sample t-tests, we detected significant differences between the first and second robots in certain rounds and for some of the tested attributes. In all cases except perceived calmness in round 6, the average ratings of the attributes were higher for the first robot (regardless of whether it was tidy or untidy). While the appearance of the robots did not seem to affect the rated severity of errors, the small errors were rated as less severe for the robot that was taught first ( \(t(186)=-2.103,p\lt .05\) ). There was a similar difference in the same direction for the big errors, but it only approached statistical significance ( \(t(181)=-1.639,p=.052\) ).
Participants experienced different behaviours of the robots in the sixth round based on the conditions to which they were assigned. Figure 11 presents the average ratings of the robots by condition (note that each participant rated two robots). In line with the general trends described previously, the sixth-round scores decreased consistently with the severity of the errors (i.e., condition). Since rounds 1 to 5 were identical for all conditions, there was no experimental difference among the three conditions plotted on the left side of the dashed lines in Figure 11.
According to one-sided paired-sample t-tests adjusted using the Holm-Bonferroni method for multiple testing, all measured attributes significantly increased from round 1 to round 2, and from round 2 to round 3. This corresponded to the improvement in the behaviour of the robots, as specified in H3. By also comparing round 3 with round 4, as well as round 4 with round 5, we observed some instances in which the ratings of the robots improved significantly even though the behaviours of the robots were the same. The ratings of confidence ( \(t(137)=3.697, p\lt .001\) ), calmness ( \(t(137)=2.067, p\lt .05\) ), liking the task ( \(t(137)=3.891, p\lt .001\) ), and eagerness to learn ( \(t(137)=1.735, p\lt .05\) ) were higher in round 5 compared with round 4. In addition, participants rated eagerness to learn higher in round 4 compared with round 3 ( \(t(137)=3.496, p\lt .05\) ).
Further tests revealed that within each condition, there were some statistically significant differences in the ratings when the same errors happened in the beginning (either round 1 or 2) compared with at the end (round 6). For participants in condition 2 and regarding the small errors of the robots, ratings of confidence ( \(t(48)=2.140, p\lt .05\) ), calmness ( \(t(48)=2.055, p\lt .05\) ), and being goal driven ( \(t(48)=2.304, p\lt .05\) ) were significantly higher in round 6 compared with round 2. Concerning the big errors for participants in condition 3, the opposite effect was detected; the robots were rated as liking the task less ( \(t(43)=-1.719, p\lt .05\) ), less eager to learn ( \(t(43)=-2.243, p\lt .05\) ), and less goal driven ( \(t(43)=-2.283, p\lt .05\) ) after the big error in round 6, compared with round 1.
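These round-to-round comparisons rest on the paired-sample t statistic, which tests the mean of the per-participant differences against zero. A generic sketch with hypothetical paired ratings (not the study’s data):

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic: mean difference over its standard error.

    Returns (t, df); a one-sided p-value would come from one tail of
    the t distribution with df degrees of freedom.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

# Hypothetical ratings by the same four participants in two rounds
t, df = paired_t([1, 2, 3, 4], [2, 2, 4, 5])
```

Because each participant rates both rounds, the pairing removes between-participant variability that an independent-samples test would leave in the error term.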
Predictive models: To further study the impact on participants’ ratings of the trainee robots, the LMMs predicting the attributes (i.e., confidence, calmness, liking the task, attention to the task, proficiency, eagerness to learn, and being goal driven) are summarized in Table 5. As can be seen, severity and encounter affected every measure, with the single exception that encounter did not affect the robots’ perceived calmness. In contrast, the appearance of the robot and participants’ age and gender did not affect any of the attributes.
All measures (except calmness) were generally rated higher for the first robot each participant taught. All of the ratings were negatively correlated with the severity of errors in each round. Regarding the TIPI scales, we found that more extraverted participants rated the robots as less confident ( \(se=6.07, t=-2.412, p\lt .05\) ), calm ( \(se=6.85, t=-3.222, p\lt .01\) ), attentive ( \(se=5.28, t=-2.249, p\lt .05\) ), and goal driven ( \(se=5.67, t=-2.641, p\lt .01\) ). Participants with higher conscientiousness rated the robot as more attentive ( \(se=7.96, t=3.251, p\lt .01\) ) and goal driven ( \(se=8.55, t=2.053, p\lt .05\) ). The robots tended to be rated as calmer by those with higher emotional stability, although this effect only approached significance ( \(se=7.69, t=1.943, p=.055\) ). Finally, participants who were more open to experiences rated the robots as liking the task more ( \(se=6.95, t=2.420, p\lt .05\) ) and as more proficient ( \(se=5.57, t=2.194, p\lt .05\) ) and eager to learn ( \(se=6.66, t=2.150, p\lt .05\) ).
Regarding the DT questionnaire items (see Table 2), participants with a higher disposition for trusting people’s competencies rated the robot as liking the task more ( \(se=8.56, t=3.121, p\lt .05\) ) and as more eager to learn ( \(se=8.21, t=2.991, p\lt .01\) ). The robots were rated as more confident ( \(se=7.29, t=3.114, p\lt .01\) ), calm ( \(se=7.81, t=3.729, p\lt .001\) ), attentive ( \(se = 6.12, t = 2.612, p \lt .01\) ), proficient ( \(se=5.39, t=2.704, p\lt .01\) ), and goal driven ( \(se=6.57, t=2.038, p\lt .05\) ) by those who scored higher on the trusting stance sub-scale of the DT questionnaire.