1 Introduction
The need for human input on demand has steadily increased alongside the growing adoption of artificial intelligence (AI) and machine learning (ML) systems across domains [
28]. The foundations of many AI systems we interact with daily rely on the labor of crowd work [
30]. With the availability of crowd workers on-demand [
12], human intelligence tasks (HITs) can be distributed and completed at scale on crowdsourcing platforms like Amazon Mechanical Turk, Prolific, and Toloka. Tasks range from data labeling [
10], image annotation [
48], and classification [
82] to the creation and support of real-time healthcare applications [
3,
7].
Due to their repetitive nature, HITs can be monotonous and boring, leading to task rejection and drop-out [
34,
60], which is problematic for both crowd workers and task requesters. Task rejection can affect the morale of crowd workers [
17], and high drop-out rates result in low-quality crowd work, also affecting worker pay. Monotonous and boring work decreases the motivation of workers [
9], resulting in reduced worker engagement. Furthermore, motivation is known to be an essential factor when it comes to reducing work-related stress and burnout [
87]. Similarly, job satisfaction has been shown to be positively related to subjective well-being [
6]. To mitigate the detrimental effects of monotonous and tedious tasks for crowd workers and task requesters, we need to improve worker engagement by creating better worker experiences. In the long run, this can also improve the quality of crowd work [
97].
Although some crowdsourcing tasks require collaboration and teamwork among workers [
10,
19,
55,
66], workers typically execute microtasks individually and sometimes in isolation [
24,
57]. As a result, not all workers have the opportunity to experience a sense of community, and little is typically done to increase group identification among workers. In addition to improving worker engagement during task execution, increasing a sense of community can go a long way toward creating better worker experiences. Prior work has shown that crowd workers use external forums to communicate with other crowd workers [
91,
95,
96], such as Reddit HWTF, Facebook, MTurkGrind, MTurkForum, and Turkernation. However, elaborate social interventions and facilitating extensive engagement via forums are not viable solutions for all workers. While many crowd workers have been shown to communicate with other workers, many others do not and work alone [
96]. In part, this may be due to workers not having time to engage in external forums as a result of other commitments not related to crowd work [
1]. It is, therefore, prudent to explore whether a lightweight method that does not require extensive social engagement or exchange of private information can still help build a sense of community among workers while completing tasks individually. Through our work, we aim to address these research and empirical gaps.
Digital avatars are known to increase identification and user experience in online multi-player video games [
89], solitary educational games [
40], and conversational crowdsourcing tasks [
68]. Moreover, the ability to personalize the avatar by customizing its appearance further increases users’ self-identification with the avatar [
5]. Prior HCI research has shown a promising impact of crowd worker avatar customization within a conversational interface to reduce cognitive workload and increase worker retention [
68]. However, the notion of evolving and customizable worker avatars and their effect on worker experience and task-related outcomes remains unexplored. Addressing this research gap, we propose to couple avatar evolution and customization with workers’ progress in task batches.
Since digital avatars facilitate the creation of a virtual identity [
5,
63,
89], we argue that a personal worker avatar can be an effective tool to increase a sense of community among the workers while protecting their privacy. Prior research found that avatar identification relates to [
90] and predicts [
23] group identification in online video games. Gabbiadini et al. [
23] explained that when users see their avatar in the group, they imagine themselves as being part of the group. Similarly, Takano and Taka [
86] found that avatar identification has a positive effect on the feeling of belonging, partially mediated by self-expression. Inspired by this prior literature, we aim to facilitate group identification by creating a community space where workers can share their personalized avatars with other crowd workers. With the worker community space, we aim to build a lightweight intervention that can be used in tasks without elements of collaboration to reflect a feeling of unity [
84] by placing the virtual identity of the worker among other worker avatars. As a part of customization, the facial expressions of avatars can then be used to share (task-related) feelings with other workers in a community space on task completion, as sharing feelings (affective self-disclosure) can contribute to a feeling of connection [
84]. Combining the interventions of evolving avatars and group identification, we address three research questions (RQ1–RQ3) in our work.
By combining avatar customization, gamified avatar evolution, and creating a sense of community, we aim to improve overall worker experiences and the quality of the task outcomes. Worker experiences can be described and measured by their
perceived workload,
intrinsic motivation, and
subjective engagement. Furthermore, we aim to analyze the impact of these interventions on task-related outcomes, such as retention, accuracy, and overall task execution time. To this end, we carried out a between-subjects study by recruiting workers from the Prolific crowdsourcing platform (
N = 680), spanning five experimental conditions and considering two popular types of tasks (information finding and credibility analysis). We found that evolving and customizable worker avatars can increase worker retention. Although the worker community space was not successful in fostering an increased sense of group identification among crowd workers, we found that this varied across workers based on the extent to which they considered themselves to be crowd workers. Workers who identify themselves as crowd workers experience significantly greater perceived workload, intrinsic motivation, and subjective engagement. Our findings have important implications for the design of future conversational crowdsourcing tasks and for crowdsourcing platforms, with an aim to improve worker experiences and foster a sense of community. All code and data pertaining to this work can be found in the OSF repository for the benefit of the community and in the spirit of open science.
4 Results and Analysis
4.1 Demographic Distribution
A total of 680 workers participated in our experiment, equally divided across both task types. One worker was excluded due to technical problems, and three workers were excluded due to invalid answers (all workers from the information finding task). This resulted in a final sample of 676 workers (mean age = 33.83,
SD = 11.23). Of those workers, 61.5% identified as
male (416 workers), 37.3% as
female (252 workers), 1% as
non-binary (7 workers), and 0.1% as
other (1 worker). For the information finding task, 66 workers participated in the
Control condition, 67 in the
Basic condition, 68 in the
Basic⊕ Comm condition, 67 in the
Evolving condition, and 68 in the
Evolving⊕ Comm condition. For the credibility analysis task, this was 68, 67, 68, 69, and 68 respectively. Descriptive statistics related to the use of the avatar editor can be found in the Appendix, Section
B.1. Based on the Shapiro-Wilk tests for normality, none of our dependent measurements were normally distributed for each condition (
p < .05). Therefore, we employed Kruskal-Wallis tests to verify our hypotheses.
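As a sketch, this normality check with a fallback to a non-parametric test can be expressed as follows; the per-condition scores below are synthetic stand-ins (not our data), and the condition names merely mirror our setup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-condition scores (e.g., overall TLX), one array per condition.
conditions = {name: rng.normal(50, 15, size=68) for name in
              ["Control", "Basic", "Basic+Comm", "Evolving", "Evolving+Comm"]}

# Shapiro-Wilk normality check per condition: if any group deviates from
# normality (p < .05), fall back to the non-parametric Kruskal-Wallis test.
normal = all(stats.shapiro(scores).pvalue >= 0.05 for scores in conditions.values())

# Kruskal-Wallis H-test across the five independent groups.
H, p = stats.kruskal(*conditions.values())
print(f"normal={normal}, H={H:.3f}, p={p:.3f}")
```

With real data, the same pattern applies per dependent measure and per condition.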
4.2 Perceived Workload
A non-parametric Kruskal-Wallis test was performed to investigate whether the overall TLX score and its different dimensions differ significantly across the conditions. For both tasks, the overall TLX score and the TLX dimensions did not differ across the different conditions (α = 0.05). Thus, no significant effect was found of evolving avatars and the worker community space on workers’ perceived workload.
4.3 Intrinsic Motivation
A non-parametric Kruskal-Wallis test was performed to investigate whether the overall IMI score and its dimensions differ significantly across the conditions. For both tasks, there were no significant differences found between the conditions for the overall IMI score and its subdimensions (α = 0.05). Thus, no significant effect was found of evolving avatars and a worker community space on workers’ intrinsic motivation.
4.4 Subjective Worker Engagement
A non-parametric Kruskal-Wallis test was performed to investigate whether the overall UES score and its dimensions differ significantly across the experimental conditions (
H1c and
H4c). For the credibility task, we found a significant difference between conditions for the aesthetic appeal (AE) dimension (
df = 4,
H = 9.739,
p =.045,
α = 0.05). A Dunn test was performed with a Bonferroni correction for the
p-value to test which conditions differ significantly. Workers in the credibility analysis task with evolving avatars had a significantly higher aesthetic appeal score compared to workers without an avatar (
Z = −3.029,
p =.025,
α = 0.05; cf. Figure
6b). In contrast, there was no significant difference in aesthetic appeal for the information finding task (cf. Figure
6a).
4.5 Worker Retention
A non-parametric Kruskal-Wallis test was performed to investigate whether worker retention differs significantly across the conditions. The Kruskal-Wallis test showed no significant differences between the conditions for the information finding task (
H = 8.657,
df = 4,
p =.070,
α = 0.05; see Figure
7a). For the credibility analysis task, the Kruskal-Wallis test showed significant differences between the conditions (
H = 13.848,
df = 4,
p =.008,
α = 0.05; see Figure
7b). Based on the Dunn test with a Bonferroni corrected p-value, workers with an evolving avatar had significantly higher retention than workers without an avatar (
Z = −3.121,
p =.018,
α = 0.05). Interestingly, workers with an evolving avatar and the worker community space did not have significantly higher worker retention compared to workers without an avatar (
Z = −2.684,
p =.073).
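The Dunn post-hoc procedure used throughout this section can be sketched as follows. This is a minimal illustration on synthetic scores (the group names and values are hypothetical, and the tie correction a full implementation would include is omitted for brevity), not a reproduction of our analysis:

```python
import itertools
import numpy as np
from scipy import stats

def dunn_bonferroni(groups):
    """Pairwise Dunn post-hoc z-tests on pooled ranks, with Bonferroni-
    corrected two-sided p-values (tie correction omitted for brevity)."""
    names = list(groups)
    pooled = np.concatenate([groups[n] for n in names])
    ranks = stats.rankdata(pooled)
    N = len(pooled)
    mean_rank, size, start = {}, {}, 0
    for n in names:
        k = len(groups[n])
        mean_rank[n], size[n] = ranks[start:start + k].mean(), k
        start += k
    m = len(names) * (len(names) - 1) // 2   # number of pairwise comparisons
    results = {}
    for a, b in itertools.combinations(names, 2):
        se = np.sqrt(N * (N + 1) / 12 * (1 / size[a] + 1 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = min(1.0, 2 * stats.norm.sf(abs(z)) * m)   # Bonferroni correction
        results[(a, b)] = (z, p)
    return results

rng = np.random.default_rng(0)
# Hypothetical retention-like scores, with "Evolving" shifted upward.
groups = {
    "Control": rng.normal(5.0, 1.0, 68),
    "Basic": rng.normal(5.1, 1.0, 67),
    "Evolving": rng.normal(6.0, 1.0, 67),
}
for pair, (z, p) in dunn_bonferroni(groups).items():
    print(pair, f"Z={z:.3f}, p={p:.3f}")
```

In practice, a maintained implementation such as `posthoc_dunn` from the scikit-posthocs package (with `p_adjust='bonferroni'`) would be preferable to hand-rolled code.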
To further understand our results and their effect sizes, Figure
8 shows the estimation plots for worker retention [
36]. The
Control condition is compared to the other conditions. Based on these plots, we see larger effect sizes for the
Evolving condition of the information finding task, and the
Basic,
Evolving, and
Evolving⊕ Comm conditions for the credibility analysis task.
4.6 Worker Accuracy
A non-parametric Kruskal-Wallis test was performed to investigate whether accuracy differs significantly across the conditions. There were no significant differences found between the conditions for the accuracy of the information finding task (H = 1.287, df = 4, p = 0.864) or the credibility analysis task (H = 4.733, df = 4, p = 0.316).
4.7 Task Execution Time
For the analysis of task execution time, we removed outliers outside the whiskers of the boxplot (i.e., below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR) for both tasks, since these long task execution times could be an artifact of different external factors such as workers completing multiple tasks simultaneously [
29], using different working strategies [
33], a function of their work environments [
24], and so forth. This resulted in 18 outliers being removed from the information finding task across all experimental conditions, and 12 outliers being removed from the credibility analysis task. For the information finding task, this resulted in 64 workers in the
Control condition, 67 workers in
Basic, 62 workers in
Basic⊕ Comm, 62 workers in
Evolving, and 63 workers in
Evolving⊕ Comm. For the credibility task, this was 65, 65, 66, 67, and 65 respectively.
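The whisker-based outlier rule can be sketched as follows; the execution times below are hypothetical values in minutes, chosen only to illustrate the filter:

```python
import numpy as np

def remove_time_outliers(times: np.ndarray) -> np.ndarray:
    """Drop task execution times outside the boxplot whiskers
    (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)."""
    q1, q3 = np.percentile(times, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return times[(times >= lo) & (times <= hi)]

times = np.array([4.2, 5.1, 5.8, 6.0, 6.3, 7.1, 7.4, 42.0])  # minutes; 42.0 is an outlier
print(remove_time_outliers(times))  # the 42.0 entry is removed
```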
A Kruskal-Wallis test was performed to investigate whether there are significant differences in task duration across the conditions. The Kruskal-Wallis test revealed significant differences between the conditions for the information finding task (
H = 15.84,
df = 4,
p = 0.003; cf. Figure
9a) and the credibility analysis task (
H = 36.977,
df = 4,
p <.001; cf. Figure
9b). For the information finding task, the Dunn test with a Bonferroni corrected
p-value showed that workers in the
Evolving condition had a significantly longer task execution time than the
Control condition (
Z = −3.298,
p =.01,
α = 0.05) and the
Basic⊕ Comm condition (
Z = −3.143,
p =.017,
α = 0.05). For the credibility analysis task, the Dunn test with a Bonferroni corrected
p-value showed that workers in the
Control condition had a significantly lower task execution time than workers in the
Basic condition (
Z = −2.863,
p =.042,
α = 0.05),
Basic⊕ Comm condition (
Z = −4.173,
p <.001,
α = 0.05),
Evolving condition (
Z = −5.091,
p <.001,
α = 0.05), and the
Evolving⊕ Comm condition (
Z = −5.207,
p <.001,
α = 0.05).
4.8 Group Identification
A non-parametric Kruskal-Wallis test was performed to investigate whether the GIM score and the connected question differ significantly across the conditions (H3). There were no significant differences found across conditions for the GIM score and the connected question (α = 0.05).
To explore why workers did or did not feel connected to the other crowd workers who worked on the same tasks, and whether this was related to the worker community space, the answers to the open-ended question were manually coded into categories for workers in a condition that included the worker community space. Furthermore, workers were classified based on their responses on the 7-point Likert scale as either not feeling connected (Connected < 4) or feeling connected (Connected > 4); workers with a neutral response (Connected = 4) were thus not assigned to either group. Open coding was used to define different categories based on the open-ended questions of both the credibility task and the information finding task, similar to the methods of a conventional qualitative content analysis [
37]. Some responses could be categorized into two different categories. The open-ended questions from both tasks were categorized using these created categories. Subsequently, a second coder used the same defined categories to categorize roughly half of the data, consisting of the open-ended questions from the credibility task (
n = 136). A substantial inter-annotator agreement was found between the two coders, as measured with Cohen’s Kappa (
κ = 0.744) [
51]. An overview of the description of the categories and the results can be found in the Appendix, Section
B.3.
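As a minimal sketch of the agreement measure, Cohen's kappa between two coders can be computed as below; the category labels are hypothetical stand-ins, not our actual codebook:

```python
from collections import Counter

def cohens_kappa(coder1: list, coder2: list) -> float:
    """Cohen's kappa for two coders' category assignments."""
    n = len(coder1)
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n  # observed agreement
    c1, c2 = Counter(coder1), Counter(coder2)
    p_e = sum(c1[c] / n * c2[c] / n for c in set(c1) | set(c2))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for six responses from two coders:
a = ["no-interaction", "shared-goal", "shared-goal", "feelings", "no-interaction", "avatars"]
b = ["no-interaction", "shared-goal", "feelings", "feelings", "no-interaction", "avatars"]
print(round(cohens_kappa(a, b), 3))  # → 0.778
```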
Information finding tasks. Of all the workers who worked on the information finding task and reported not feeling connected to the other workers (n = 63), most workers (65%, n = 41) did not feel connected because of a lack of direct interaction with other workers. Some workers (13%, n = 8) did not believe that the workers in the worker community space were indeed other workers. A smaller group of workers (6%, n = 4) did not feel connected because of the feelings shown in the worker community space. Of the workers who did feel connected (n = 43), the majority felt connected because they shared a similar goal (28%, n = 12) or because of the feelings on the worker community space (23%, n = 10). A smaller fraction of the workers (9%, n = 4) felt connected due to the avatars in the worker community space.
Credibility analysis tasks. Of all workers from the credibility analysis task who did not feel connected to the other workers (n = 63), most of the workers (76%, n = 48) did not feel connected because there was a lack of interaction with the other workers. They felt like they were completing the tasks on their own. A smaller fraction of the workers did not feel connected because other workers mentioned they felt differently about the task (6%, n = 4), or because the avatar was too basic an instrument to make them feel connected to other workers (6%, n = 4). The majority of the workers who felt connected (n = 50) did so because they all shared the same goal when working on the task (36%, n = 18). Furthermore, some workers (20%, n = 10) felt connected because they saw other workers reporting the same feelings about the task. Of the workers who did feel connected, a few also mentioned a lack of interaction between them and the other workers (14%, n = 7).
4.9 Exploratory Analysis – Group Identification
We did not find an increased sense of group identification for the conditions containing the worker community space (H3). With an aim to further understand group identification in our study, we explored the differences between workers who reported different levels of group identification across all conditions. To do this, we divided the workers into three groups based on their reported GIM scores: low (1 ≤ GIM ≤ 3.5), mid (3.5 < GIM ≤ 4.5), and high (4.5 < GIM ≤ 7). For the information finding task, 104 workers were in the low group, 102 in the mid group, and 130 in the high group. For the credibility analysis task, 112 workers were in the low group, 93 in the mid group, and 135 in the high group.
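The grouping rule can be expressed directly; the boundary handling follows the intervals above (a score of exactly 3.5 falls in the low group, 4.5 in the mid group). The function name is ours, for illustration only:

```python
def gim_group(score: float) -> str:
    """Map a GIM score (1-7) to a group: low (1 <= GIM <= 3.5),
    mid (3.5 < GIM <= 4.5), high (4.5 < GIM <= 7)."""
    if score <= 3.5:
        return "low"
    if score <= 4.5:
        return "mid"
    return "high"

print([gim_group(s) for s in (2.0, 3.5, 4.0, 4.5, 6.2)])
# → ['low', 'low', 'mid', 'mid', 'high']
```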
To analyze how the task duration (
i.e., the execution time) varied between these groups, outliers were removed from both tasks. For the information finding task, 27 outliers were removed in a similar way as described in Section
4.7, resulting in 125 workers in the
high GIM group, 91 workers in the
mid GIM group, and 93 workers in the
low GIM group. For the credibility task, 18 outliers were removed, resulting in 123 workers in the
high GIM group, 91 workers in the
mid GIM group, and 108 workers in the
low GIM group.
4.9.1 Differences Across GIM Groups: Worker Experiences.
As with the experimental conditions, all measurements had at least one group that was not normally distributed based on the Shapiro-Wilk test (
p <.05). Therefore, we performed Kruskal-Wallis tests to investigate the differences in task-related outcomes and worker experience measurements between the different GIM groups. The results of the Kruskal-Wallis tests with all our dependent measurements can be found in Table
1. For the information finding task, we found significant differences between workers with different GIM levels for worker retention, task duration, overall TLX score (and the dimensions of mental demand, physical demand, effort, and frustration), overall IMI score (across all dimensions), and the UES score (across all dimensions). For the credibility task, we found significant differences in the accuracy, overall TLX score (and the dimensions of mental demand, physical demand, and effort), overall IMI score (across all dimensions), and the overall UES score (and the dimensions of FA, AE, and RW).
The results of the Dunn test for the worker experience measures, based on the Bonferroni corrected
p-values, are visualized in Figure
10 (metrics for all tests can be found in the appendix, Table
6 and Table
7). For the information finding task, the workers in the
high GIM group (
Z = 4.708,
p <.001) and the
mid GIM group (
Z = −3.26,
p =.003) had a significantly lower TLX score than the
low GIM group. For the credibility analysis task, the
high GIM group had a significantly higher TLX score than the
low GIM group (
Z = 3.64,
p =.001). For both tasks, workers in the
high GIM group reported a significantly higher IMI score than the
mid GIM group (information finding:
Z = 4.729,
p <.001; credibility analysis:
Z = 3.29,
p =.003) and the
low GIM group (information finding:
Z = 9.023,
p <.001; credibility analysis:
Z = 7.914,
p <.001). Moreover, the
mid GIM group reported significantly higher overall IMI than the
low GIM group (information finding:
Z = −4.029,
p <.001; credibility analysis:
Z = −4.05,
p <.001). For the UES score, workers in the
high GIM group reported significantly higher scores than the
low GIM group for both tasks (information finding:
Z = 7.83,
p <.001; credibility analysis:
Z = 4.813,
p <.001). Moreover, for the information finding task, the
high GIM group reported significantly higher scores than the
mid GIM group (
Z = 3.646,
p =.001), and the
mid GIM group reported significantly higher scores than the
low GIM group (
Z = −3.931,
p <.001).
4.9.2 Differences Across GIM Groups: Task-related Outcomes.
The Dunn test with Bonferroni correction showed that workers in the
high GIM group had significantly higher retention than workers in the
low GIM group for the information finding task (
Z = 2.643,
p =.025; see Figure
11a). Furthermore, the task duration of the
high GIM group was significantly longer than the task duration of the
low GIM group (
Z = 4.162,
p <.001) and the
mid GIM group (
Z = 3.117,
p = .005) for the information finding task (see Figure
11b). For the credibility task, the accuracy of the
high GIM group was significantly lower than the
mid GIM group (
Z = −2.733,
p =.019; see Figure
12).
4.10 Exploratory Analysis – Task Differences
Following our results which revealed differences between the credibility task and the information finding task, an exploratory analysis was carried out to further investigate how these two types of tasks were perceived differently by workers (see Figure
14 in the Appendix). Based on Wilcoxon rank-sum tests, we found that the credibility analysis task had a significantly lower (
p =.018) perceived workload compared to the information finding task, caused by a lower level of frustration (
p <.001) and temporal demand (
p <.001). Furthermore, the credibility analysis task scored higher in intrinsic motivation (
p =.004), caused by greater interest and enjoyment (
p < .001). In line with this, user engagement was greater for the credibility analysis task (
p <.001), caused by greater perceived usefulness (
p <.001), aesthetic appeal (
p <.001), and reward (
p <.001).
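Since the two task types were completed by independent samples of workers, these comparisons correspond to the Wilcoxon rank-sum (Mann-Whitney U) test. A minimal sketch with synthetic per-worker scores (not our data; sample sizes merely mirror our setup):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical per-worker TLX scores for each task type (independent samples).
tlx_info = rng.normal(55, 12, size=336)
tlx_cred = rng.normal(50, 12, size=340)

# Two-sided Wilcoxon rank-sum / Mann-Whitney U test between the two tasks.
U, p = stats.mannwhitneyu(tlx_cred, tlx_info, alternative="two-sided")
print(f"U={U:.0f}, p={p:.4f}")
```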
6 Conclusions
Our first research question investigated the effect of evolving and customizable worker avatars on worker experience and task-related outcomes (RQ1). To address this question, we created a conversational crowdsourcing task where workers were able to customize their worker avatars; as they progressed through the task batches, they unlocked new levels with new features to customize their avatars. We measured task-related outcomes such as worker retention, accuracy, and total task execution time, while worker experience was measured through perceived workload, intrinsic motivation, and subjective engagement. Our results suggest that evolving and customizable worker avatars can increase worker retention.

Our second research question addressed the extent to which the sharing of worker avatars and task-related feelings in a worker community space could foster a sense of group identification among crowd workers (RQ2). We created an interactive worker community space where workers shared their personalized worker avatars along with their feelings about the task. However, the worker community space did not successfully foster an increased sense of group identification among crowd workers, although exploratory findings revealed that this could be a function of individual differences among crowd workers.

With our third research question, we investigated the effect of group identification, induced by the worker community space, on worker experience and task-related outcomes (RQ3). Since the worker community space did not improve group identification among the crowd workers, we conducted an exploratory analysis to investigate the effect of different levels of group identification across all workers on task-related outcomes and worker experience. Our results indicated that workers who identify themselves as crowd workers experience significantly greater perceived workload, intrinsic motivation, and subjective engagement.
Our study contributes to extending the understanding of designing future crowdsourcing tasks. It sheds light on new directions to improve the sustainability of the crowdsourcing paradigm for crowd workers, task requesters, and crowdsourcing platforms.