1 Introduction
Crowdsourcing systems have transformed distributed human problem-solving, enabling large-scale collaborations that were previously infeasible [41]. Quality control, however, remains a persistent challenge, leading to noisy or unusable data [16, 48]. Existing quality control measures such as prescreening crowdworkers [20, 44], refining instructions [24, 47, 81], manipulating incentives [24, 47, 81], and majority vote filtering are designed to optimize economic output: data quality and worker efficiency. Our research explores a subset of crowdsourcing that focuses on community science, or crowdsourced science [74]. Platforms like Zooniverse [84] and FoldIt [46] engage non-professionals in scientific tasks and serve as important means of public engagement and education [74, 90]. Since participants are primarily volunteers, crowdsourced science presents unique quality control challenges: users are primarily motivated by intrinsic interest, learning opportunities, and making a difference, but may be unfamiliar with the domain [68]. Previous work in crowdsourcing has explored the dual objectives of enhancing work quality and the learning experience in crowdsourcing systems by providing feedback to crowdworkers [21, 22, 23, 90, 99]. Yet, these approaches are less scalable because they require additional commitments from either crowdworker peers or external experts [22, 23, 90, 99].
Building on this prior work, we present LabelAId, a real-time inference model for providing just-in-time feedback during crowdsourced labeling to improve data quality and worker expertise. LabelAId is composed of two parts: (1) a novel machine learning (ML) based pipeline for detecting labeling mistakes, which is efficiently trained on unannotated data that contain those very mistakes; and (2) a real-time system that tracks worker behavior and intervenes when an inferred mistake occurs. Unlike previous approaches that improve crowdworkers’ learning experience through peer or expert feedback [23, 99], LabelAId reduces the reliance on human input, leveraging human-AI collaboration to provide targeted feedback that enhances crowdworker performance and domain knowledge.
To study LabelAId in a real crowdsourcing context, we instrumented the open-source crowdsourcing tool Project Sidewalk, where online users virtually explore streetscape imagery to find, label, and assess sidewalk accessibility problems for people with mobility disabilities [80]. Since its launch in 2015, over 13,000 people across the world have used Project Sidewalk to audit 17,000 km of streets across 20 cities in eight countries, including the US, Mexico, Ecuador, Switzerland, New Zealand, and Taiwan, contributing over 1.5 million data points.
Project Sidewalk provides a compelling use case for LabelAId for three reasons. First, unlike traditional image labeling tasks for object detection (e.g., ImageNet [19], COCO [58], Open Images Dataset [30]), crowdworkers are asked to make careful judgments about a labeling target, which requires domain knowledge and training—similar to agricultural image recognition [29], medical imagery labeling [79, 98], and wildlife image categorization [4]. Such labeling tasks reflect a broader trend of crowdwork becoming increasingly complex, domain-specific, and potentially error-prone [48]. Second, as a community science project, Project Sidewalk aligns with the growing emphasis on both educational impact and data quality in crowdsourcing [21, 22, 23, 90, 99], which LabelAId supports. Finally, Project Sidewalk currently employs a common but limited quality control mechanism: users validate labels placed by other users. Since both labelers and validators are drawn from the same user population, repeated errors can pervade the system.
To evaluate LabelAId, we conducted: (1) a technical performance evaluation of LabelAId’s inference model; and (2) a between-subjects user study with 34 participants. For the former, we demonstrate that the LabelAId pipeline consistently outperforms state-of-the-art baselines and can improve mistake inference accuracy by up to 36.7%. With fine-tuning on as few as 50 expert-validated labels, LabelAId outperforms traditional ML models such as XGBoost [17] and Multi-layer Perceptron (MLP) [87] trained on 20 times as many expert-validated labels. Furthermore, we showcase the robust generalizability of our pipeline across different Project Sidewalk deployment cities. Since its initial deployment in Washington D.C., Project Sidewalk has expanded to 20 cities, with ongoing plans for further growth. To support future city deployments, it is important to minimize the labor and configuration overhead of the mistake inference model in new cities. Our study shows that LabelAId, even without fine-tuning, performs comparably in a new city to cities in the pre-training set.
For the between-subjects user study, participants were randomly assigned to one of two conditions: using Project Sidewalk in its original form (control) or using Project Sidewalk with LabelAId (intervention). Our findings reveal that the intervention group achieved significantly higher label precision without sacrificing labeling speed. While participants in both groups improved their understanding of urban accessibility and their confidence in identifying sidewalk problems, those in the intervention group reported that LabelAId helped with decision-making, particularly in situations where they were initially uncertain.
To summarize, our contributions are as follows:
•
A novel ML pipeline that allows for the integration of domain-specific knowledge and heuristics into the data annotation process, which facilitates the training of AI-based inference models for detecting crowdworker labeling mistakes across various contexts, while minimizing the need for manual intervention in downstream tasks.
•
A human-AI (HAI) collaborative system designed to create teachable moments in crowdsourcing workflows. This system not only improves the quality of crowdsourced data, but also enriches the learning experience for participants.
•
A between-subjects user study involving 34 participants with no prior experience using Project Sidewalk, demonstrating that LabelAId significantly improves label precision by 19.2% without compromising efficiency.
While our empirical results focused on the performance of LabelAId within the context of Project Sidewalk, we believe our framework can generalize to other crowdsourcing platforms, as the PWS-based ML pipeline and the two-step intervention design are easily replicable and tailorable to different contexts.
4 LabelAId: Implementation & User Evaluation
Having demonstrated the technical efficacy of our LabelAId system in inferring label correctness, we implemented the LabelAId inference model in Project Sidewalk, and evaluated the user experience and performance of the end-to-end system with users in the loop. Our study aimed to answer the following questions:
• RQ1: Can LabelAId’s feedback improve the performance of minimally-trained crowdworkers in labeling urban accessibility issues compared to a no-feedback condition?
• RQ2: Can LabelAId’s feedback enhance minimally-trained crowdworkers’ self-efficacy and perceived learning when labeling urban accessibility issues compared to a no-feedback condition?
• RQ3: How do participants perceive LabelAId’s feedback in terms of usefulness, content, and frequency?
To address these questions, we designed and conducted a between-subjects study of our LabelAId implementation, described below.
4.1 Implementing LabelAId in Project Sidewalk
To incorporate LabelAId into Project Sidewalk, we needed to integrate a real-time mistake inference model (as described in section 3) and to design and develop a just-in-time UI intervention to warn users of potential labeling mistakes (using said inference model). We first highlight design considerations situated in the literature before describing implementation details.
Design considerations. To design LabelAId’s UI intervention, we first reviewed literature regarding the design space for crowd feedback [22, 23, 60, 90, 99] and guidelines for HAI design [1]. Studies have emphasized the importance of timeliness in feedback delivery [23], which led us to opt for real-time feedback, as it delivers feedback during a teachable moment when people are still thinking about the task. Additionally, the importance of contextual help for learning assistance has been well-documented in psychology literature [2] and demonstrated through HCI work (e.g., [33, 95]). To further refine the user interface, we consulted best practices for dialog design [64], emphasizing specific response options that clearly outline the consequences of each choice, as well as employing progressive disclosure techniques [63] to help users understand the implications of their actions before committing to them [1]. Based on these insights, we iteratively designed LabelAId, starting with hand sketches and Figma mock-ups before implementing the tool in JavaScript (front-end) and Scala with QGIS (back-end).
System implementation. We integrated the city-specific, fine-tuned FT-Transformer into LabelAId using the Open Neural Network Exchange (ONNX) runtime standard. An important objective was to reduce latency and facilitate seamless HAI collaboration. The most time-consuming step in the preparation stage is assessing whether a label belongs to a pre-existing cluster. To expedite this calculation, we simplified it by computing the spatial haversine distance from the input label to pre-computed cluster centroids, using a 10-meter threshold consistent with the clustering algorithm. In offline experiments, we found that this approach was 8-20 times faster (speed varies by label type), and only 1.6% of labels (27 out of 1,659) had a different clustering result.
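To make the simplified cluster check concrete, here is a minimal Python sketch of the distance computation described above; the function and field names are illustrative and do not reflect the actual front-end implementation.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle (haversine) distance in meters between two lat/lng points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lng = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lng / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

CLUSTER_THRESHOLD_M = 10.0  # same 10 m threshold used by the clustering algorithm

def in_existing_cluster(label_lat, label_lng, centroids):
    """Approximate cluster membership by distance to pre-computed centroids,
    avoiding a re-run of the full clustering algorithm at labeling time."""
    return any(
        haversine_m(label_lat, label_lng, c_lat, c_lng) <= CLUSTER_THRESHOLD_M
        for c_lat, c_lng in centroids
    )
```

This approximation trades a small amount of clustering fidelity (the 1.6% disagreement noted above) for substantially faster per-label checks.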
We implemented the inference model on the front-end rather than server-side for the following reasons: (1) Latency: considering the small model size (~100 KB), inference can be performed locally in the user’s browser, thereby avoiding communication with a remote server and network latency. (2) Privacy: we reduced potential user privacy concerns, as no data is transmitted to a remote server for processing. Notably, during the user study, we found an average preparation time of 1.5 ms and an average model inference time of 1.7 ms across various hardware and platforms.
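The deployed system runs the exported model in the user's browser via the ONNX runtime; the sketch below shows an equivalent inference call in Python with onnxruntime, purely for illustration. The model path, feature vector size, and output interpretation are assumptions rather than the actual implementation details.

```python
import numpy as np
import onnxruntime as ort

# Load the exported ONNX model once (path and feature count are illustrative).
session = ort.InferenceSession("labelaid_ft_transformer.onnx")
input_name = session.get_inputs()[0].name

# One row of prepared behavioral/contextual features for the just-placed label.
features = np.zeros((1, 12), dtype=np.float32)  # placeholder feature vector

# Run inference; we assume the first output holds class scores
# (correct vs. mistake) and take the argmax as the predicted class.
outputs = session.run(None, {input_name: features})
mistake_inferred = bool(np.argmax(outputs[0], axis=1)[0])
```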
User flow. Drawing on previous research on crowdworker feedback [23, 39], HAI [1], and UI design [63], we provide a two-stage intervention. After a user places a label, if LabelAId infers a mistake, we pop up a just-in-time intervention dialog (Figure 9A) composed of three parts: a mistake title, a rotating set of labeling tips for that label type (e.g., “Do not label driveways as curb ramps.”; see Figure 9A), and three buttons: “Yes, I am sure,” “No, remove the label,” and “View Common Mistakes.” Hovering over the “i” icon beside the mistake title displays an explanation that the reminder system is powered by AI and may make mistakes. If the user selects “View Common Mistakes,” they enter the second stage, which presents customized information about common mistakes and correct examples for that label type. To minimize users’ cognitive load [8], both the “View Common Mistakes” and “View Correct Examples” screens present a screen capture of the user’s current label alongside three to four example labels, facilitating more straightforward comparison. These example images are curated based on an analysis of frequent mistakes and effective labeling practices on Project Sidewalk. Our user flow (Figure 8) prompts users to reflect on their labeling decisions and then educates them through examples, both of which have been shown to enhance crowdwork quality [23, 99].
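The two-stage flow can be summarized schematically as follows; this is a Python sketch of the control logic only (the actual front-end is JavaScript), and the action strings are illustrative.

```python
def intervention_action(mistake_inferred: bool, user_choice: str) -> str:
    """Return the action taken after a label is placed (schematic only)."""
    if not mistake_inferred:
        return "keep label"  # no intervention
    # Stage 1: just-in-time dialog (mistake title, rotating tip, three buttons)
    if user_choice == "Yes, I am sure":
        return "keep label"
    if user_choice == "No, remove the label":
        return "remove label"
    if user_choice == "View Common Mistakes":
        # Stage 2: common mistakes and correct examples for this label type
        return "show common mistakes and correct examples"
    return "keep label"

# e.g., a user who confirms their judgment keeps the label despite the AI warning
assert intervention_action(True, "Yes, I am sure") == "keep label"
```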
4.2 Study Design
To examine our research questions, we conducted a between-subjects study with and without LabelAId. Inspired by previous Project Sidewalk mapathons, the study sessions were conducted in groups via Zoom based on condition. While this setup differs from traditional crowdsourcing studies conducted on platforms like MTurk or Prolific, mapathons and other synchronous social data collection events are key methods for participant involvement in crowdsourced mapping projects like Project Sidewalk and OpenStreetMap. For example, in Project Sidewalk’s 18-month deployment in Oradell, NJ, two single-day mapathons contributed over 2,056 labels, accounting for 22% of all labels [57].
Prior to the actual study sessions, we conducted pilot studies with one participant per condition, during which two researchers observed the participants’ labeling behaviors in person and screen-recorded the process for later analysis. Based on insights from these pilot studies, we refined the moderation workflow.
For the actual study, two study moderators led six online sessions, three for each condition. Each session had five to seven participants and lasted between 90 and 120 minutes. The sessions were composed of three parts, and the moderator adhered to a script to ensure consistency. First, we provided a brief orientation on urban accessibility and disability, guided participants through platform account registration, and asked them to finish Project Sidewalk’s standard ~5-minute interactive tutorial. Second, participants labeled eight curated routes on Project Sidewalk; the routes were carefully chosen by the research team to ensure they included frequent sidewalk accessibility features and problems. Both groups labeled identical routes. Participants were asked to mute themselves during the labeling tasks, and any questions were addressed privately via Zoom chat or in a breakout room. Although the intervention group had access to correct and incorrect examples through the LabelAId UI flow, both groups were shown illustrated tutorial screens at the beginning of each route, which is the standard Project Sidewalk UI (Figure 11). Furthermore, all participants could refer to these examples as well as the How to Label section on the platform during labeling (Figure 11), a practice we observed in both groups during the pilot studies. Third, after completing their routes, participants filled out a post-study questionnaire followed by a semi-structured group debriefing session. The debriefing sessions were video and audio recorded. Please see the supplementary materials for our orientation slide deck and pre- and post-study questionnaires.
4.3 Participant Recruitment
For our user study, we recruited participants via university mailing lists and snowball sampling. Our study size of 34 participants was determined through a power analysis using G*Power [25], aiming for an effect size of 1 and a statistical power of 0.8. Participants were randomly assigned to either the control or the intervention group depending on their availability. Based on self-reported demographics, we had 21 participants aged 18-24 (12 in the control group), 11 aged 25-34 (5 in the control group), and 2 aged 35-44 (none in the control group); 18 women (10 in the control group), 15 men (6 in the control group), and 1 non-binary individual (1 in the control group). As for computer experience, 2 participants reported having basic skills, 4 had intermediate skills, and 28 considered themselves experts; these numbers were evenly split between the two groups. Before the study session, all participants were required to sign a research consent form and complete a pre-study questionnaire. Each participant was compensated at a rate of $30 per hour for their participation.
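As a rough cross-check of the G*Power calculation, the same sample size can be reproduced in Python with statsmodels, assuming a two-sided two-sample t-test at alpha = 0.05 (the significance level is our assumption, not stated above).

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the required sample size per group given effect size d = 1,
# power = 0.8, and alpha = 0.05 (two-sided).
n_per_group = TTestIndPower().solve_power(effect_size=1.0, power=0.8, alpha=0.05)
print(round(n_per_group))  # roughly 17 per group, i.e., 34 participants in total
```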
4.4 Evaluation Measures
Our study had a dual focus: understanding the objective performance of LabelAId users compared to the baseline and examining their subjective experiences. For our objective measures, we collected and examined:
• Labeling precision. The number of correct labels divided by the total number of labels, measuring the correctness of user input.
• Labeling time. The time participants took to complete the labeling tasks, recorded per route.
• Learning gain in urban accessibility. We designed quiz questions that were included in both the pre- and post-study questionnaires (see supplementary materials). Participants were shown four images for each of the five label types and asked to select the correct ones. A sum score was calculated for each participant: each correct answer earned 1 point, and each incorrect answer was penalized with -1 point.
We also captured subjective measures through 5-point Likert scale questions:
• Confidence in response. e.g., “How confident are you in labeling curb ramps?”
• Self-efficacy gain. e.g., “I feel more confident about identifying problems on sidewalks faced by people with disabilities.”
• Perceived learning gains in urban accessibility. e.g., “Participating in the study gave me more ideas to make sidewalks accessible for people with disabilities.”
• Perceived usefulness. e.g., “I liked the pop-up prompts.”
• Perceived AI intervention. e.g., “I felt that an AI agent was watching my performance/helping me while I was labeling.”
The full list of questions can be found in our supplementary materials.
4.5 Analysis Approach
To analyze our results, two researchers independently validated all participant labels (N=3,574). In cases of disagreement (N=74, IRR=0.98), a third researcher was consulted to reach a consensus. Validations were then used to calculate the precision of user input. For subjective measures captured through Likert scale questions, we mapped responses such as “Strongly disagree” to “Strongly agree” or “Not confident at all” to “Very confident” onto a numerical scale ranging from 1 to 5. We then used descriptive statistics to explore the dataset and to assess participant performance across conditions. Given the between-subjects design and the distribution of the data, we used Mann-Whitney U tests to compare label precision, labeling time, and Likert scale responses between the two groups [75]. Additionally, both the debriefing sessions and the post-study questionnaire included open-ended questions to capture nuanced feedback about perceived learning experience, self-efficacy, and overall user experience. Our analysis of these responses focused on summarizing high-level themes. One researcher developed a set of themes through qualitative open coding [15] based on the video transcripts and the questionnaire responses, then coded the responses according to the themes. Participant quotes have been slightly modified for concision, grammar, and anonymity.
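A minimal sketch of the between-group comparison with SciPy is shown below; the per-participant precision values are illustrative placeholders, not our data.

```python
from scipy.stats import mannwhitneyu

# One overall precision score per participant in each condition (illustrative values).
control_precision = [0.72, 0.81, 0.68, 0.77, 0.74]
intervention_precision = [0.88, 0.91, 0.84, 0.90, 0.86]

# Two-sided Mann-Whitney U test, as used for precision, time, and Likert responses.
u_stat, p_value = mannwhitneyu(control_precision, intervention_precision,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```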
4.6 Results
During the study, participants contributed a total of 3,574 labels, with 2,091 from the control group and 1,483 from the intervention group. A detailed breakdown of the labels’ types and their correctness can be found in Table 6. Our open-coding process highlighted several key themes, as outlined in Table 5. When asked what helped them label, a majority of intervention participants mentioned the pop-up screens. Regarding labeling confidence, they reported that their confidence varied across label types and generally increased as they progressed through the tasks. In terms of future improvements, many suggested implementing AI-assisted labeling followed by human verification. Below, we present an in-depth analysis that integrates both qualitative and quantitative evaluations to address each research question.
4.6.1 Task Performance (RQ1).
We first seek to examine whether there are significant differences between groups in task performance and how intervention level correlates with labeling precision within the intervention group.
Labeling precision and task completion time. As summarized in Figure 10, the intervention group demonstrated higher precision overall and across all label types compared to the control group. The Mann-Whitney U results indicate a significant difference in precision between the two groups both overall (p ≤ 0.01) and for the Curb Ramp (p ≤ 0.05) and Missing Curb Ramp (p ≤ 0.05) label types. For route completion time, we found no significant difference between the two groups (p=0.693). The control group had a mean completion time of 2,303.3 seconds (SD=1,240.3), while the intervention group spent 2,801.4 seconds (SD=2,035.3). Similarly, no significant differences were observed when examining the time taken for each of the eight routes (p-values ranged from 0.143 to 0.971). These findings indicate that the use of LabelAId resulted in improved labeling precision without compromising labeling speed.
Labeling precision and level of intervention. While the intervention group clearly performed better, two pertinent questions arise: how often did LabelAId participants receive a just-in-time AI-assisted prompt, and how accurately did LabelAId perform, i.e., what were the true positive and false positive rates for intervening?
Towards examining the first question: within the intervention group, there were a total of 172 instances where LabelAId intervened with a just-in-time prompt (10.9% of total labels; 10.1 per intervention group participant). When broken down by label type, LabelAId demonstrated high precision in predicting Curb Ramp (0.882), Missing Curb Ramp (0.750), and Missing Sidewalk (1.000) mistakes. However, the model’s precision was notably lower for Obstacle (0.362) and Surface Problem (0.377). Upon closer examination, we found that these less accurate inferences often corresponded with user behaviors that are likely to result in incorrect labels, such as not zooming in or failing to provide severity ratings or tags.
Among the 17 participants in the intervention group, our analysis revealed no significant correlation between the frequency of LabelAId interventions and participants’ labeling precision, either overall or for specific label types. Similarly, the number of times participants viewed the common mistakes or correct examples UI screens did not correlate with their labeling accuracy (Table 10). We return to this point in section 5.
Despite the relatively low view frequency of the "Common Mistakes" UI screens (24 views in total, 1.4 views per person) and correct examples (6 views in total, 0.4 per person), qualitative feedback indicated their usefulness for those who chose to engage with them. During the debriefing sessions, several participants cited these screens when asked about what helped them during the labeling tasks. For instance, one participant noted a shift in their labeling approach after viewing the AI-triggered common mistake screen, stating, “Midway through, I saw the common mistakes, and it totally shifted my perspective. I had been labeling driveways from houses, but the screen clarified that those should not be labeled as curb cuts.”
4.6.2 Self-efficacy & Learning Gains (RQ2).
While the above findings demonstrate users’ improvements in terms of task performance, we are also interested in self-efficacy and learning.
Self-efficacy. In the post-study questionnaire, we asked all participants about their confidence in identifying sidewalk features or problems. On average, participants in the intervention group rated their self-confidence at 4.47 (SD=0.88), compared with 4.53 (SD=0.52) in the control group, with a statistically significant difference for Missing Curb Ramps (Avg=4.6; SD=0.7 vs. Avg=3.8; SD=0.9, p ≤ 0.05), as shown in Table 12. However, when participants were asked if they felt more confident about identifying problems on sidewalks faced by people with disabilities, the difference between groups was not statistically significant (p=0.721, see Q5 in Table 13).
Perceived learning gains. While task performance serves as one indicator of learning outcomes, we also used quizzes to assess objective learning gains and Likert scale questions to measure perceived learning gains. For objective learning gains, the mean improvement between the pre- and post-study quizzes was 1.35 (SD=1.73) for the control group and 1.31 (SD=1.54) for the intervention group, showing only a minor difference between the two. In terms of perceived learning gains, both groups demonstrated an enhanced understanding of curb ramps and accessibility challenges. Although the means were higher for the intervention group across all questions, no statistically significant difference was observed, except for the question, “Participating in the study gave me more ideas to make sidewalks accessible for people with disabilities.”, where the mean score for the control group was 4.35 (SD=0.7), compared to 4.82 (SD=0.53) for the intervention group (p ≤ 0.05).
4.6.3 Perceived Usefulness & Presence of AI (RQ3).
Having explored the overall user performance, confidence and learning gain, we now turn to the perceived usefulness and presence of AI in LabelAId.
Perceived usefulness. Participants generally expressed a favorable view of LabelAId. When asked to what extent they agreed with the statements that the pop-up prompts were helpful and likable, the majority responded with "Somewhat Agree" or "Strongly Agree" (82.35% and 64.7%, respectively). In the post-study questionnaire and debriefing sessions, 11 out of 17 participants in the intervention group specifically cited the pop-up screens from LabelAId as a feature they appreciated or found helpful for labeling tasks. These timely reminders were particularly valued when participants were uncertain about their initial judgments. One participant mentioned: “There were times when I was not sure if I should label it, and the system popped-up for me and said ‘Are you sure about this?’ I found that really helpful.” When asked about whether the prompts were distracting or appeared too frequently, the responses were more mixed—with a relatively even distribution across Likert responses.
Perceived presence of AI. We asked participants whether they felt an AI agent was observing their performance or assisting them during the labeling task and found a statistically significant difference between the two groups. This suggests that the presence of LabelAId had a noticeable impact on participants’ perception of AI involvement. Interestingly, some participants in the control group explicitly expressed a desire for AI assistance. One control-group participant mentioned, “There was a section [in the post-study questionnaire] asking how I felt about AI helping me to label. Honestly, I didn’t notice any AI while I was labeling. It would be super convenient if there was one that could suggest labels and ask me to correct them or provide a confidence level.” This is exactly the intent of LabelAId.
5 Discussion
Through our technical evaluation and user study, we showed how LabelAId improves both labeling data quality and crowdworkers’ domain knowledge. We now situate our findings in related work, highlight key factors behind LabelAId’s success, its limitations, and directions for future research. We also discuss how LabelAId can be generalized to other domains of crowdsourced science.
5.1 Reflecting on LabelAId’s Performance
Below, we reflect on LabelAId’s performance and its relevance to future research, including comparing the differences between AI and human feedback, minimizing overreliance on AI, and striking a balance between constructive feedback and perceived surveillance.
Can AI assistance replicate human-based feedback? Prior work has shown that providing manual feedback to crowdworkers can improve task performance and enhance self-efficacy [22, 23, 60, 90, 99]. Our study further reveals that AI feedback can improve labeling performance, increase participants’ confidence, and enhance their domain knowledge—even with an imperfect ML inference model. While the nuances between human and AI feedback in crowdsourcing have yet to be comprehensively studied, researchers in education have assessed the use of automatic feedback as a learning tool [34, 38, 55, 92]. Findings suggest that automatic feedback can reduce bias and increase consistency in grading [38], free instructors from grading so they can focus on other tasks [92], and allow more students to receive education simultaneously [85]. We believe that these benefits can extend to AI-generated feedback in crowdsourcing systems.
Yet, automated feedback in education contexts has limitations. It excels in grading tasks with clear-cut solutions (e.g., programming questions), but may be challenging to implement in more subjective disciplines [34]. Moreover, automatic graders fail to recognize when students are very close to meeting the criteria, whereas human graders would identify this and assign partial credit accordingly [55]. Future research in crowdsourcing should incorporate these insights from education science when designing AI-based feedback systems, and borrow approaches such as AI feedback combined with human feedback on request [55].
Cognitive forcing functions reduce overreliance on AI. An overarching concern with AI-based assistance—including systems like LabelAId—is that the presence and behavior of AI may actually reduce active cognitive functioning in humans as they defer to the AI’s recommendations, which can then negatively impact overall task performance [42, 52]. For example, [9, 42] showed how users tend to overly depend on AI, following its suggestions even when their own judgment might be superior. Such a tendency is particularly problematic when the AI is inconsistent (e.g., across class categories), as in our case. Recent work has explored cognitive forcing functions [10]—functions that elicit thinking at decision-making time. Because an anchoring bias [32] occurs when users are presented with the AI’s recommendations, one effective strategy is to ask the user to make a decision prior to seeing the AI’s recommendation [10]. Indeed, this is how LabelAId works: it presents suggestions only after the user makes an initial decision and places a label—which may mitigate such bias.
Specifically, in our user study, LabelAId performed particularly poorly for two label types, Obstacle and Surface Problem, with false positive feedback rates of 36.2% and 37.7%, respectively. However, users rejected these suggestions 83% and 73% of the time, respectively, indicating that they preferred their own judgments over the AI’s. Although this design choice was dictated by LabelAId’s model requirements, it encouraged analytical thinking that boosted participants’ confidence in their own decisions. Our study contributes to the broader discourse on HAI, highlighting how system design can elicit analytical reasoning and reduce cognitive biases in decision-making.
Striking a balance between constructive feedback and perceived surveillance. We found a significant difference between the two groups regarding the perceived presence of AI (Section 4.6.3). Out of the 17 participants in the intervention group, eight felt observed and nine felt assisted by an AI agent, while in the control group, none felt observed and only three sensed AI assistance. We speculate that this difference in perceived surveillance also contributed to the intervention group’s better performance, since participants felt their work was being scrutinized. This observation raises questions regarding AI agents as a form of surveillance in crowdsourcing environments. When scholars apply a Foucauldian lens [26] to monitoring technology, some see AI monitoring as social control from existing power hierarchies [12], while others argue it can both restrict and empower individuals [50]. This dichotomy implies that, if well-implemented, AI can encourage self-regulation among crowdworkers. A recent study confirms that digital feedback improves crowdwork outcomes when learning is the primary objective [94], which is often the case in community science crowdsourcing. Therefore, we advocate for crowdsourcing platforms where the AI system strikes a balance between constructive feedback and perceived surveillance.
5.2 LabelAId Limitations and Future Research
We now reflect on LabelAId’s limitations and future work, focusing on designing interactions with imperfect ML models, promoting user agency in mixed-initiative interfaces, improving interaction efficiency in providing learning aids, and expanding participant diversity in future research.
Designing interactions with imperfect ML models. With LabelAId, we were able to determine when a user likely made a mistake, but not the exact source of the error, which limited the types of prompting we could provide. As one participant mentioned: “It’ll be great to provide some rationale or explanation on why there’s a pop-up. Like maybe the location I placed my label is too far away from the obstacle.” Current approaches to offering AI explainability fall into two categories: communicating information about the model’s inferences on a local level (e.g., confidence scores and local feature importance) and communicating information about the model itself on a global level (e.g., model accuracy and global explanations) [51]. However, LabelAId’s current implementation does not incorporate explainability features.
On a global level, we recognize that our implementation could better communicate the model’s varying accuracy levels across different label types. Despite a detailed technical analysis of LabelAId’s performance in section 3, we did not surface accuracy scores or global feature importance to participants. Future iterations should address this shortcoming. On a local level, we intentionally excluded confidence scores. This choice was informed by research indicating that confidence scores have limited impact on improving HAI collaboration [3, 10], coupled with our concern about over-cluttering the already busy UI. Future work may incorporate recent approaches that model the user’s level of confidence and provide adaptive recommendations, i.e., only displaying the AI’s recommendations when the AI’s confidence level is higher than the human’s [59].
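As a hypothetical illustration of that adaptive-display idea (not something LabelAId currently implements), the rule reduces to a simple comparison between the two confidence estimates:

```python
def should_show_recommendation(ai_confidence: float, estimated_user_confidence: float) -> bool:
    """Show the AI's suggestion only when it is more confident than the (estimated) user."""
    return ai_confidence > estimated_user_confidence

# e.g., suppress the prompt when the user appears confident (0.9) and the model is not (0.6)
assert should_show_recommendation(0.6, 0.9) is False
```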
In summary, while our current design decisions were informed by a balance of user cognitive load considerations and technical constraints, future work should explore other methods to provide users with tailored explanations and rationale, enhancing their understanding and interaction with the ML model.
Promoting user agency in mixed-initiative interfaces. Participants had mixed opinions about the frequency of AI interventions, with some finding them distracting. One participant noted, “Sometimes the pop-ups were too frequent, so it might be helpful to give the user the option to disable them.” In addition, we noticed diminishing returns from increased intervention. During the study, there was no significant correlation between the frequency of intervention and task performance (Table 10). One potential explanation is that users understood their mistakes after the first few interventions, thereby making fewer mistakes in subsequent tasks. These findings, consistent with learning science research demonstrating that additional exposure or intervention does not necessarily improve performance (known as the saturation effect [37]), are also supported by ongoing HAI research exploring ways to enhance human agency in mixed-initiative interfaces [1, 51, 82]. In future iterations, we would like to explore offering users overall control to enable or disable the AI, providing adaptive suggestion frequency based on labeling rate, and allowing users to request AI assistance only when needed [10].
Designing efficient UI for learning aids. In addition to the lack of correlation between how often participants viewed example screens and their performance levels (Table 10), we observed that the common mistakes and correct examples screens were viewed only 30 times in total; six of the 17 intervention participants never viewed either screen. This could be due to the interaction cost [7]: the common mistakes screen requires two clicks and the correct examples screen three. While click count alone is not a meaningful metric [53], it is important to minimize interaction costs [7] by making key information easily accessible. Future work should explore effective methods for presenting examples to crowdworkers while they are balancing high cognitive load tasks.
Expanding participant diversity in future research. While our study size of 34 aligns with typical HCI between-subjects studies (e.g., [43, 67]), it is on the lower end for crowdsourcing research [48]. However, our study design facilitated in-depth interviews and focused analysis, allowing us to gather qualitative insights not typical of crowdsourcing studies. Participants were recruited through snowball sampling from the research team’s contacts and university mailing lists, which may not represent the full user base of Project Sidewalk, including disability advocates. In future studies, we aim to enhance the applicability of our findings by expanding our participant base.
5.3 Generalizability to Other Domains
Our study demonstrates the effectiveness of LabelAId in a crowdsourcing tool for urban accessibility, yet its generalizability remains an open question. We believe there are two primary generalizable components:
• LabelAId’s PWS-based ML pipeline. PWS does not require annotated data; it works on a set of LFs generalized from domain knowledge and user behavior. This is particularly useful for crowdsourced community science because it allows organizers to transform their expertise and heuristics into LFs, which can then programmatically label large quantities of data (see the sketch after this list). It is also more cost-effective than traditional ML models, as LabelAId improves inference accuracy by 36.7% with only 50 downstream data points.
• LabelAId’s mistake intervention design. LabelAId’s in-situ intervention design is rooted in literature on crowd feedback and contextual assistance, and aligns with recent HAI research on using cognitive theories to reduce overreliance on AI. Its simple two-step formula can be easily replicated on other platforms.
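To make the PWS pipeline concrete, below is a minimal Python sketch of heuristic labeling functions written in the style of the Snorkel library; the library choice, feature names, and thresholds are illustrative assumptions rather than the exact functions used in LabelAId, though the heuristics mirror behaviors discussed above (zooming, severity ratings, tags, and distance to existing label clusters).

```python
from snorkel.labeling import labeling_function

CORRECT, MISTAKE, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_no_zoom(x):
    # Heuristic: labels placed without zooming in are more likely to be mistakes.
    return MISTAKE if x.zoom_level <= 1 else ABSTAIN

@labeling_function()
def lf_far_from_cluster(x):
    # Heuristic: labels far from any pre-existing label cluster are suspicious.
    return MISTAKE if x.dist_to_nearest_cluster_m > 10 else ABSTAIN

@labeling_function()
def lf_has_severity_and_tags(x):
    # Heuristic: labels with a severity rating and tags tend to be correct.
    return CORRECT if (x.has_severity and x.num_tags > 0) else ABSTAIN
```

In a PWS pipeline, a label model then aggregates the noisy votes from many such functions into probabilistic training labels for the downstream mistake-inference model.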
We believe our technique is most applicable to areas that require domain expertise and contextual understanding, such as medical image labeling [79, 98], galaxy classification [84], and wildlife categorization [4]. For example, the crowdsourcing application iNaturalist uses identification technology and taxonomic experts to assist people in identifying natural species, and it achieves the best results when combined with traditional field guides [88]. We envision these guides and expert knowledge being translated into LFs in our pipeline; with a similar mistake intervention design, LabelAId could help iNaturalist users contribute data more effectively while learning more about biodiversity.