1 Introduction
Technologies such as big data, data mining, and artificial intelligence are promising forces for improving data-driven decision-making, but they also raise several societal concerns. For example, collecting user data through surveillance, without fair compensation and autonomy, has been a common practice in the tech industry [Zuboff 2019]. Similarly, the issue of data quality and its negative impacts often does not receive sufficient attention [Sambasivan et al. 2021]. However, with several recently proposed and enacted regulations in the European Union (EU), such as the General Data Protection Regulation (GDPR), the AI Act, the Data Governance Act (DGA), and the Digital Services Act (DSA), these practices may be about to change. The regulations stress social responsibility, trustworthiness, consumer protection, and users’ privacy, data autonomy, and informational self-determination. Data minimization, fairness, and shareability become features of future-proof good practices [European Commission 2022]. Since these legal and ethical norms cannot be incorporated into data-driven systems after the fact, recent research suggests viewing them as system requirements calling for responsibility by design [Abiteboul and Stoyanovich 2019]. Motivated by this, we address the problem of designing for trustworthy data collection in this article.
Specifically, we study incentive mechanism design for sharing high-quality personally measured fitness data. Instead of automatically collecting data from a tracked device, we preserve the users’ autonomy to measure and report data voluntarily. While some data providers may be intrinsically motivated to measure and report correct data, others may not be willing to do so without incentives. This can lead to low-quality data that is difficult or impossible to improve at a later stage [Mohan and Pearl 2021]. Incentives, in turn, can be manipulated by strategic misreporting, further degrading data quality. Explainable incentive-compatible mechanisms with game-theoretic guarantees [Faltings et al. 2017] are a promising way to address these problems. Using fixed incentives and an incentive-compatible peer-consistency mechanism designed to elicit subjective and unverifiable data by incentivizing truthful data reporting [Goel and Faltings 2019a], we conducted an explorative field experiment. The experiment aimed to explore two research questions: (1) what is the effect of varying incentives on the quality of the reported data, and (2) does our implementation of the incentive-compatible design lead to a good user experience?
The experimental design and analysis focus on measuring and assessing data quality [Heinrich et al. 2018; Madnick et al. 2009]. For the analysis, we used a simple quality difference measure, which enabled a comparison between the fixed incentives and the incentive-compatible mechanism groups. The quality difference measure required a third group as a reference; this third group delivered proxy ground-truth observations. In the specific study context, we find that data quality does not differ between the two incentive groups, while it improves in both groups once extreme outliers are excluded from the analysis. Further, we find that the incentive-compatible mechanism provides a good user experience and compensates fairly. Based on our insights, we discuss specific directions for future studies, such as design improvements when applying the incentive-compatible mechanism.
Data collection is an early but important stage of data governance. If overall responsible-by-design data governance is implemented, individuals are likely to act on the applied incentives and provide trustworthy data. Our research contributes to real-world applications of data-based technologies while informing future research and complementing regulatory requirements.
2 Related Work
The design of robust incentive-compatible mechanisms is a central theme in economics and computation. The mechanisms of interest in the context of this article are mechanisms for information elicitation without verification. The pioneering work in this field is due to Prelec [2004] on the Bayesian Truth Serum and Miller et al. [2005] on the peer prediction method. Much progress [Liu and Chen 2017; Radanovic et al. 2016; Shnayder et al. 2016] has since been made to make these mechanisms suitable for practical use in a variety of scenarios, such as opinion feedback elicitation (e.g., product reviews on e-commerce websites), participatory sensing (e.g., pollution measurements), human computation (e.g., microwork), and so forth. Faltings and Radanovic [2017] provide a comprehensive overview. Much of the work in this field is devoted to theoretical analysis and guarantees, but there have also been a few empirical studies in this area [Faltings et al. 2014; Gao et al. 2014]. These mechanisms work for discrete signals about phenomena that can be observed by multiple agents.
One exception is the Personalized Peer Truth Serum (PPTS) of Goel and Faltings [2019a]. PPTS is a game-theoretic incentive mechanism for eliciting multi-attribute personal measurements from rational agents. Personal measurements concern phenomena that can be observed only personally. To the best of our knowledge, this is the only incentive mechanism in the literature that meets the requirements of our experiment, in which we ask participants to share their physical activity measurements. The mechanism is based on the logarithmic scoring rule [Gneiting and Raftery 2007] and is a peer-consistency mechanism [Faltings and Radanovic 2017]. Peer-consistency mechanisms elicit correct information when there is no way to determine its correctness, which is the case for subjective information related to physical activity.
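For reference, the logarithmic scoring rule evaluates a reported probability distribution $p$ by the log-likelihood it assigns to the realized outcome $x$:

$$S(p, x) = \log p(x).$$

It is strictly proper: an agent maximizes its expected score exactly by reporting its true belief, which is the property that peer-consistency mechanisms build on.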
The idea behind these mechanisms exploits the fact that information is often provided by multiple agents, the peers. A naive example is the output agreement mechanism [Waggoner and Chen 2014]: two agents may be asked to review the same product; if they both provide the same review, they both receive $1, and otherwise they both receive $0. Obviously, this naive mechanism works only under very strong assumptions: a truth-telling equilibrium prevails only if each agent believes that the other agent is most likely to have the same opinion. This basic idea has been improved significantly in the literature, and there are now several mechanisms that make truth-telling an equilibrium even under weaker belief assumptions. In the case of personal measurements such as physical activity data, however, the peer relationship between agents is not clear, because every agent measures and shares data about their own body (unlike products on an e-commerce website, which can be used and experienced by many agents).
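To make the payoff structure concrete, the following is a minimal sketch of the output agreement rule just described; the function name and illustrative reports are ours:

```python
def output_agreement_reward(report_a: str, report_b: str) -> float:
    """Naive output agreement: both peers receive $1 if their reports
    match, and $0 otherwise."""
    return 1.0 if report_a == report_b else 0.0

# Truth-telling is an equilibrium only if each agent believes the other
# is most likely to hold (and report) the same opinion.
print(output_agreement_reward("good", "good"))  # 1.0
print(output_agreement_reward("good", "bad"))   # 0.0
```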
PPTS defines the peer relationship by clustering agents based on similarity in correlated attributes. When agents share data about multiple correlated attributes (say, $X_1, X_2, X_3$), the agents can be clustered based on similarity in the shared data in the other attributes (say, $X_2, X_3$). Then the reward of each agent for the remaining attribute ($X_1$) can be calculated. The score of an agent for an attribute is the ratio of the likelihood of the shared value of the attribute in the cluster of the agent to the likelihood of the shared value of the attribute in the overall population. Informally:

$$\text{score} \;=\; \log \frac{\Pr(\text{shared value} \mid \text{agent's cluster})}{\Pr(\text{shared value} \mid \text{overall population})}.$$

The higher this ratio, the higher the score of the agent. This mechanism is incentive compatible; i.e., a truth-telling equilibrium prevails, and other (non-truthful) equilibria are not more profitable [Goel and Faltings 2019a]. The score outcomes can then be scaled appropriately (as per budget constraints and fairness requirements) to calculate each agent's reward in euros. Depending on the structure of the collected correlated data, clustering may require large samples, which is usually the case in crowdsourcing and big data technologies.
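To illustrate the scoring logic, here is a minimal sketch in the spirit of PPTS, not the exact mechanism of Goel and Faltings [2019a]: it clusters agents on the other attributes with k-means and estimates the likelihoods by Laplace-smoothed empirical bin frequencies. All names, the binning, the smoothing, and the reward scaling are our illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def ppts_style_scores(data: np.ndarray, attr: int,
                      n_clusters: int = 5, n_bins: int = 10) -> np.ndarray:
    """Peer-consistency scores for one attribute, in the spirit of PPTS.

    data -- (n_agents, n_attrs) matrix of reported measurements.
    Agents are clustered on the *other* attributes; an agent's score is
    log(frequency of its reported value in its cluster / frequency overall).
    """
    other = np.delete(data, attr, axis=1)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(other)

    # Discretize the scored attribute and use smoothed empirical
    # bin frequencies as likelihood estimates.
    edges = np.histogram_bin_edges(data[:, attr], bins=n_bins)
    bins = np.digitize(data[:, attr], edges)
    scores = np.empty(len(data))
    for i, (b, c) in enumerate(zip(bins, clusters)):
        peer_bins = bins[clusters == c]
        p_cluster = (np.sum(peer_bins == b) + 1) / (len(peer_bins) + n_bins)
        p_global = (np.sum(bins == b) + 1) / (len(bins) + n_bins)
        scores[i] = np.log(p_cluster / p_global)
    return scores

def scale_to_rewards(scores: np.ndarray, budget: float) -> np.ndarray:
    """Affinely rescale scores to non-negative rewards that exhaust a budget."""
    weights = scores - scores.min()
    return budget * weights / weights.sum()
```

Summing such scores over all attributes per agent and rescaling them under the available budget would yield per-agent rewards in euros.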
3 Experimental Design and Hypotheses
3.1 Framing and Task
We designed a web-based framed field experiment [Harrison and List 2004], which had a potential real-world context, and implemented it using oTree [Chen et al. 2016]. Participants were recruited between November 2020 and June 2021 through the Clickworker platform in Germany. They were presented with a crowdsourcing task aiming to collect fitness data suitable for training artificial intelligence algorithms. The task asked participants to report fitness data generated through a 15- to 20-minute light or moderate outdoor activity. Light outdoor activity was defined as walking, and moderate outdoor activity as an activity that noticeably accelerates the heart rate, like Nordic walking, brisk walking, or light running. The fitness data to be reported contained seven measures correlated by individual physiology: Walk or Run Time, Distance, Average Pace, Fastest Pace, Ascent, Descent, and Energy Burn. The design did not require any personal identifiers or characteristics such as step length or body mass index, and there was no need to automatically track participants’ fitness wearables. Thus, the design respected data regulations.
The study consisted of three easy steps: (1) agreeing to the informed consent, (2) downloading the Walkmeter app to the smartphone and collecting the fitness data, and (3) reporting the data with a chance to win 50 euros. The informed consent highlighted that we did not ask participants to do any outdoor physical activity that they would not usually do; therefore, we did not incentivize the execution of the physical activity itself. As a reference, crowdworking platforms usually require the minimum hourly wage, which at the time amounted to around 9.50 euros. By paying a fixed amount of 1.05 euros, we incentivized only data collection and data reporting, which took about 3 to 5 minutes. In a real-world application, additional factors such as the value of the data would also be considered. The freely available Walkmeter app ensured that all participants used the same means of data collection. Further, we explained to the participants that we decided on the Walkmeter app because it is commonly used and the free version does not require registration with personally identifiable data such as an email address. Thus, we preserved anonymous data collection. To report the fitness data, participants had three days to return to the study website and submit their report. In each treatment group, participants had the chance to win one of 10 lottery prizes of 50 euros, but the chance varied by group. We did not offer the participants any personalized service; hence, they did not benefit directly from measuring and reporting the data correctly and truthfully.
We also collected demographic and post-study feedback data. Demographic data includes information on weight, height, age, and gender. Post-study data explores feedback on the user experience related to the incentive design. For more details on the instructions, see Appendix B.
3.2 Treatments and Hypotheses
The experiment had three treatment groups: the Proxy ground-truth (P), the Fixed incentives (F), and the Quality incentives (Q) groups. In the Quality incentives group, we rewarded participants based on the above-described PPTS mechanism, so that increasing incentives were aligned with increasing quality of the reported data. Depending on their PPTS score over all seven fitness data entries, participants had varying chances to win 50 euros. Using comprehension checks, we made sure that they understood the basic features of the PPTS mechanism (i.e., truthful and accurate data entries increase the quality of the automated peer grouping, and entries score higher if they are more common in their own group than overall). Moreover, we made sure that they understood that they could influence their PPTS scores, so that truthfulness and accuracy of their entries increase their PPTS scores and thus their individual chances of winning. In the Proxy ground-truth and Fixed incentives groups, the chance of winning the 50 euros was fixed (distributed equally among the participants) and independent of the content of the data entries. Comprehension checks made sure that participants understood this. In the Proxy ground-truth group, we additionally required participants to submit a screenshot of the reported data, which served as proof of data correctness. Thus, the Proxy ground-truth group delivered a proxy of the ground-truth data distribution to compare against. In the Fixed incentives group, we made it as salient as possible, through the instructions, that data collection is uncontrolled. We therefore expected to observe the most dishonesty and inaccuracy in this group.
To estimate the assumed fixed effect of quality difference (due to lying or inaccuracy) in the Quality incentives and Fixed incentives groups, we normalize the data in these groups by the outcome of the Proxy ground-truth group and pool the data as a panel. We then hypothesize that the quality difference in the Quality incentives group is below the quality difference in the Fixed incentives group. Since the PPTS mechanism requires clustering of agents, we ideally aimed to recruit 500 participants in each treatment group. In this way, the experimental design, including the treatment groups, the analysis plan, and the hypothesis, is consistently aligned.
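The precise quality difference measure is not restated here; purely to illustrate the normalization idea, the sketch below deviates each report from the Proxy ground-truth group's attribute means. The function name and the choice of normalization are our assumptions, not the measure used in the analysis.

```python
import numpy as np

def quality_difference(group: np.ndarray, proxy: np.ndarray) -> np.ndarray:
    """Illustrative per-report quality difference: absolute deviation from
    the Proxy ground-truth group's mean, normalized attribute-wise.

    group, proxy -- (n_reports, 7) matrices of the seven fitness measures.
    """
    proxy_mean = proxy.mean(axis=0)
    return np.abs(group - proxy_mean) / proxy_mean

# Illustrative hypothesis check: mean quality difference in Q below F.
# q_diff = quality_difference(q_data, p_data).mean()
# f_diff = quality_difference(f_data, p_data).mean()
# print(q_diff < f_diff)
```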
However, recruitment turned out to be very challenging, and we will discuss in Section 5 how to improve this in a future study or real-world application. We ended up recruiting the groups sequentially and recruited only 691 participants altogether, of which 501 were in (Q), 90 in (F), and 100 in (P). Each group was gender-balanced. In the Quality incentives and Proxy ground-truth groups, the participation rate was around 60%, compared to roughly 80% in the Fixed incentives group. We will discuss in Section 5 whether the reason for this difference in the participation rate might be a self-selection bias and how this could be addressed. For an overview of the treatment conditions, see Table 1. For details on the recruitment and participation rates, see Section A.1 in Appendix A, and for the balance table, see Table A.2, also in Appendix A. For further details on how the 50 euros lotteries were paid out in the Quality incentives group at the end of the experiment, see Section A.3 in Appendix A.
5 Discussion
Studies that target the reduction of dishonest behavior [Frank et al. 2017; Hussam et al. 2017; John et al. 2012; Rigol and Roth 2016], e.g., in reporting questionable research practices, show the positive effectiveness of incentive mechanism designs based on peer-consistency methods. Despite their great potential, incentive designs are seldom validated or adapted in the field, and existing field studies rarely get replicated. Underlying algorithms are deemed too complex, so that, e.g., in John et al. [2012] they are not explained in detail. Other barriers are the large sample size required by these big data algorithms and the lack of suitable platforms meeting the recruitment requirements. Conventional crowdsourcing platforms might have established cultures or recruitment difficulties, while online research platforms might have a strong culture of delivering reliable and high-quality data, which would require designing additional incentives to lie [Gneezy et al. 2018] or defaults to nudge certain choices [Baillon et al. 2022]. The present study is unprecedented in its design and was therefore risky to conduct, as the chances were high that it would be a learning-from-failure study. Although we find no significant quality differences between the studied incentive mechanisms, the applied quality-dependent incentives can elicit good data and yield a good user experience. Thus, we provide a first transparent proof of concept as a practical contribution and identify dimensions that need improvement and might cause data quality problems [Madnick et al. 2009]. These dimensions cover topics from sampling through design details to feedback loops and data cleaning.
The recruitment constraints during data collection caused most of the limitations of our study. Even though field experiments have relatively high internal and external validity, their success is limited by the researchers’ connections and recruitment possibilities [Roe and Just 2009]. The conventional crowdsourcing platform that we chose could not deliver the required large sample under pandemic conditions. The resulting unequal sample sizes and the room for inaccuracy left by the design underpower the results. For example, the validity of the screenshots could have been checked before calculating the payoffs. Moreover, extreme outliers could have been excluded by design, e.g., by implementing reasonable ranges and asking participants to cross-check their results for accuracy before submission. The problem of the extreme outliers seems to be rooted in participants’ negligence, inattention, and not taking the task seriously, rather than in not having understood the task. Supporting this, the descriptive statistics on explainability did not change in their composition after excluding extreme outliers.
Low incentives might be another problem. One way to increase incentives would be to implement the incentive design repeatedly and put individual reputation as a “high-quality data provider” at stake in a realistic setting. Feedback on reputation could be used as part of individual progress monitoring or in social comparison. Repeated interaction would also increase the chances for learning effects and reduce the costs of the initial learning invested in understanding the incentive design. Implementation in a real-world setting would also address the challenging problem of recruiting a large number of participants. Moreover, it would allow establishing a new and purposive crowdsourcing culture, aligned with the incentive design and its trustworthy and responsible framing. In contrast, conventional crowdsourcing platforms often already have a disadvantageous culture of inattention, self-misrepresentation, high attrition, social desirability bias, and so forth, which impacts data quality [Agley et al. 2022; Aguinis et al. 2021; Saravanos et al. 2021].
If not caused by recruitment anomalies on the part of the crowdsourcing platform, the reported difference of 25 percentage points in the participation rates provides some evidence for a self-selection bias across the treatment groups. The self-selection might be driven by the varying complexity of the task. Any cognitive overload due to task complexity could be reduced by repeated interaction in a realistic setting, as suggested above. Supporting this suggestion, Weaver and Prelec [2013] found that truthful behavior improved as participants learned the incentive mechanism through repeated interaction, even in the absence of explicit guidance. In another algorithmic context, Biermann et al. [2022] also showed that providing feedback to participants through repeated interaction is more effective than explaining the underlying mechanism. A future experiment may focus on, for example, the behavior of participants in both groups as they learn, through repeated interaction, the incentive compatibility of the peer-consistency mechanism and the absence of incentive compatibility in the case of fixed incentives.
It is worth noting that even though we excluded outliers after and not during data collection, this procedure reduced the impact of unsophisticated cheating and inaccuracy. In order to have enough observations for the peer clustering within a group sample of 500 observations, we initially narrowed the walk time down to a 15- to 20-minute range. This, however, also gave all participants an anchor for applying a heuristic reporting strategy instead of actually completing the task. Thus, in real-world applications, it may be a good idea to use not only individual reputation as an incentive but also additional measures (e.g., including limited ground truth [Goel and Faltings 2019b]) to make sophisticated cheating sufficiently complex and costly to raise the overall trustworthiness of the mechanism design.
In the context of fitness data, Zhou and Zhu [2022] recently showed that presenting calorie-equivalent exercise data effectively nudges consumers toward healthier food choices. Specifically, the intervention is more effective if food labels contain precise rather than rounded exercise data: “For instance, a chocolate bar with a calorie content of approximately 300 kcal may have a 5 km walk shown in its exercise data or, more precisely, a 4.87 km walk or a 5.13 km walk” [Zhou and Zhu 2022]. Thus, data quality matters in our context for the most diverse applications, and its requirements (e.g., range, scale, interpretability, feasibility, acceptability, etc.) need to be defined in advance [Heinrich et al. 2018].
Finally, the post-study survey could additionally ask whether participants were aware that they were in the respective uncontrolled group, and qualitative data could be collected via interviews or focus groups. Such feedback loops could be of key importance in repeated interaction settings in order to monitor how participants learn over time. Learning might also lead to abuse of the mechanism, and thus further adjustment of the incentive design might be necessary. For monitoring continual and personalized adjustment in repeated and adaptive intervention settings, sequential randomized trials have proved less limiting and more advantageous than traditional A/B-testing [NeCamp et al. 2019]. Again, for this purpose, real settings can be more adequate than conventional crowdsourcing platforms. Another improvement might be to explain upfront the motivation behind using the quality-dependent incentives. Motivations are, for example, the consumer protection requirements of the new EU regulations or simply the aim to increase data quality and trustworthiness in the data collection procedure. Moreover, it might be helpful to explain why the mechanism is complex: to calculate incentives, PPTS requires neither proof of the correctness of participants’ data points nor any spot-checking or surveillance of the participants; instead, it compares peers based on correlated data, which makes cheating complex and unnecessary. On top of that, PPTS can also be used for data cleaning, not only for incentives [Goel 2020]. Data cleaning by design may require telling the participants in advance about its procedures if, for example, these procedures may affect their payoffs or behavior.
6 Conclusion
Via a large-scale framed field experiment, we demonstrate and discuss the challenges and opportunities of incentive mechanism design for high-quality data sharing and collection that respects the human-centric European principles of responsible data governance. We implemented and explained the incentive mechanism in a transparent and easily comprehensible way, informing participants about the payoff risks of dishonest behavior and the rewards for honest behavior, and keeping participation anonymous, voluntary, and untracked. We observe that incentive design can be effective in eliciting high-quality data in the context of unverifiable and personally measured fitness (physical activity) data. Moreover, we discussed the pitfalls of a challenging large-scale experiment related to design, recruitment, and analysis. In the post-study survey, participants reported a good user experience; most of them felt properly instructed, with relatively high perceived effectiveness and fairness of the incentive mechanism. An ideal future experiment would include repeated interaction, individual reputation as an additional stake, and a real-world setting by design, in addition to a larger number of participants. Our research thus contributes to improving the design and processes of future experiments and real-world applications that utilize incentive-compatible peer-consistency mechanisms to improve data quality.