DOI: 10.1145/3485447.3512239
Research article
Open access

The Influences of Task Design on Crowdsourced Judgement: A Case Study of Recidivism Risk Evaluation

Published: 25 April 2022

Abstract

Crowdsourcing is widely used to solicit judgement from people in diverse applications, ranging from evaluating information quality to rating gig worker performance. To encourage the crowd to put genuine effort into these judgement tasks, various ways of structuring and organizing the tasks have been explored, yet our understanding of how these task design choices influence the crowd’s judgement is still largely lacking. In this paper, using recidivism risk evaluation as an example, we conduct a randomized experiment to examine the effects of two common designs of crowdsourcing judgement tasks, encouraging the crowd to deliberate and providing feedback to the crowd, on the quality, strictness, and fairness of the crowd’s recidivism risk judgements. Our results show that different designs of the judgement tasks significantly affect the strictness of the crowd’s judgements. Moreover, task designs also have the potential to significantly influence how fairly the crowd judges defendants from different racial groups on those cases where the crowd exhibits substantial in-group bias. Finally, we find that the impacts of task designs on the judgement also vary with the crowd workers’ own characteristics, such as their cognitive reflection levels. Together, these results highlight the importance of obtaining a nuanced understanding of the relationship between task designs and the properties of crowdsourced judgements.
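The abstract does not spell out how the quality, strictness, and fairness of the crowd’s judgements are operationalized, so the sketch below is only an illustration of how such properties are commonly quantified, not the paper’s own analysis. It assumes a hypothetical table of crowd judgements with columns judged_high_risk (0/1), reoffended (0/1 recorded outcome), and race; the false-positive-rate gap used here is one standard fairness measure from the algorithmic fairness literature.

    # Illustrative sketch only (not the paper's code). Assumes a hypothetical
    # table of crowd judgements with columns: judged_high_risk, reoffended, race.
    import pandas as pd

    def strictness(df: pd.DataFrame) -> float:
        """Share of defendants the crowd judged as high risk (stricter = higher)."""
        return df["judged_high_risk"].mean()

    def quality(df: pd.DataFrame) -> float:
        """Agreement between crowd judgements and recorded recidivism outcomes."""
        return (df["judged_high_risk"] == df["reoffended"]).mean()

    def fpr_gap(df: pd.DataFrame, group_a: str, group_b: str) -> float:
        """Difference in false-positive rates between two racial groups, i.e.,
        defendants judged high risk even though they did not reoffend."""
        def fpr(group: str) -> float:
            g = df[(df["race"] == group) & (df["reoffended"] == 0)]
            return g["judged_high_risk"].mean()
        return fpr(group_a) - fpr(group_b)

    # Example usage with hypothetical data:
    # df = pd.read_csv("crowd_judgements.csv")
    # print(strictness(df), quality(df), fpr_gap(df, "group_a", "group_b"))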




        Published In

        WWW '22: Proceedings of the ACM Web Conference 2022
        April 2022
        3764 pages
        ISBN: 9781450390965
        DOI: 10.1145/3485447
        This work is licensed under a Creative Commons Attribution 4.0 International License.


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 25 April 2022


        Author Tags

        1. Crowdsourcing
        2. bias
        3. fairness
        4. quality
        5. task design

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        WWW '22: The ACM Web Conference 2022
        April 25-29, 2022
        Virtual Event, Lyon, France

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


        Cited By

        • On the Impact of Showing Evidence from Peers in Crowdsourced Truthfulness Assessments. ACM Transactions on Information Systems 42, 3 (2024), 1-26. DOI: 10.1145/3637872. Online publication date: 22-Jan-2024.
        • CrowdDC: Ranking From Crowdsourced Paired Comparison With Divide-and-Conquer. IEEE Transactions on Computational Social Systems 11, 2 (2024), 3015-3021. DOI: 10.1109/TCSS.2023.3296632. Online publication date: Apr-2024.
        • How does Value Similarity affect Human Reliance in AI-Assisted Ethical Decision Making? In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 49-57. DOI: 10.1145/3600211.3604709. Online publication date: 8-Aug-2023.
        • Are Two Heads Better Than One in AI-Assisted Decision Making? Comparing the Behavior and Performance of Groups and Individuals in Human-AI Collaborative Recidivism Risk Assessment. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-18. DOI: 10.1145/3544548.3581015. Online publication date: 19-Apr-2023.
        • On the role of human and machine metadata in relevance judgment tasks. Information Processing and Management 60, 2 (2023). DOI: 10.1016/j.ipm.2022.103177. Online publication date: 1-Mar-2023.
