DOI: 10.5555/3625834.3625857

Research article

Testing conventional wisdom (of the crowd)

Published: 31 July 2023

    Abstract

    Do common assumptions about the way that crowd workers make mistakes in microtask (labeling) applications manifest in real crowdsourcing data? Prior work only addresses this question indirectly. Instead, it primarily focuses on designing new label aggregation algorithms, seeming to imply that better performance justifies any additional assumptions. However, empirical evidence in past instances has raised significant challenges to common assumptions. We continue this line of work, using crowdsourcing data itself as directly as possible to interrogate several basic assumptions about workers and tasks. We find strong evidence that the assumption that workers respond correctly to each task with a constant probability, which is common in theoretical work, is implausible in real data. We also illustrate how heterogeneity among tasks and workers can take different forms, which have different implications for the design and evaluation of label aggregation algorithms.
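
    A minimal sketch of the kind of check the abstract alludes to. The "constant probability" assumption treats each worker as answering every task correctly with one fixed probability; one rough way to probe it in real data is to compare a worker's accuracy on easier versus harder tasks, which should not differ systematically if that assumption held. The Python code below is a generic illustration under assumptions made here (the data layout, the median split on task difficulty, and a two-proportion z-test), not the paper's actual analysis.

        import numpy as np
        from scipy.stats import norm

        def two_proportion_z(k1, n1, k2, n2):
            """Two-sided z-test p-value for equality of two Bernoulli proportions."""
            p_pool = (k1 + k2) / (n1 + n2)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
            if se == 0:
                return 1.0
            z = (k1 / n1 - k2 / n2) / se
            return 2 * norm.sf(abs(z))

        def constant_accuracy_pvalues(correct, task_difficulty, min_labels=10):
            """
            correct: dict mapping worker -> list of (task_id, 1 if the worker's
                     label matched the gold label, else 0)  [hypothetical layout]
            task_difficulty: dict mapping task_id -> fraction of workers wrong on it
            Returns, per worker, a p-value for "same accuracy on easy and hard tasks".
            Small p-values across many workers would be evidence against a single
            constant per-worker accuracy.
            """
            median_diff = np.median(list(task_difficulty.values()))
            pvals = {}
            for worker, obs in correct.items():
                easy = [c for t, c in obs if task_difficulty[t] <= median_diff]
                hard = [c for t, c in obs if task_difficulty[t] > median_diff]
                if len(easy) < min_labels or len(hard) < min_labels:
                    continue  # too few labels to say anything about this worker
                pvals[worker] = two_proportion_z(sum(easy), len(easy),
                                                 sum(hard), len(hard))
            return pvals

    Any such split-based test only speaks to one facet of the assumption; heterogeneity could also appear across label categories or over time, which a single easy/hard split would miss.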

    Supplementary Material

    Additional material (3625834.3625857_supp.pdf)
    Supplemental material.

    Published In

    UAI '23: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
    July 2023
    2617 pages

    Publisher

    JMLR.org

    Qualifiers

    • Research-article
    • Research
    • Refereed limited
