DOI: 10.1145/3485447.3512242
Research article
Open access

Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks

Published: 25 April 2022
  Abstract

    When annotators label data, a key metric for quality assurance is inter-annotator agreement (IAA): the extent to which annotators agree on their labels. Though many IAA measures exist for simple categorical and ordinal labeling tasks, relatively little work has considered more complex labeling tasks, such as structured, multi-object, and free-text annotations. Krippendorff’s α, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability, but little work has studied its efficacy and consistency across complex annotation tasks.
    We investigate the design and evaluation of IAA measures for complex annotation tasks, with evaluation spanning seven diverse tasks: image bounding boxes, image keypoints, text sequence tagging, ranked lists, free text translations, numeric vectors, and syntax trees. We identify the difficulty of interpretability and the complexity of choosing a distance function as key obstacles in applying Krippendorff’s α generally across these tasks. We propose two novel, more interpretable measures, showing they yield more consistent IAA measures across tasks and annotation distance functions.
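
    The distance-based formulation of Krippendorff’s α mentioned above replaces exact label matching with a task-specific distance between annotations. As a rough illustrative sketch only (not the paper’s implementation), the Python below computes α as one minus the ratio of observed to expected mean pairwise annotation distance, using a hypothetical 1 − IoU distance for bounding boxes; the full formulation also weights items by their number of annotators, which this sketch omits.

    ```python
    import itertools
    import numpy as np

    def krippendorff_alpha(item_annotations, distance):
        """Distance-based Krippendorff's alpha (simplified sketch).

        item_annotations: dict mapping each item to the list of annotations it received.
        distance: function d(a, b) >= 0 giving the disagreement between two annotations.
        """
        # Observed disagreement: mean distance between annotations of the same item.
        within = [distance(a, b)
                  for anns in item_annotations.values() if len(anns) > 1
                  for a, b in itertools.combinations(anns, 2)]
        # Expected disagreement: mean distance between any two annotations, pooled over items.
        pooled = [a for anns in item_annotations.values() for a in anns]
        across = [distance(a, b) for a, b in itertools.combinations(pooled, 2)]
        return 1.0 - float(np.mean(within)) / float(np.mean(across))

    # Hypothetical distance for image bounding boxes given as (x1, y1, x2, y2):
    # 1 - intersection-over-union, one plausible choice among many.
    def iou_distance(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return (1.0 - inter / union) if union > 0 else 1.0

    # Toy usage: two items, each annotated with a bounding box by three annotators.
    boxes = {
        "img1": [(10, 10, 50, 50), (12, 11, 52, 49), (11, 9, 51, 52)],
        "img2": [(100, 100, 150, 160), (98, 102, 149, 158), (120, 120, 180, 190)],
    }
    print(krippendorff_alpha(boxes, iou_distance))
    ```

    The other tasks studied (keypoints, sequence tags, ranked lists, translations, numeric vectors, syntax trees) would plug in by swapping the distance function, which is precisely the choice the abstract identifies as a key obstacle to applying α generally.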



    Index Terms

    1. Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks
            Index terms have been assigned to the content through auto-classification.

            Published In

            WWW '22: Proceedings of the ACM Web Conference 2022
            April 2022
            3764 pages
            ISBN:9781450390965
            DOI:10.1145/3485447
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            Published: 25 April 2022


            Author Tags

            1. annotation
            2. inter-annotator agreement
            3. labeling
            4. quality assurance

            Qualifiers

            • Research-article
            • Research
            • Refereed limited

            Conference

            WWW '22
            Sponsor: WWW '22: The ACM Web Conference 2022
            April 25 - 29, 2022
            Virtual Event, Lyon, France

            Acceptance Rates

            Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

            Cited By

            • Data Sorting Influence on Short Text Manual Labeling Quality for Hierarchical Classification. Big Data and Cognitive Computing 8(4):41, 7 Apr 2024. DOI: 10.3390/bdcc8040041
            • An entity-centric approach to manage court judgments based on Natural Language Processing. Computer Law & Security Review 52:105904, Apr 2024. DOI: 10.1016/j.clsr.2023.105904
            • Machine Learning-Based Label Quality Assurance for Object Detection Projects in Requirements Engineering. Applied Sciences 13(10):6234, 19 May 2023. DOI: 10.3390/app13106234
            • Toward a generalizable machine learning workflow for neurodegenerative disease staging with focus on neurofibrillary tangles. Acta Neuropathologica Communications 11(1), 18 Dec 2023. DOI: 10.1186/s40478-023-01691-x
            • DICE: a Dataset of Italian Crime Event news. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2985–2995, 19 Jul 2023. DOI: 10.1145/3539618.3591904
            • ToxiSpanSE: An Explainable Toxicity Detection in Code Review Comments. 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–12, 26 Oct 2023. DOI: 10.1109/ESEM56168.2023.10304855
            • Automatic extraction of social determinants of health from medical notes of chronic lower back pain patients. Journal of the American Medical Informatics Association 30(8):1438–1447, 13 May 2023. DOI: 10.1093/jamia/ocad054
