Crowd vs. expert: What can relevance judgment rationales teach us about assessor disagreement?

M Kutlu, T McDonnell, Y Barkallah, T Elsayed… - The 41st International ACM SIGIR Conference on Research & Development in …, 2018 - dl.acm.org
While crowdsourcing offers a low-cost, scalable way to collect relevance judgments, the lack of transparency in remote crowd work has limited understanding of the quality of the collected judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide rationales for each relevance judgment (McDonnell et al., 2016). In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging rather than judging across topics, finding that this improves quality. Overall, we show that crowd judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on the reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs. objective forms of disagreement, assesses the relative importance of each disagreement cause, and presents a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.
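
As an illustration of the kind of check behind the claim that crowd judgments can reliably rank IR systems, a common approach is to score every system twice, once with the official expert qrels and once with the crowd qrels, and then correlate the two induced rankings. The minimal Python sketch below assumes hypothetical per-system nDCG scores and uses Kendall's tau from SciPy; none of the system names or numbers come from the paper or the WebCrowd25k dataset.

# Hypothetical sketch: correlating system rankings induced by expert (TREC)
# qrels vs. crowd-collected qrels. All scores and system names are made up.
from scipy.stats import kendalltau

# Per-system effectiveness (e.g., nDCG@20), computed once per qrel source.
expert_scores = {"sysA": 0.41, "sysB": 0.35, "sysC": 0.52, "sysD": 0.29}
crowd_scores = {"sysA": 0.39, "sysB": 0.36, "sysC": 0.50, "sysD": 0.27}

systems = sorted(expert_scores)  # fixed system order for both score lists
expert = [expert_scores[s] for s in systems]
crowd = [crowd_scores[s] for s in systems]

# Kendall's tau compares the two rankings; values near 1.0 mean the crowd
# judgments order the systems almost exactly as the expert judgments do.
tau, p_value = kendalltau(expert, crowd)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")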