research-article

Automatically Labeling Low Quality Content on Wikipedia By Leveraging Patterns in Editing Behaviors

Published: 18 October 2021

Abstract

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet only 0.6% of Wikipedia articles are rated high quality on Wikipedia's own quality scale, because there are too few editors for the enormous number of articles. Supervised Machine Learning (ML) approaches to quality improvement, which can automatically identify and fix content issues, rely on manual labels of the quality of individual Wikipedia sentences. However, current labeling approaches are tedious and produce noisy labels. Here, we propose an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences, as they stood prior to the edit, as examples that require that semantic improvement. Sentences from the highest-rated articles serve as examples that no longer need semantic improvements. We show that training existing sentence quality classification algorithms on our labels improves their performance compared to training them on existing labels. Our work shows that the editing behaviors of Wikipedia editors provide better labels than those generated by crowdworkers, who lack the context to make judgments that the editors would agree with.
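
The labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Edit` record, the `build_labels` function, and the category strings are hypothetical names introduced here, and the sketch assumes the semantic category of each historic edit has already been identified by some upstream step.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One historic sentence-level edit (hypothetical record layout)."""
    sentence_before: str
    sentence_after: str
    category: str  # assumed to be assigned upstream, e.g. "citation"


def build_labels(
    edits: Iterable[Edit],
    top_rated_sentences: Iterable[str],
    category: str,
) -> List[Tuple[str, int]]:
    """Return (sentence, label) pairs for one semantic category.

    Label 1: the sentence as it stood before an edit of this category,
             so it is treated as needing that improvement.
    Label 0: a sentence from a highest-rated article, treated as no
             longer needing the improvement.
    """
    positives = [(e.sentence_before, 1) for e in edits if e.category == category]
    negatives = [(s, 0) for s in top_rated_sentences]
    return positives + negatives


# Toy data, invented purely for illustration.
toy_edits = [
    Edit("The city is large.", "The city is large.[1]", "citation"),
    Edit("It was big.", "It was 40 km wide.", "clarification"),
]
toy_labels = build_labels(toy_edits, ["A well sourced sentence.[2]"], "citation")
```

Under these assumptions, the resulting pairs could be fed to any sentence-level binary classifier, one classifier per semantic category.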

Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 5, Issue CSCW2
October 2021
5376 pages
EISSN: 2573-0142
DOI: 10.1145/3493286
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021
Published in PACMHCI Volume 5, Issue CSCW2

Author Tags

  1. content labeling
  2. machine learning
  3. wikipedia

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 43
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 15 Oct 2024

Cited By

  • (2024) Providing Citations to Support Fact-Checking: Contextualizing Detection of Sentences Needing Citation on Small Wikipedias. Natural Language Processing Journal 8 (100093). DOI: 10.1016/j.nlp.2024.100093. Online publication date: Sep-2024
  • (2023) Automatic Quality Assessment of Wikipedia Articles—A Systematic Literature Review. ACM Computing Surveys 56:4 (1-37). DOI: 10.1145/3625286. Online publication date: 10-Nov-2023
  • (2023) Interpretable Classification of Wiki-Review Streams. IEEE Access 11 (141137-141151). DOI: 10.1109/ACCESS.2023.3342472. Online publication date: 2023
  • (2022) Hint: harnessing the wisdom of crowds for handling multi-phase tasks. Neural Computing and Applications 35:31 (22911-22933). DOI: 10.1007/s00521-021-06825-7. Online publication date: 17-Jan-2022
