Detecting Undisclosed Paid Editing in Wikipedia

Authors:

Francesca Spezzano,

Elijah Hill

WWW '20: Proceedings of The Web Conference 2020

Pages 2899 - 2905

https://doi.org/10.1145/3366423.3380055

Published: 20 April 2020

Abstract

Wikipedia, the free, openly collaborative online encyclopedia, has millions of pages maintained by thousands of volunteer editors. Under Wikipedia’s fundamental principles, pages are written from a neutral point of view and maintained by unpaid volunteers, who follow well-defined guidelines to avoid or disclose any conflict of interest. However, there have been several known incidents in which editors intentionally violated these guidelines to get paid (or even to extort money) for maintaining promotional spam articles without disclosing the payment.

In this paper, we address for the first time the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework that uses features drawn from both the content of the articles and the edit-history patterns of the users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring system currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles. Finally, we show that our user-based features can also detect undisclosed paid editors with an AUROC of 0.94 and an average precision of 0.92, outperforming existing approaches.
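
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how AUROC and average precision can be computed from classifier scores. It is not the authors' evaluation code: the function names, the toy labels, and the toy scores are hypothetical and chosen only for illustration, and the average-precision routine assumes that no two scores are tied.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over every rank k that holds a positive example.

    Assumes distinct scores; with ties the result depends on sort order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / n_pos


# Hypothetical toy data: 1 = undisclosed paid article, 0 = benign article.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

print(f"AUROC: {auroc(toy_labels, toy_scores):.3f}")                       # 0.778
print(f"Average precision: {average_precision(toy_labels, toy_scores):.3f}")  # 0.806
```

AUROC summarizes ranking quality over all positive/negative pairs, while average precision weights the top of the ranking more heavily, which is why the two are often reported together when the positive class is comparatively rare.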

Published In

WWW '20: Proceedings of The Web Conference 2020

April 2020

3143 pages

ISBN:9781450370233

DOI:10.1145/3366423

Editors:

Yennun Huang, Academia Sinica, Taiwan
Irwin King, The Chinese University of Hong Kong, Hong Kong
Tie-Yan Liu, Microsoft Research Asia, China
Maarten van Steen, University of Twente, Netherlands

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '20

WWW '20: The Web Conference 2020

April 20 - 24, 2020

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
