Autonomous link spam detection in purely collaborative environments

Authors:

Andrew G. West,

Avantika Agrawal,

Brittney Exline,

Insup Lee

WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration

Pages 91 - 100

https://doi.org/10.1145/2038558.2038574

Published: 03 October 2011

Abstract

Collaborative models (e.g., wikis) are an increasingly prevalent Web technology. However, the open access that defines such systems can also be exploited for nefarious purposes. In particular, this paper examines the use of collaborative functionality to add inappropriate hyperlinks to destinations outside the host environment (i.e., link spam). The collaborative encyclopedia Wikipedia is the basis for our analysis.

Recent research has exposed vulnerabilities in Wikipedia's link spam mitigation, finding that human editors are slow to respond and dwindling in number. In response, we propose and develop an autonomous classifier for link additions. Such a system presents unique challenges. For example, low barriers-to-entry invite a diversity of spam types, not just those with economic motivations. Moreover, issues can arise with how a link is presented (regardless of the destination).

In this work, a spam corpus is extracted from over 235,000 link additions to English Wikipedia. From this corpus, 40+ features are codified and analyzed. These features are computed using wiki metadata, landing site analysis, and external data sources. The resulting classifier attains 64% recall at a 0.5% false-positive rate (ROC-AUC = 0.97). Such performance could enable egregious link additions to be blocked automatically with few false positives, while prioritizing the remainder for human inspection. Finally, a live Wikipedia implementation of the technique has been developed.
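
As a hedged illustration of how an operating point such as "recall at a fixed false-positive rate" and a ROC-AUC value can be read off classifier output, the sketch below computes both from a list of scores and labels. It is not the paper's implementation: the function names (`roc_points`, `recall_at_fpr`, `roc_auc`) and the toy scores are assumptions invented for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (false-positive rate, recall)


def roc_points(scores: List[float], labels: List[int]) -> List[Point]:
    """Return one (FPR, recall) point per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one spam and one non-spam example")
    # Rank link additions from most to least spam-like.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def recall_at_fpr(points: List[Point], max_fpr: float) -> float:
    """Highest recall among operating points whose FPR is at most max_fpr."""
    return max(recall for fpr, recall in points if fpr <= max_fpr)


def roc_auc(points: List[Point]) -> float:
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data, invented for illustration only (1 = spam, 0 = legitimate link).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_points = roc_points(toy_scores, toy_labels)

print("recall at 0% false positives:", recall_at_fpr(toy_points, 0.0))
print("ROC-AUC:", roc_auc(toy_points))
```

Under this reading, a threshold chosen at the 0.5% false-positive point would mark the link additions that could be blocked automatically, and the remaining scores would order the queue for human inspection.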

Published In

WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration

October 2011

245 pages

ISBN:9781450309097

DOI:10.1145/2038558

Conference Chair: Felipe Ortega (University Rey Juan Carlos, Madrid, Spain)

Program Chair: Andrea Forte (Drexel University, Philadelphia)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

TJEF: The John Ernest Foundation

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Office of Naval Research

Conference

WikiSym '11: The 7th International Symposium on Wikis and Open Collaboration

October 3 - 5, 2011

Mountain View, California

Acceptance Rates

Overall Acceptance Rate 69 of 145 submissions, 48%

Bibliometrics

Total citations: 5

Total downloads: 178 (last 12 months: 1; last 6 weeks: 0), reflecting downloads up to 09 Nov 2024

Cited By

Lin Z, Li Z, Liao X, Wang X, Liu X (2024). MAWSEO: Adversarial Wiki Search Poisoning for Illicit Online Promotion. 2024 IEEE Symposium on Security and Privacy (SP), pages 388-406. Online publication date: 19-May-2024. https://doi.org/10.1109/SP54263.2024.00049

Joshi N, Spezzano F, Green M, Hill E (2020). Detecting Undisclosed Paid Editing in Wikipedia. Proceedings of The Web Conference 2020, pages 2899-2905. Online publication date: 20-Apr-2020. https://dl.acm.org/doi/10.1145/3366423.3380055

Spezzano F, Suyehira K, Gundala L (2019). Detecting pages to protect in Wikipedia across multiple languages. Social Network Analysis and Mining, 9:1. Online publication date: 14-Mar-2019. https://doi.org/10.1007/s13278-019-0555-0

West A, Mohaisen A (2014). Metadata-Driven Threat Classification of Network Endpoints Appearing in Malware. Detection of Intrusions and Malware, and Vulnerability Assessment, pages 152-171. Online publication date: 2014. https://doi.org/10.1007/978-3-319-08509-8_9

West A, Hayati P, Potdar V, Lee I (2012). Spamming for science. Proceedings of the 16th international conference on Financial Cryptography and Data Security, pages 98-111. Online publication date: 2-Mar-2012. https://dl.acm.org/doi/10.1007/978-3-642-34638-5_9

West A, Chang J, Venkatasubramanian K, Sokolsky O, Lee I, Potdar V (2011). Link spamming Wikipedia for profit. Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, pages 152-161. Online publication date: 1-Sep-2011. https://dl.acm.org/doi/10.1145/2030376.2030394
