DOI: 10.1145/2038558.2038574
research-article

Autonomous link spam detection in purely collaborative environments

Published: 03 October 2011

Abstract

Collaborative models (e.g., wikis) are an increasingly prevalent Web technology. However, the open access that defines such systems can also be exploited for nefarious purposes. In particular, this paper examines the use of collaborative functionality to add inappropriate hyperlinks to destinations outside the host environment (i.e., link spam). The collaborative encyclopedia Wikipedia is the basis for our analysis.
Recent research has exposed vulnerabilities in Wikipedia's link spam mitigation, finding that human editors are slow to respond and dwindling in number. To this end, we propose and develop an autonomous classifier for link additions. Such a system presents unique challenges. For example, low barriers to entry invite a diversity of spam types, not just those with economic motivations. Moreover, issues can arise with how a link is presented (regardless of the destination).
In this work, a spam corpus is extracted from over 235,000 link additions to English Wikipedia. From this, 40+ features are codified and analyzed. These indicators are computed using wiki metadata, landing site analysis, and external data sources. The resulting classifier attains 64% recall at a 0.5% false-positive rate (ROC-AUC = 0.97). Such performance could enable egregious link additions to be blocked automatically with low false-positive rates, while prioritizing the remainder for human inspection. Finally, a live Wikipedia implementation of the technique has been developed.
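
The metrics quoted above (recall at a fixed false-positive rate, and ROC-AUC) and the block-or-prioritize policy can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy scores, and the blocking threshold are all hypothetical, and it assumes only that a classifier emits one spam score per link addition, with higher meaning more likely spam.

```python
from typing import Dict, List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false-positive rate, recall) pairs as the threshold is lowered.

    labels use 1 for spam and 0 for non-spam (an assumed encoding).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one spam and one non-spam example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume tied scores together so ties yield a single ROC point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def roc_auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def recall_at_fpr(points: Sequence[Tuple[float, float]], max_fpr: float) -> float:
    """Best recall among operating points whose false-positive rate is within budget."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


def triage(scores_by_link: Dict[str, float], block_threshold: float) -> Tuple[List[str], List[str]]:
    """Split links into auto-blocked ones and a review queue, most suspicious first."""
    blocked = sorted(k for k, s in scores_by_link.items() if s >= block_threshold)
    rest = [k for k, s in scores_by_link.items() if s < block_threshold]
    queue = sorted(rest, key=lambda k: scores_by_link[k], reverse=True)
    return blocked, queue


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]
    pts = roc_points(scores, labels)
    print("AUC:", roc_auc(pts))
    print("recall at FPR <= 0.25:", recall_at_fpr(pts, 0.25))
    print(triage({"a": 0.95, "b": 0.5, "c": 0.7}, block_threshold=0.9))
```

Under this reading, a result such as "64% recall at 0.5% false positives" corresponds to calling recall_at_fpr with max_fpr=0.005 on held-out data, and the blocking threshold in triage would be chosen as the score at that operating point.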

Published In

WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration
October 2011
245 pages
ISBN:9781450309097
DOI:10.1145/2038558
  • Conference Chair: Felipe Ortega
  • Program Chair: Andrea Forte
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

  • TJEF: The John Ernest Foundation

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Wikipedia
  2. collaboration
  3. collaborative security
  4. information security
  5. intelligent routing
  6. link spam
  7. machine-learning
  8. reputation
  9. spam mitigation
  10. spatio-temporal features

Qualifiers

  • Research-article

Conference

WikiSym '11
Acceptance Rates

Overall Acceptance Rate 69 of 145 submissions, 48%

Cited By

  • (2024) MAWSEO: Adversarial Wiki Search Poisoning for Illicit Online Promotion. 2024 IEEE Symposium on Security and Privacy (SP), pages 388-406. DOI: 10.1109/SP54263.2024.00049. Online publication date: 19-May-2024
  • (2020) Detecting Undisclosed Paid Editing in Wikipedia. Proceedings of The Web Conference 2020, pages 2899-2905. DOI: 10.1145/3366423.3380055. Online publication date: 20-Apr-2020
  • (2019) Detecting pages to protect in Wikipedia across multiple languages. Social Network Analysis and Mining, 9:1. DOI: 10.1007/s13278-019-0555-0. Online publication date: 14-Mar-2019
  • (2014) Metadata-Driven Threat Classification of Network Endpoints Appearing in Malware. Detection of Intrusions and Malware, and Vulnerability Assessment, pages 152-171. DOI: 10.1007/978-3-319-08509-8_9. Online publication date: 2014
  • (2012) Spamming for science. Proceedings of the 16th international conference on Financial Cryptography and Data Security, pages 98-111. DOI: 10.1007/978-3-642-34638-5_9. Online publication date: 2-Mar-2012
  • (2011) Link spamming Wikipedia for profit. Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, pages 152-161. DOI: 10.1145/2030376.2030394. Online publication date: 1-Sep-2011
