DOI: 10.1145/1531914.1531921
research-article

An empirical study on selective sampling in active learning for splog detection

Published: 21 April 2009

Abstract

This paper studies how to reduce the amount of human supervision needed to distinguish splogs from authentic blogs when splog data sets are continuously updated year by year. Following previous work on active learning, it empirically examines several strategies for selective sampling in active learning with Support Vector Machines (SVMs) on the task of splog / authentic blog detection. As a confidence measure for SVM learning, we employ the distance from the separating hyperplane to each test instance, which has been well studied in active learning for text classification. Unlike the results reported for text classification tasks, in the splog / authentic blog detection task of this paper, adding the least confident samples does not perform best.
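The confidence measure described above can be illustrated with a minimal sketch. It assumes a linear decision function with hypothetical weights `w` and bias `b`, and it compares three selection strategies; only "least confident" is named in the abstract, so `most_confident` and `random` are assumptions added for contrast, and the helper names are illustrative rather than the authors' implementation.

```python
import math
import random


def hyperplane_distance(w, b, x):
    """Unsigned distance from instance x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / norm


def select_samples(distances, k, strategy="least_confident", seed=0):
    """Return indices of k unlabeled instances to send for human labeling.

    A small distance means low confidence (the instance lies near the
    separating hyperplane). Strategy names other than "least_confident"
    are assumptions made for this sketch.
    """
    indices = list(range(len(distances)))
    if strategy == "least_confident":
        indices.sort(key=lambda i: distances[i])
    elif strategy == "most_confident":
        indices.sort(key=lambda i: -distances[i])
    elif strategy == "random":
        random.Random(seed).shuffle(indices)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return indices[:k]


# Toy linear model and unlabeled pool (illustrative values only).
w, b = [3.0, 4.0], -5.0
pool = [[1.0, 1.0], [3.0, 4.0], [0.0, 1.0], [1.0, 0.5]]
distances = [hyperplane_distance(w, b, x) for x in pool]

least = select_samples(distances, 2, "least_confident")
most = select_samples(distances, 2, "most_confident")
```

In an active learning loop, the selected instances would be labeled by a human, added to the training set, and the SVM retrained before the next round of selection. The abstract's finding is that, for splog detection, the least confident choice was not the best-performing one.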

Published In

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN: 9781605584386
DOI: 10.1145/1531914

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

1. SVM
2. active learning
3. selective sampling
4. spam blog detection

Qualifiers

• Research-article

Conference

AIRWeb '09

Cited By

• (2017) Challenges in the Analysis of Online Social Networks. Wireless Personal Communications: An International Journal 97:3 (4015-4061). DOI: 10.1007/s11277-017-4712-3. Online publication date: 1-Dec-2017
• (2012) Detection Splog Algorithm Based on Features Relation Tree. Proceedings of the 2012 Ninth Web Information Systems and Applications Conference (99-102). DOI: 10.1109/WISA.2012.39. Online publication date: 16-Nov-2012
• (2012) Spammer Behavior Analysis and Detection in User Generated Content on Social Networks. Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems (305-314). DOI: 10.1109/ICDCS.2012.40. Online publication date: 18-Jun-2012
• (2011) Dynamic Splog Filtering Algorithm Based on Combinational Features. Proceedings of the 2011 Eighth Web Information Systems and Applications Conference (82-85). DOI: 10.1109/WISA.2011.23. Online publication date: 21-Oct-2011
• (2011) Spam, Opinions, and Other Relationships: Towards a Comprehensive View of the Web Knowledge Discovery. Advanced Topics in Information Retrieval (51-82). DOI: 10.1007/978-3-642-20946-8_3. Online publication date: 2011
• (2010) Detecting splogs using similarities of splog HTML structures. Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication (1-8). DOI: 10.1145/2108616.2108661. Online publication date: 14-Jan-2010
• (2010) Proliferation and Detection of Blog Spam. IEEE Security and Privacy 8:5 (42-47). DOI: 10.1109/MSP.2010.113. Online publication date: 1-Sep-2010
