DOI: 10.1145/2487788.2488140
research-article

Automatically generated spam detection based on sentence-level topic information

Published: 13 May 2013

Abstract

Spammers use a wide range of content generation techniques to produce low-quality pages, known as content spam, in pursuit of their goals. We argue that content spam must be tackled with a correspondingly wide range of content quality features. In this paper, we propose novel sentence-level diversity features based on a probabilistic topic model and combine them with other content features to build a content spam classifier. Our experiments show that this method outperforms conventional methods.




      Published In

      WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web
      May 2013
      1636 pages
      ISBN: 9781450320382
      DOI: 10.1145/2487788

      Sponsors

      • NICBR: Núcleo de Informação e Coordenação do Ponto BR
      • CGIBR: Comitê Gestor da Internet no Brasil

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. spam detection
      2. spam feature
      3. topic model

      Qualifiers

      • Research-article

      Conference

      WWW '13
      Sponsor:
      • NICBR
      • CGIBR
      WWW '13: 22nd International World Wide Web Conference
      May 13 - 17, 2013
      Rio de Janeiro, Brazil

      Acceptance Rates

      WWW '13 Companion Paper Acceptance Rate 831 of 1,250 submissions, 66%;
      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Article Metrics

      • Downloads (Last 12 months): 2
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 08 Feb 2025

      Cited By

      • (2022) A Study on Diverse Methods and Performance Measures in Sentiment Analysis. Recent Patents on Engineering 16:3. DOI: 10.2174/1872212114999201019154954. Online publication date: May-2022
      • (2022) Pseudo Base Station Spam SMS Identification Based on BiLSTM-Attention. 2022 11th International Conference on Communications, Circuits and Systems (ICCCAS), pages 216-219. DOI: 10.1109/ICCCAS55266.2022.9825128. Online publication date: 13-May-2022
      • (2021) Korean Erroneous Sentence Classification With Integrated Eojeol Embedding. IEEE Access 9, pages 81778-81785. DOI: 10.1109/ACCESS.2021.3085864. Online publication date: 2021
      • (2019) An effective feature selection method for web spam detection. Knowledge-Based Systems 166, pages 198-206. DOI: 10.1016/j.knosys.2018.12.026. Online publication date: Feb-2019
      • (2018) Two time-efficient gibbs sampling inference algorithms for biterm topic model. Applied Intelligence 48:3, pages 730-754. DOI: 10.1007/s10489-017-1004-2. Online publication date: 1-Mar-2018
      • (2017) Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection. Database and Expert Systems Applications, pages 232-246. DOI: 10.1007/978-3-319-64471-4_19. Online publication date: 2-Aug-2017
      • (2015) Exploiting latent content based features for the detection of static SMS spams. Proceedings of the American Society for Information Science and Technology 51:1, pages 1-4. DOI: 10.1002/meet.2014.14505101157. Online publication date: 24-Apr-2015
      • (2013) How Many Zombies Around You? 2013 IEEE 13th International Conference on Data Mining, pages 1133-1138. DOI: 10.1109/ICDM.2013.166. Online publication date: Dec-2013
