Diversionary Comments under Blog Posts

Published: 24 September 2015

Abstract

There has been a recent swell of interest in the analysis of blog comments. However, much of the work focuses on detecting comment spam in the blogosphere. An important issue that has been neglected so far is the identification of diversionary comments. Diversionary comments are defined as comments that divert the topic from the original post. A possible purpose is to distract readers from the original topic and draw attention to a new topic. We categorize diversionary comments into five types based on our observations and propose an effective framework to identify and flag them. To the best of our knowledge, the problem of detecting diversionary comments has not been studied so far. We solve the problem in two different ways: (i) rank all comments in descending order of how diversionary they are and (ii) treat it as a classification problem. Our evaluation on 4,179 comments under 40 different blog posts from Digg and Reddit shows that the proposed method achieves a high mean average precision of 91.9% when the problem is treated as a ranking problem and an F-measure of 84.9% when it is treated as a classification problem. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
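
The two evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the standard definitions of mean average precision (for the ranking formulation) and F-measure (for the classification formulation). The function names and the toy labels are hypothetical, with 1 marking a diversionary comment and 0 a non-diversionary one.

```python
from typing import List, Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision of one ranked comment list (1 = diversionary)."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank of each diversionary comment.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[Sequence[int]]) -> float:
    """Mean of the per-post average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def f_measure(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two posts, each with its comments ranked from
# most to least diversionary by some scoring method.
toy_rankings = [[1, 0, 1, 0], [0, 1]]
toy_map = mean_average_precision(toy_rankings)

# Hypothetical binary predictions against gold labels for four comments.
toy_f = f_measure(predicted=[1, 1, 0, 0], actual=[1, 0, 1, 0])
```

In the ranking setting, average precision is computed per blog post and then averaged over posts, so a method is rewarded for placing diversionary comments near the top of each post's list.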

Published In

ACM Transactions on the Web, Volume 9, Issue 4
October 2015
114 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2830542
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 September 2015
Accepted: 01 June 2015
Revised: 01 March 2015
Received: 01 July 2013
Published in TWEB Volume 9, Issue 4

Author Tags

  1. Diversionary comments
  2. classification
  3. coreference resolution
  4. extraction from Wikipedia
  5. hierarchical Dirichlet process
  6. latent Dirichlet allocation
  7. ranking
  8. spam
  9. topic model

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Pinnacle Lab at Singapore Management University
  • Google Research Award
  • National Science Foundation

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 15 Oct 2024

Cited By

  • (2023) An Informative Path Planning Framework for Active Learning in UAV-Based Semantic Mapping. IEEE Transactions on Robotics 39:6 (4279-4296). DOI: 10.1109/TRO.2023.3313811. Online publication date: 1-Dec-2023
  • (2023) Contrastive Active Learning Under Class Distribution Mismatch. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:4 (4260-4273). DOI: 10.1109/TPAMI.2022.3188807. Online publication date: 1-Apr-2023
  • (2022) A clustering-based active learning method to query informative and representative samples. Applied Intelligence 52:11 (13250-13267). DOI: 10.1007/s10489-021-03139-y. Online publication date: 1-Sep-2022
  • (2022) Combating Label Distribution Shift for Active Domain Adaptation. Computer Vision – ECCV 2022 (549-566). DOI: 10.1007/978-3-031-19827-4_32. Online publication date: 23-Oct-2022
  • (2020) Defining Social Media…It’s Complicated. Designing the Social (15-43). DOI: 10.1007/978-981-15-5716-3_2. Online publication date: 12-Jun-2020
  • (2019) Characterizing the Behavioral Evolution of Twitter Users and The Truth Behind the 90-9-1 Rule. Companion Proceedings of The 2019 World Wide Web Conference (1035-1038). DOI: 10.1145/3308560.3316705. Online publication date: 13-May-2019
  • (2019) Content Similarity Analysis of Written Comments under Posts in Social Media. 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS) (158-165). DOI: 10.1109/SNAMS.2019.8931726. Online publication date: Oct-2019
  • (2017) On Obstructing Obscenity Obfuscation. ACM Transactions on the Web 11:2 (1-24). DOI: 10.1145/3032963. Online publication date: 24-Apr-2017
