Diversionary Comments under Blog Posts

Authors:

Weiyi Meng

ACM Transactions on the Web (TWEB), Volume 9, Issue 4

Article No.: 18, Pages 1 - 34

https://doi.org/10.1145/2789211

Published: 24 September 2015

Abstract

There has been a recent swell of interest in the analysis of blog comments. However, much of the work focuses on detecting comment spam in the blogosphere. An important issue that has been neglected so far is the identification of diversionary comments. Diversionary comments are defined as comments that divert the topic from the original post. A possible purpose is to distract readers from the original topic and draw attention to a new topic. We categorize diversionary comments into five types based on our observations and propose an effective framework to identify and flag them. To the best of our knowledge, the problem of detecting diversionary comments has not been studied before. We solve the problem in two different ways: (i) by ranking all comments in descending order of how likely they are to be diversionary and (ii) by treating it as a classification problem. Our evaluation on 4,179 comments under 40 different blog posts from Digg and Reddit shows that the proposed method achieves a mean average precision of 91.9% when the problem is treated as a ranking problem and an F-measure of 84.9% when it is treated as a classification problem. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
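The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions: average precision computed per post over a ranked list of comments and then averaged across posts, and the balanced F-measure (F1) over binary labels. The abstract does not say which F-measure variant or averaging scheme the article uses, so those choices, along with the function names and the toy labels, are assumptions for illustration only.

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision for one post.

    ranked_labels[i] is 1 if the comment at rank i+1 is truly diversionary,
    0 otherwise; the list is ordered from most to least diversionary
    according to the ranking method under evaluation.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Mean of the per-post average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def f_measure(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Balanced F-measure (F1) for binary labels; an assumed variant."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data: two posts with ranked comment labels.
    toy_rankings = [[1, 0, 1, 0], [0, 1]]
    print(mean_average_precision(toy_rankings))
    print(f_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

Under these definitions, a ranking that places every diversionary comment ahead of every non-diversionary one scores an average precision of 1.0 for that post.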

Index Terms

Diversionary Comments under Blog Posts
1. Information systems
  1. World Wide Web
    1. Web searching and information discovery
      1. Web search engines
        Spam detection
2. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic representations
      1. Bayesian networks

Published In

ACM Transactions on the Web Volume 9, Issue 4

October 2015

114 pages

ISSN: 1559-1131

EISSN: 1559-114X

DOI: 10.1145/2830542

Editors: Brian D. Davison (Lehigh University, USA), Marianne Winslett (University of Illinois at Urbana-Champaign)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 September 2015

Accepted: 01 June 2015

Revised: 01 March 2015

Received: 01 July 2013

Published in TWEB Volume 9, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Pinnacle Lab at Singapore Management University
Google Research Award
National Science Foundation

Bibliometrics & Citations

Article Metrics

Total Citations: 8
Total Downloads: 393
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 1

Reflects downloads up to 15 Oct 2024

Citations

Cited By

Rückin J, Magistri F, Stachniss C, Popović M. (2023). An Informative Path Planning Framework for Active Learning in UAV-Based Semantic Mapping. IEEE Transactions on Robotics 39:6 (4279-4296). DOI: 10.1109/TRO.2023.3313811. Online publication date: 1-Dec-2023
https://dl.acm.org/doi/10.1109/TRO.2023.3313811
Du P, Chen H, Zhao S, Chai S, Chen H, Li C. (2023). Contrastive Active Learning Under Class Distribution Mismatch. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:4 (4260-4273). DOI: 10.1109/TPAMI.2022.3188807. Online publication date: 1-Apr-2023
https://dl.acm.org/doi/10.1109/TPAMI.2022.3188807
Yan X, Nazmi S, Gebru B, Anwar M, Homaifar A, Sarkar M, Gupta K. (2022). A clustering-based active learning method to query informative and representative samples. Applied Intelligence 52:11 (13250-13267). DOI: 10.1007/s10489-021-03139-y. Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.1007/s10489-021-03139-y
Hwang S, Lee S, Kim S, Ok J, Kwak S. (2022). Combating Label Distribution Shift for Active Domain Adaptation. Computer Vision – ECCV 2022 (549-566). DOI: 10.1007/978-3-031-19827-4_32. Online publication date: 23-Oct-2022
https://dl.acm.org/doi/10.1007/978-3-031-19827-4_32
Dyer H. (2020). Defining Social Media…It’s Complicated. Designing the Social (15-43). DOI: 10.1007/978-981-15-5716-3_2. Online publication date: 12-Jun-2020
https://doi.org/10.1007/978-981-15-5716-3_2
Antelmi A, Malandrino D, Scarano V. (2019). Characterizing the Behavioral Evolution of Twitter Users and The Truth Behind the 90-9-1 Rule. Companion Proceedings of The 2019 World Wide Web Conference (1035-1038). DOI: 10.1145/3308560.3316705. Online publication date: 13-May-2019
https://dl.acm.org/doi/10.1145/3308560.3316705
Mozafari M, Farahbakhsh R, Crespi N. (2019). Content Similarity Analysis of Written Comments under Posts in Social Media. 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS) (158-165). DOI: 10.1109/SNAMS.2019.8931726. Online publication date: Oct-2019
https://doi.org/10.1109/SNAMS.2019.8931726
Rojas-Galeano S. (2017). On Obstructing Obscenity Obfuscation. ACM Transactions on the Web 11:2 (1-24). DOI: 10.1145/3032963. Online publication date: 24-Apr-2017
https://dl.acm.org/doi/10.1145/3032963
