research-article

Sentimental Spidering: Leveraging Opinion Information in Focused Crawlers

Authors:

Hsinchun Chen

ACM Transactions on Information Systems (TOIS), Volume 30, Issue 4

Article No.: 24, Pages 1 - 30

https://doi.org/10.1145/2382438.2382443

Published: 01 November 2012

Abstract

Despite the increased prevalence of sentiment-related information on the Web, there has been limited work on focused crawlers capable of effectively collecting not only topic-relevant but also sentiment-relevant content. In this article, we propose a novel focused crawler that incorporates topic and sentiment information as well as a graph-based tunneling mechanism for enhanced collection of opinion-rich Web content regarding a particular topic. The graph-based sentiment (GBS) crawler uses a text classifier that employs both topic and sentiment categorization modules to assess the relevance of candidate pages. This information is also used to label nodes in web graphs that are employed by the tunneling mechanism to improve collection recall. Experimental results on two test beds revealed that GBS was able to provide better precision and recall than seven comparison crawlers. Moreover, GBS was able to collect a large proportion of the relevant content after traversing far fewer pages than comparison methods. GBS outperformed comparison methods on various categories of Web pages in the test beds, including collection of blogs, Web forums, and social networking Web site content. Further analysis revealed that both the sentiment classification module and graph-based tunneling mechanism played an integral role in the overall effectiveness of the GBS crawler.
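
The general idea of combining a topic check with a sentiment check, and of tunneling through irrelevant pages to reach relevant ones, can be illustrated with a minimal sketch. This is not the GBS crawler: the abstract describes a graph-based tunneling mechanism over labeled web graphs, whereas the sketch swaps in a much simpler rule that tolerates a fixed number of consecutive irrelevant pages. The toy web, the keyword sets, and the names `is_relevant`, `crawl`, and `max_tunnel` are all hypothetical and chosen only for illustration.

```python
import heapq

# Hypothetical toy web: page -> (text, outgoing links). Not data from the article.
TOY_WEB = {
    "seed": ("acme phone review great battery", ["hub", "news"]),
    "hub": ("site index links", ["blog", "spam"]),
    "news": ("acme phone launch date announced", []),
    "blog": ("i love my acme phone terrible charger though", []),
    "spam": ("cheap watches", ["spam2"]),
    "spam2": ("cheap loans", ["spam3"]),
    "spam3": ("acme phone awful screen", []),
}

# Assumed keyword lists standing in for trained topic and sentiment classifiers.
TOPIC_TERMS = {"acme", "phone"}
OPINION_TERMS = {"great", "love", "terrible", "awful"}


def is_relevant(text):
    """Stand-in classifier: a page counts only if it is on-topic AND opinionated."""
    words = set(text.split())
    return bool(words & TOPIC_TERMS) and bool(words & OPINION_TERMS)


def crawl(web, seed, max_tunnel=1):
    """Crawl from seed, preferring pages with fewer consecutive irrelevant ancestors.

    Links are followed through at most max_tunnel consecutive irrelevant pages
    (a simplified tunneling rule, not the graph-based mechanism of the article).
    Returns (collected relevant pages in visit order, set of visited pages).
    """
    frontier = [(0, seed)]  # (consecutive irrelevant pages on the path, url)
    visited, collected = set(), []
    while frontier:
        misses, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = web[url]
        if is_relevant(text):
            collected.append(url)
            misses = 0  # a relevant page resets the tunnel budget
        else:
            misses += 1
            if misses > max_tunnel:
                continue  # tunnel budget exhausted: do not expand this page
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (misses, link))
    return collected, visited


if __name__ == "__main__":
    for budget in (0, 1, 3):
        found, seen = crawl(TOY_WEB, "seed", max_tunnel=budget)
        print(budget, found, len(seen))
```

In this toy setting, a larger tunnel budget reaches relevant pages hidden behind irrelevant ones at the cost of visiting more pages, which is the recall versus traversal trade-off the abstract refers to.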

Index Terms

Sentimental Spidering: Leveraging Opinion Information in Focused Crawlers
1. Computing methodologies
  1. Artificial intelligence
    1. Search methodologies
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

ACM Transactions on Information Systems Volume 30, Issue 4

November 2012

216 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2382438

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 November 2012

Accepted: 01 June 2012

Revised: 01 March 2012

Received: 01 April 2011
