
Sentimental Spidering: Leveraging Opinion Information in Focused Crawlers

Published: 01 November 2012

Abstract

Despite the increased prevalence of sentiment-related information on the Web, there has been limited work on focused crawlers capable of effectively collecting not only topic-relevant but also sentiment-relevant content. In this article, we propose a novel focused crawler that incorporates topic and sentiment information as well as a graph-based tunneling mechanism for enhanced collection of opinion-rich Web content regarding a particular topic. The graph-based sentiment (GBS) crawler uses a text classifier that employs both topic and sentiment categorization modules to assess the relevance of candidate pages. This information is also used to label nodes in web graphs that are employed by the tunneling mechanism to improve collection recall. Experimental results on two test beds revealed that GBS was able to provide better precision and recall than seven comparison crawlers. Moreover, GBS was able to collect a large proportion of the relevant content after traversing far fewer pages than comparison methods. GBS outperformed comparison methods on various categories of Web pages in the test beds, including collection of blogs, Web forums, and social networking Web site content. Further analysis revealed that both the sentiment classification module and graph-based tunneling mechanism played an integral role in the overall effectiveness of the GBS crawler.
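To make the idea of combining topic and sentiment relevance with tunneling concrete, the following is a minimal sketch and not the GBS crawler itself. Everything in it is an assumption made for illustration: the in-memory `TOY_WEB`, the keyword-based `topic_score` and `sentiment_score` functions, and the `max_tunnel` budget. In particular, the article's tunneling mechanism relies on labeled web graphs, whereas this sketch substitutes a much simpler rule that allows a fixed number of consecutive irrelevant pages along a path.

```python
import heapq
import re

# Hypothetical toy "web": url -> (page text, outgoing links). All data is made up.
TOY_WEB = {
    "seed": ("i love the acme phone, great value", ["review", "sports", "about"]),
    "review": ("i love my acme phone, great battery", ["forum"]),
    "sports": ("football scores and results", ["fanblog"]),
    "about": ("company contact page", ["hub"]),
    "hub": ("links directory", ["rant"]),
    "rant": ("acme phone is terrible, i hate it", []),
    "forum": ("acme phone owners: awful screen but great camera", []),
    "fanblog": ("great match, love this team", []),
}

# Hypothetical keyword lists standing in for trained classifiers.
TOPIC_TERMS = {"acme", "phone"}
OPINION_TERMS = {"love", "great", "terrible", "hate", "awful"}


def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def topic_score(text):
    """Fraction of topic terms present in the page (stand-in for a topic classifier)."""
    return len(_words(text) & TOPIC_TERMS) / len(TOPIC_TERMS)


def sentiment_score(text):
    """Crude opinion density, capped at 1.0 (stand-in for a sentiment classifier)."""
    return min(1.0, len(_words(text) & OPINION_TERMS) / 2)


def is_relevant(text, threshold=0.5):
    # A page counts only if it is both on-topic and opinion-bearing.
    return topic_score(text) >= threshold and sentiment_score(text) >= threshold


def crawl(seed, web, max_tunnel=2):
    """Best-first crawl; follows at most max_tunnel consecutive irrelevant pages."""
    # Frontier entries: (negated priority, url, irrelevant pages so far on this path).
    frontier = [(0.0, seed, 0)]
    seen = {seed}
    collected, visited = [], []
    while frontier:
        _, url, misses = heapq.heappop(frontier)
        text, links = web[url]
        visited.append(url)
        if is_relevant(text):
            collected.append(url)
            misses = 0  # a relevant page resets the tunnel budget
        else:
            misses += 1
        if misses > max_tunnel:
            continue  # tunnel budget exhausted: do not expand this page
        # Children inherit the parent's combined score as their priority.
        priority = topic_score(text) + sentiment_score(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link, misses))
    return collected, visited
```

In this toy graph the page `rant` is on-topic and opinionated but sits behind two irrelevant pages (`about` and `hub`), so it is only collected when the tunnel budget is at least 2. The page `fanblog` contains opinion words but is off-topic, so it is visited yet never collected.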




Published In

ACM Transactions on Information Systems, Volume 30, Issue 4
November 2012
216 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/2382438
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 November 2012
Accepted: 01 June 2012
Revised: 01 March 2012
Received: 01 April 2011
Published in TOIS Volume 30, Issue 4


Author Tags

  1. Web crawlers
  2. classification
  3. focused crawlers
  4. graph similarities
  5. opinion mining
  6. random walk path
  7. sentiment analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 2
Reflects downloads up to 10 Oct 2024

Cited By

  • (2024) Should Fairness be a Metric or a Model? A Model-based Framework for Assessing Bias in Machine Learning Pipelines. ACM Transactions on Information Systems 42:4 (1-41). DOI: 10.1145/3641276. Online publication date: 23-Jan-2024
  • (2024) Weakly supervised learning for an effective focused web crawler. Engineering Applications of Artificial Intelligence 132:C. DOI: 10.1016/j.engappai.2024.107944. Online publication date: 1-Jun-2024
  • (2023) Sentiment Analysis with Tweets Behaviour in Twitter Streaming API. Computer Systems Science and Engineering 45:2 (1113-1128). DOI: 10.32604/csse.2023.030842. Online publication date: 2023
  • (2023) Examining User Heterogeneity in Digital Experiments. ACM Transactions on Information Systems 41:4 (1-34). DOI: 10.1145/3578931. Online publication date: 22-Mar-2023
  • (2022) Amelioration of linguistic semantic classifier with sentiment classifier manacle for the focused web crawler. International Journal of Information Technology 15:2 (1137-1149). DOI: 10.1007/s41870-022-01139-w. Online publication date: 27-Dec-2022
  • (2022) Opinion mining in online social media: a survey. Social Network Analysis and Mining 12:1. DOI: 10.1007/s13278-021-00855-8. Online publication date: 11-Jan-2022
  • (2021) GINS. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 40:6 (11763-11776). DOI: 10.3233/JIFS-202879. Online publication date: 1-Jan-2021
  • (2021) A Critique Empirical Evaluation of Relevance Computation for Focused Web Crawlers. Brazilian Archives of Biology and Technology 64. DOI: 10.1590/1678-4324-2021210223. Online publication date: 2021
  • (2021) Big data and portfolio optimization: A novel approach integrating DEA with multiple data sources. Omega 104 (102479). DOI: 10.1016/j.omega.2021.102479. Online publication date: Oct-2021
  • (2020) A Deep Learning Architecture for Psychometric Natural Language Processing. ACM Transactions on Information Systems 38:1 (1-29). DOI: 10.1145/3365211. Online publication date: 5-Feb-2020
