DOI: 10.1145/1390334.1390413
research-article

Exploring traversal strategy for web forum crawling

Published: 20 July 2008

Abstract

In this paper, we study the problem of Web forum crawling. Web forums have become an important data source for many Web applications, yet forum crawling remains a challenging task because of the complex in-site link structures and login controls of most forum sites. Unless the traversal path is selected carefully, a generic crawler usually downloads many duplicate and invalid pages from forums, wasting both precious bandwidth and limited storage space. To crawl forum data more effectively and efficiently, we propose an automatic approach to exploring an appropriate traversal strategy that directs the crawling of a given target forum. Specifically, the traversal strategy consists of identifying the skeleton links and detecting the page-flipping links. The skeleton links instruct the crawler to crawl only valuable pages while avoiding duplicate and uninformative ones; the page-flipping links tell the crawler how to completely download a long discussion thread, which in Web forums is usually shown across multiple pages. Extensive experimental results on several forums show encouraging performance for our approach. Following the discovered traversal strategy, our forum crawler can archive more informative pages than previous related work and a commercial generic crawler.
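To make the two link roles described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes every link has already been labeled as a skeleton link, a page-flipping link, or something else. The paper's actual contribution is discovering those labels automatically, which this sketch does not attempt. The toy site `TOY_FORUM`, the label constants, and the `crawl` function are all hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical link labels; in the paper these are discovered automatically,
# here they are simply assumed to be given.
SKELETON, PAGE_FLIP, OTHER = "skeleton", "page_flip", "other"

# Hypothetical in-memory forum: URL -> list of (label, target URL).
TOY_FORUM = {
    "/index": [(SKELETON, "/board/1"), (OTHER, "/login")],
    "/board/1": [
        (SKELETON, "/thread/7"),
        (PAGE_FLIP, "/board/1?page=2"),
        (OTHER, "/board/1?sort=date"),  # duplicate view of the same board
    ],
    "/board/1?page=2": [(SKELETON, "/thread/8"), (PAGE_FLIP, "/board/1")],
    "/thread/7": [(PAGE_FLIP, "/thread/7?page=2"), (OTHER, "/reply?t=7")],
    "/thread/7?page=2": [(PAGE_FLIP, "/thread/7")],
    "/thread/8": [(OTHER, "/profile/alice")],
}


def crawl(site, seed):
    """Breadth-first crawl that follows only skeleton and page-flipping links.

    Skeleton links lead to valuable pages (boards, threads); page-flipping
    links reach the remaining pages of a multi-page board or thread. All
    other links are ignored, and each URL is visited at most once.
    """
    seen = {seed}
    queue = deque([seed])
    visited_in_order = []
    while queue:
        url = queue.popleft()
        visited_in_order.append(url)
        for label, target in site.get(url, []):
            if label == OTHER or target in seen:
                continue  # skip uninformative links and already-queued pages
            seen.add(target)
            queue.append(target)
    return visited_in_order


if __name__ == "__main__":
    for page in crawl(TOY_FORUM, "/index"):
        print(page)
```

In this toy setting the crawl reaches both pages of the board and both pages of thread 7, while the login, reply, profile, and re-sorted board pages are never fetched.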

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. forum crawler
  2. sitemap
  3. traversal strategy

Qualifiers

  • Research-article

Conference

SIGIR '08

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2023) Travellers’ social media postings during protests and mass demonstrations. Current Issues in Tourism 27:10 (1513-1529). DOI: 10.1080/13683500.2023.2214359. Online publication date: 9-Jun-2023
  • (2021) inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence. Electronics 10:7 (818). DOI: 10.3390/electronics10070818. Online publication date: 30-Mar-2021
  • (2019) A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. 2019 IEEE World Congress on Services (SERVICES) (3-8). DOI: 10.1109/SERVICES.2019.00016. Online publication date: Jul-2019
  • (2019) SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval. IEEE Access 7 (126941-126961). DOI: 10.1109/ACCESS.2019.2939872. Online publication date: 2019
  • (2017) Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation. PLOS ONE 12:1 (e0169658). DOI: 10.1371/journal.pone.0169658. Online publication date: 25-Jan-2017
  • (2017) A novel algorithm for extracting the user reviews from web pages. Journal of Information Science 43:5 (696-712). DOI: 10.1177/0165551516666446. Online publication date: 1-Oct-2017
  • (2017) Harvesting Forum Pages from Seed Sites. Web Engineering (457-468). DOI: 10.1007/978-3-319-60131-1_32. Online publication date: 1-Jun-2017
  • (2013) A Lightweight Algorithm for Automated Forum Information Processing. Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 01 (121-126). DOI: 10.1109/WI-IAT.2013.18. Online publication date: 17-Nov-2013
  • (2013) Prequery Discovery of Domain-Specific Query Forms. IEEE Transactions on Knowledge and Data Engineering 25:8 (1830-1848). DOI: 10.1109/TKDE.2012.111. Online publication date: 1-Aug-2013
  • (2013) Generalized and lightweight algorithms for automated web forum content extraction. 2013 IEEE International Conference on Computational Intelligence and Computing Research (1-8). DOI: 10.1109/ICCIC.2013.6724259. Online publication date: Dec-2013