DOI: 10.1145/1390334.1390413
research-article

Exploring traversal strategy for web forum crawling

Published: 20 July 2008

Abstract

    In this paper, we study the problem of Web forum crawling. Web forums have become an important data source for many Web applications, yet forum crawling remains a challenging task because of the complex in-site link structures and login controls of most forum sites. Without a carefully selected traversal path, a generic crawler usually downloads many duplicate and invalid pages from forums, wasting both precious bandwidth and limited storage space. To crawl forum data more effectively and efficiently, we propose an automatic approach to exploring an appropriate traversal strategy to direct the crawling of a given target forum. Specifically, the traversal strategy consists of the identification of skeleton links and the detection of page-flipping links. The skeleton links instruct the crawler to crawl only valuable pages while avoiding duplicate and uninformative ones; the page-flipping links tell the crawler how to completely download a long discussion thread, which in Web forums is usually shown across multiple pages. Extensive experimental results on several forums show encouraging performance for our approach. Following the discovered traversal strategy, our forum crawler can archive more informative pages than previous related work and a commercial generic crawler.
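
The role of the two link types can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not say how skeleton or page-flipping links are discovered or represented, so this sketch simply assumes they are already known and expressed as URL regular expressions. The patterns, the `crawl` function, and the in-memory `TOY_FORUM` link graph (a stand-in for fetching and parsing real pages) are all hypothetical.

```python
import re
from collections import deque

# Hypothetical URL patterns, assumed to be given; the abstract does not
# describe how the paper's approach actually identifies these links.
SKELETON_PATTERNS = [re.compile(r"^/board/\d+$"), re.compile(r"^/thread/\d+$")]
PAGE_FLIP_PATTERN = re.compile(r"^/(board|thread)/\d+\?page=\d+$")


def crawl(site, seed):
    """Breadth-first crawl that follows only skeleton and page-flipping links.

    `site` maps a URL to the list of URLs linked from that page.
    Returns the URLs in the order they were visited.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in site.get(url, []):
            if link in seen:
                continue  # already queued or visited
            is_skeleton = any(p.match(link) for p in SKELETON_PATTERNS)
            is_page_flip = PAGE_FLIP_PATTERN.match(link) is not None
            if is_skeleton or is_page_flip:
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph: login, profile, reply, and re-sorted views are the kind of
# invalid or duplicate pages a generic crawler would also download.
TOY_FORUM = {
    "/": ["/board/1", "/login", "/profile/7"],
    "/board/1": ["/thread/10", "/thread/10?sort=desc", "/profile/7"],
    "/thread/10": ["/thread/10?page=2", "/reply?to=10", "/login"],
    "/thread/10?page=2": ["/thread/10", "/thread/10?page=3"],
    "/thread/10?page=3": ["/thread/10?page=2"],
}

crawled = crawl(TOY_FORUM, "/")
print(crawled)
```

In this toy graph the crawler reaches the board, the thread, and all three pages of the thread, while skipping the login, profile, reply, and re-sorted pages.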


    Published In

    SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
    July 2008
    934 pages
    ISBN: 9781605581644
    DOI: 10.1145/1390334

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. forum crawler
    2. sitemap
    3. traversal strategy

    Qualifiers

    • Research-article

    Conference

    SIGIR '08

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2023) Travellers’ social media postings during protests and mass demonstrations. Current Issues in Tourism 27:10 (1513-1529). DOI: 10.1080/13683500.2023.2214359. Online publication date: 9-Jun-2023.
    • (2021) inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence. Electronics 10:7 (818). DOI: 10.3390/electronics10070818. Online publication date: 30-Mar-2021.
    • (2019) A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence. 2019 IEEE World Congress on Services (SERVICES) (3-8). DOI: 10.1109/SERVICES.2019.00016. Online publication date: Jul-2019.
    • (2019) SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval. IEEE Access 7 (126941-126961). DOI: 10.1109/ACCESS.2019.2939872. Online publication date: 2019.
    • (2017) Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation. PLOS ONE 12:1 (e0169658). DOI: 10.1371/journal.pone.0169658. Online publication date: 25-Jan-2017.
    • (2017) A novel algorithm for extracting the user reviews from web pages. Journal of Information Science 43:5 (696-712). DOI: 10.1177/0165551516666446. Online publication date: 1-Oct-2017.
    • (2017) Harvesting Forum Pages from Seed Sites. Web Engineering (457-468). DOI: 10.1007/978-3-319-60131-1_32. Online publication date: 1-Jun-2017.
    • (2013) A Lightweight Algorithm for Automated Forum Information Processing. Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 01 (121-126). DOI: 10.1109/WI-IAT.2013.18. Online publication date: 17-Nov-2013.
    • (2013) Prequery Discovery of Domain-Specific Query Forms. IEEE Transactions on Knowledge and Data Engineering 25:8 (1830-1848). DOI: 10.1109/TKDE.2012.111. Online publication date: 1-Aug-2013.
    • (2013) Generalized and lightweight algorithms for automated web forum content extraction. 2013 IEEE International Conference on Computational Intelligence and Computing Research (1-8). DOI: 10.1109/ICCIC.2013.6724259. Online publication date: Dec-2013.
