DOI: 10.1007/978-3-642-39200-9_49
Article

Current challenges in web crawling

Published: 08 July 2013

Abstract

    Web crawling, the process of collecting web pages in an automated manner, is a primary and ubiquitous operation used by a large number of web systems and agents, from a simple website backup program to a major web search engine. Because of the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we will introduce the audience to five topics: the architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.



      Published In

      ICWE'13: Proceedings of the 13th international conference on Web Engineering
      July 2013
      525 pages
      ISBN: 9783642391996
      Editors: Florian Daniel, Peter Dolog, Qing Li

      Sponsors

      • Otto Monsted Fond
      • Det Obelske Familiefond

      Publisher

      Springer-Verlag, Berlin, Heidelberg


      Author Tags

      1. collaborative crawling
      2. crawler architecture
      3. deep web
      4. distributed crawling
      5. focused crawling
      6. web coverage
      7. web crawler
      8. web crawling
      9. web ecosystem
      10. web graph
      11. web growth
      12. web harvesting
      13. web mining
      14. web retrieval
      15. web robot
      16. web spider
      17. web structure
