Article

Current challenges in web crawling

Author:

Denis Shestakov

ICWE'13: Proceedings of the 13th international conference on Web Engineering

Pages 518 - 521

https://doi.org/10.1007/978-3-642-39200-9_49

Published: 08 July 2013

Abstract

Web crawling, the process of collecting web pages in an automated manner, is a primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Because of the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we introduce the audience to five topics: the architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.

Published In

ICWE'13: Proceedings of the 13th international conference on Web Engineering

July 2013

525 pages

ISBN: 9783642391996

Editors:
Florian Daniel, University of Trento, Via Sommarive 5, Povo, TN, Italy
Peter Dolog, Department of Computer Science, Aalborg University, Selma Lagerloefs Vej 300, Aalborg, Denmark
Qing Li, Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China

Sponsors

Otto Monsted Fond
Det Obelske Familiefond

Publisher

Springer-Verlag

Berlin, Heidelberg
