
Mercator: A scalable, extensible Web crawler

Published: 15 April 1999

Abstract

This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.


Published In

World Wide Web, Volume 2, Issue 4
1999
70 pages

Publisher

Kluwer Academic Publishers

United States
