Writing a Web Crawler in the Java Programming Language


By Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC January 1998
Everyone uses web crawlers, indirectly at least! Every time you search the Internet using a service such as Alta Vista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers, also known as spiders, robots, or wanderers, are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found.

However, you might want to use a crawler directly. You might even want to write your own! Here are some possible reasons:

You want to maintain mirror sites for popular Web sites.
You need to test web pages and links for valid syntax and structure.
You want to monitor sites to see when their structure or contents change.
Your company needs to search for copyright infringements.
You'd like to build a special-purpose index: for example, one that has some understanding of the content stored in multimedia files on the Web.

This article explains what web crawlers are. It includes a web-crawling demo program, written in the Java programming language, that you can run from your browser. The demo traverses the Web automatically, shows a running list of files it has found, and updates the list each time it finds a new one. You can specify what type of file you want to find. The Java language source code for this demo application is provided as a programming example.

How Web Crawlers Work

Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is to automate the process of following links. Following links isn't greatly useful in itself, of course. The list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but crawlers are also used for other purposes, such as those mentioned in the previous section.
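To make the "noting any hypertext links" step concrete, here is a minimal sketch in Java. It is not the demo's code: the class name LinkExtractor, the method extractLinks, and the regular expression are assumptions made for illustration, and a production crawler would more likely use a real HTML parser. The sketch only pulls href values out of anchor tags and resolves them against the URL of the page they came from.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the article's demo.
final class LinkExtractor {
    // Simplified pattern: captures a quoted href value in an <a> tag,
    // stopping at the closing quote or at a "#fragment".
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"'#]+)[\"'#]",
        Pattern.CASE_INSENSITIVE);

    // Returns the absolute URLs of the links found in html,
    // resolving relative links against baseUrl.
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Malformed link text: skip it rather than abort the page.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a> <a href='/sounds/a.au'>audio</a>";
        for (String link : extractLinks(html, "http://example.com/dir/index.html")) {
            System.out.println(link);
        }
    }
}
```

Fetching the page text itself is left out so that the sketch stays independent of the network; the crawler would obtain the HTML through an ordinary HTTP request, as described above.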

http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/[17/5/2011 2:26:17 AM]

Muscle Fish uses a crawler to search the Web for audio files. This is a straightforward task, as shown by the demo in the next section. It turns out that searching for audio files is not very different from searching for any other kind of file. On the other hand, indexing audio is anything but straightforward. Most search engines, if they handle audio at all, index only textual information that's associated with the sound file. Muscle Fish's approach is to acoustically analyze the audio itself. This feature lets you search for sound files based on how they actually sound; you're not limited to searching for whatever words happen to be located nearby on the same web page. (A forthcoming article and demo program will show this feature.)

A Web-Crawling Demo Program

The simple application shown below crawls the Web, searching for a specified type of file. Note: This demo was written using JDK 1.1.3. Not all web browsers support such a recent version of JDK. You can run the demo on any platform by using the HotJava browser. On the Macintosh, the demo should work with any browser that uses MRJ (Macintosh Runtime for Java) 2.0.

Application source code.

To run the demo, follow these steps:

1. Type a valid URL (web address), including the "http://" portion, in the text field at the top of the application window.
2. Click the Search button.
3. Look at the status area below the scrolling list. In this area, the application reports which page it is currently searching. As it encounters links on the page, it adds any new URLs to the scrolling list.

The application remembers which pages it's already visited, so it won't search any web page twice. This prevents infinite loops. As you inspect the list of URLs, you can see that the application performs a breadth-first search. In other words, it accumulates a list of all the links that are on the current page before it follows any of the links to a new page.

If you tire of witnessing this little tour of the Web, click the Stop button. The status area reports "stopped." If you let the tour run without stopping, it will eventually stop on its own once it's found 50 files. At this point, it reports "reached search limit of 50." (You can increase the limit by changing the SEARCH_LIMIT constant in the source code.) The application will also stop automatically if it encounters a dead end, meaning that it's traversed all the files that are directly or indirectly available from the starting position you specified. If this happens, the application reports "done."

The next time you click Search, the list of files gets cleared, and the search process starts over again.

Notice that there's a pull-down menu that lets you specify what type of file you want to find. The default is HTML text files. You can also choose "audio/basic," "audio/au," "audio/aiff," "audio/wav," "video/mpeg," or "video/x-avi."
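The breadth-first behavior and the "never search a page twice" rule described in this section can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the demo's source: the class BreadthFirstCrawl and the linksOf function are hypothetical, the Web is stood in for by an in-memory map, and the limit here counts pages visited, whereas the demo's SEARCH_LIMIT counts files found.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a breadth-first traversal with a visit limit.
final class BreadthFirstCrawl {
    static final int SEARCH_LIMIT = 50;

    // linksOf maps a page to the links found on it (assumed to be supplied
    // by the caller, e.g. by fetching and parsing the page).
    static List<String> crawl(String start,
                              Function<String, List<String>> linksOf,
                              int limit) {
        Queue<String> toSearch = new ArrayDeque<>(); // FIFO order gives breadth-first
        Set<String> seen = new HashSet<>();          // queued or already searched
        List<String> visited = new ArrayList<>();
        toSearch.add(start);
        seen.add(start);
        while (!toSearch.isEmpty() && visited.size() < limit) {
            String page = toSearch.remove();
            visited.add(page);
            for (String link : linksOf.apply(page)) {
                // Set.add returns false for a known URL, which prevents loops.
                if (seen.add(link)) {
                    toSearch.add(link);
                }
            }
        }
        return visited; // an empty queue corresponds to the demo's "done" state
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d", "a"),
            "c", List.of("d"),
            "d", List.<String>of());
        System.out.println(crawl("a", p -> web.getOrDefault(p, List.<String>of()), SEARCH_LIMIT));
    }
}
```

Because every link on the current page is queued before the next page is taken from the front of the queue, pages are visited level by level, which is the ordering the scrolling list makes visible.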

A Look at the Code

Take a look at the Java-language source code for this demo. The code occupies fewer than 400 lines, including comments. It is a testament to the JDK's elegance that this application took only a few person-hours to write from scratch. (Muscle Fish had never written a crawler before, nor was any pre-existing web-crawler code borrowed or studied.) Here's a pseudocode summary of the algorithm:

Get the user's input: the starting URL and the desired file type.
Add the URL to the currently empty list of URLs to search.
While the list of URLs to search is not empty,
{
    Get the first URL in the list.
    Move the URL to the list of URLs already searched.
    Check the URL to make sure its protocol is HTTP
        (if not, break out of the loop, back to "While").
    See whether there's a robots.txt file at this site
        that includes a "Disallow" statement.
        (If so, break out of the loop, back to "While".)
    Try to "open" the URL (that is, retrieve that document
        from the Web).
    If it's not an HTML file, break out of the loop,
        back to "While."
    Step through the HTML file.
    While the HTML text contains another link,
    {
        Validate the link's URL and make sure robots are allowed
            (just as in the outer loop).
        If it's an HTML file,
            If the URL isn't present in either the to-search list
                or the already-searched list, add it to the to-search list.
        Else if it's the type of the file the user requested,
            Add it to the list of files found.
    }
}
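The decision made for each link in the inner loop of the pseudocode can be written as a small Java method. The sketch below follows the pseudocode as summarized above rather than the demo's actual source; the names CrawlStep, Action, and classify are hypothetical, and the content type is passed in as a string so that the example does not depend on a network connection.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the per-link decision in the inner loop.
final class CrawlStep {
    enum Action { ENQUEUE, RECORD, SKIP }

    static Action classify(String url, String contentType, String wantedType,
                           Set<String> toSearch, Set<String> searched) {
        // Only HTTP URLs are followed, as in the pseudocode.
        if (!url.toLowerCase(Locale.ROOT).startsWith("http://")) {
            return Action.SKIP;
        }
        if (contentType == null) {
            return Action.SKIP; // type unknown: nothing sensible to do
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        if (type.equals("text/html")) {
            boolean known = toSearch.contains(url) || searched.contains(url);
            return known ? Action.SKIP : Action.ENQUEUE;
        }
        if (type.equals(wantedType)) {
            return Action.RECORD; // goes on the list of files found
        }
        return Action.SKIP;
    }

    public static void main(String[] args) {
        Set<String> toSearch = new HashSet<>();
        Set<String> searched = new HashSet<>();
        System.out.println(classify("http://example.com/a.html", "text/html",
                                    "audio/basic", toSearch, searched));
        System.out.println(classify("http://example.com/a.au", "audio/basic",
                                    "audio/basic", toSearch, searched));
    }
}
```

The robots check from the pseudocode is omitted here; in a fuller sketch it would sit next to the protocol check, before any content type is examined.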

This demo tries to respect the robots exclusion standard, meaning that it avoids sites where it's unwelcome. Any site can exclude web crawlers from all or part of its filesystem, by putting certain statements in a file called robots.txt. See the robotSafe function in the demo's source code. This function is conservative in that it avoids sites where any crawler is disallowed, even if this particular one is not. (There is a new HTML meta-tag called ROBOTS, which this demo does not yet support. If you revise the source code to support this meta-tag, send your code to the authors and the version posted here will be updated.)
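A conservative check of this kind might look like the following Java sketch. It is not the demo's robotSafe function: the class RobotsCheck and the method conservativelyAllowed are hypothetical names, and the robots.txt contents are assumed to have been downloaded already and are passed in as a string. In the spirit described above, the sketch ignores User-agent lines entirely and treats every Disallow rule as if it applied to this crawler.

```java
// Hypothetical sketch of a conservative robots.txt check.
final class RobotsCheck {
    private static final String DISALLOW = "Disallow:";

    // Returns false if any Disallow rule in robotsTxt is a prefix of path,
    // no matter which crawler the rule was written for.
    static boolean conservativelyAllowed(String robotsTxt, String path) {
        if (robotsTxt == null) {
            return true; // assumption: no robots.txt means no restrictions
        }
        for (String line : robotsTxt.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            if (line.regionMatches(true, 0, DISALLOW, 0, DISALLOW.length())) {
                String rule = line.substring(DISALLOW.length()).trim();
                // An empty Disallow value excludes nothing.
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /cgi-bin/\n";
        System.out.println(conservativelyAllowed(robots, "/cgi-bin/search"));
        System.out.println(conservativelyAllowed(robots, "/index.html"));
    }
}
```

Ignoring the User-agent lines keeps the check simple at the cost of skipping some pages that this crawler would in fact be permitted to read.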

Where to Go from Here

This simple programming example might have given you some ideas about how to write a full-fledged web crawler. Muscle Fish can't provide technical support for running this demo program or for writing crawlers. However, there are various resources on the Web for people interested in crawlers. The Web Robots Pages is a good starting point, and it contains links to other important sites.

Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold are members of Muscle Fish, LLC, a software consulting firm in Berkeley, California. Muscle Fish specializes in audio and music technology, and produces software that searches for sound based on its acoustical content.

