WebCrawler in Java
By Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC January 1998
Everyone uses web crawlers, indirectly at least! Every time you search the Internet using a service such as Alta Vista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers, also known as spiders, robots, or wanderers, are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found.

However, you might want to use a crawler directly. You might even want to write your own! Here are some possible reasons:

- You want to maintain mirror sites for popular Web sites.
- You need to test web pages and links for valid syntax and structure.
- You want to monitor sites to see when their structure or contents change.
- Your company needs to search for copyright infringements.
- You'd like to build a special-purpose index: for example, one that has some understanding of the content stored in multimedia files on the Web.

This article explains what web crawlers are. It includes a web-crawling demo program, written in the Java programming language, that you can run from your browser. The demo traverses the Web automatically, shows a running list of files it has found, and updates the list each time it finds a new one. You can specify what type of file you want to find. The Java language source code for this demo application is provided as a programming example.
Writing a Web Crawler in the Java Programming Language

Muscle Fish uses a crawler to search the Web for audio files. This is a straightforward task, as shown by the demo in the next section. It turns out that searching for audio files is not very different from searching for any other kind of file.

On the other hand, indexing audio is anything but straightforward. Most search engines, if they handle audio at all, index only textual information that's associated with the sound file. Muscle Fish's approach is to acoustically analyze the audio itself. This feature lets you search for sound files based on how they actually sound; you're not limited to searching for whatever words happen to be located nearby on the same web page. (A forthcoming article and demo program will show this feature.)
Application source code.

To run the demo, follow these steps:

1. Type a valid URL (web address), including the "http://" portion, in the text field at the top of the application window.
2. Click the Search button.
3. Look at the status area below the scrolling list. In this area, the application reports which page it is currently searching. As it encounters links on the page, it adds any new URLs to the scrolling list.

The application remembers which pages it's already visited, so it won't search any web page twice. This prevents infinite loops.

As you inspect the list of URLs, you can see that the application performs a breadth-first search. In other words, it accumulates a list of all the links that are on the current page before it follows any of the links to a new page.

If you tire of witnessing this little tour of the Web, click the Stop button. The status area reports "stopped." If you let the tour run without stopping, it will eventually stop on its own once it's found 50 files. At this point, it reports "reached search limit of 50." (You can increase the limit by changing the SEARCH_LIMIT constant in the source code.)

The application will also stop automatically if it encounters a dead end, meaning that it's traversed all the files that are directly or indirectly available from the starting position you specified. If this happens, the application reports "done." The next time you click Search, the list of files gets cleared, and the search process starts over again.

Notice that there's a pull-down menu that lets you specify what type of file you want to find. The default is HTML text files. You can also choose "audio/basic," "audio/au," "audio/aiff," "audio/wav," "video/mpeg," or "video/x-avi."
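The breadth-first behavior described above (a list of pages still to search, a record of pages already seen, and a search limit) can be sketched as follows. This is a minimal illustration, not the demo's actual source: the class name BreadthFirstSketch, the crawl method, and the in-memory map of links (which stands in for fetching real pages) are all assumptions made for the example. Only the SEARCH_LIMIT name and its value of 50 come from the article.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class BreadthFirstSketch {
    // Same name and value as the limit the article mentions; everything else is hypothetical.
    static final int SEARCH_LIMIT = 50;

    /**
     * Visits pages breadth-first starting at 'start'.
     * 'links' maps each page to the pages it links to (an in-memory stand-in
     * for downloading and parsing a page). Returns pages in the order visited.
     */
    static List<String> crawl(String start, Map<String, List<String>> links, int limit) {
        Queue<String> toSearch = new ArrayDeque<>(); // pages waiting to be searched
        Set<String> seen = new HashSet<>();          // pages queued or already searched
        List<String> visited = new ArrayList<>();
        toSearch.add(start);
        seen.add(start);
        while (!toSearch.isEmpty() && visited.size() < limit) {
            String page = toSearch.remove();
            visited.add(page);
            // Queue every new link on this page before following any of them.
            for (String link : links.getOrDefault(page, List.of())) {
                if (seen.add(link)) {                // add() returns false for a repeat
                    toSearch.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "d"),
            "c", List.of("d"),
            "d", List.of("a"));
        System.out.println(crawl("a", links, SEARCH_LIMIT)); // prints [a, b, c, d]
    }
}
```

Because every link is checked against the set of seen pages before it is queued, the cycle from "d" back to "a" does not cause an infinite loop, and the traversal ends on its own when the queue is empty (the "dead end" case) or when the limit is reached.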
Get the user's input: the starting URL and the desired file type.
Add the URL to the currently empty list of URLs to search.
While the list of URLs to search is not empty,
{
    Get the first URL in the list.
    Move the URL to the list of URLs already searched.
    Check the URL to make sure its protocol is HTTP
        (if not, skip the rest of this pass and go back to "While").
    See whether there's a robots.txt file at this site that includes a "Disallow" statement.
        (If so, skip the rest of this pass and go back to "While".)
    Try to "open" the URL (that is, retrieve that document from the Web).
    If it's not an HTML file, skip the rest of this pass and go back to "While."
    Step through the HTML file. While the HTML text contains another link,
    {
        Validate the link's URL and make sure robots are allowed (just as in the outer loop).
        If it's an HTML file,
            If the URL isn't present in either the to-search list or the already-searched list,
                add it to the to-search list.
        Else if it's the type of the file the user requested,
            Add it to the list of files found.
    }
}
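The inner loop of the outline above ("While the HTML text contains another link") needs some way to pull links out of a page and turn them into absolute URLs. The following is a rough sketch of one way to do that, assuming a simple regular expression is good enough for well-formed anchor tags. The class name LinkExtractorSketch and the method extractLinks are hypothetical and are not taken from the demo's source; a real crawler would likely want a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
    // Matches the quoted href value inside an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Returns the distinct absolute HTTP URLs linked from 'html', in page order. */
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');      // drop any "#fragment" part
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;                      // link pointed only within the same page
            }
            try {
                URI resolved = base.resolve(href); // relative links become absolute
                String url = resolved.toString();
                // Keep HTTP links only, as the outline requires, and skip repeats.
                if ("http".equalsIgnoreCase(resolved.getScheme()) && !found.contains(url)) {
                    found.add(url);
                }
            } catch (IllegalArgumentException e) {
                // Malformed link text: skip it rather than abort the whole page.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html#top\">Two</a> "
            + "<A HREF='/audio/x.au'>Sound</A> "
            + "<a href=\"mailto:someone@example.com\">Mail</a>";
        for (String url : extractLinks("http://example.com/dir/index.html", html)) {
            System.out.println(url);
        }
    }
}
```

Each URL returned this way would then go through the remaining checks in the outline (robots exclusion, file type, and the two lists) before being queued or reported.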
This demo tries to respect the robots exclusion standard, meaning that it avoids sites where it's unwelcome. Any site can exclude web crawlers from all or part of its filesystem by putting certain statements in a file called robots.txt. See the robotSafe function in the demo's source code. This function is conservative in that it avoids sites where any crawler is disallowed, even if this particular one is not. (There is a new HTML meta-tag called ROBOTS, which this demo does not yet support. If you revise the source code to support this meta-tag, send your code to the authors and the version posted here will be updated.)
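The conservative policy described above can be illustrated with a short sketch. This is not the demo's robotSafe function; the class name RobotsSketch and the method isPathAllowed are hypothetical, and the example assumes the contents of robots.txt have already been downloaded into a string. In keeping with the conservative approach, it ignores User-agent lines entirely and treats a Disallow rule written for any crawler as applying to this one.

```java
import java.util.Locale;

class RobotsSketch {
    /**
     * Returns false if any "Disallow:" line in 'robotsTxt' names a prefix of 'path',
     * no matter which User-agent the rule was written for.
     */
    static boolean isPathAllowed(String robotsTxt, String path) {
        for (String rawLine : robotsTxt.split("\\R")) {   // split on any line break
            String line = rawLine;
            int hash = line.indexOf('#');                 // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (line.toLowerCase(Locale.ROOT).startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                // An empty Disallow value excludes nothing, so it is ignored.
                if (!prefix.isEmpty() && path.startsWith(prefix)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: SomeOtherBot\n"
            + "Disallow: /private/   # keep out\n"
            + "\n"
            + "User-agent: *\n"
            + "Disallow:\n";
        System.out.println(isPathAllowed(robotsTxt, "/private/a.html")); // prints false
        System.out.println(isPathAllowed(robotsTxt, "/public/a.html"));  // prints true
    }
}
```

In the usage example, the rule for /private/ is addressed only to SomeOtherBot, yet the sketch still refuses that path. That is the trade-off of the conservative approach: the crawler may skip pages it was technically allowed to read, but it never needs to decide which User-agent sections apply to it.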
Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold are members of Muscle Fish, LLC, a software consulting firm in Berkeley, California. Muscle Fish specializes in audio and music technology, and produces software that searches for sound based on its acoustical content.