Journal of Theoretical and Applied Information Technology
15 September 2012. Vol. 43 No.1
© 2005 - 2012 JATIT & LLS. All rights reserved.
ISSN: 1992-8645
www.jatit.org
E-ISSN: 1817-3195
AN APPROACH TO DESIGN INCREMENTAL PARALLEL
WEBCRAWLER
DIVAKAR YADAV1, AK SHARMA2, SONIA SANCHEZ-CUADRADO3, JORGE MORATO4
1Assistant Professor, Department of Computer Science & Engg. and IT, JIIT, Noida (India)
2Professor and Dean, Department of Computer Science & Engg., YMCA University, Faridabad (India)
3,4Associate Professor, Department of Computer Science & Engg., UC3, Madrid (Spain)
Email: 1dsy99@rediffmail.com, 2ashokkale2@rediffmail.com, 3ssanchec@ie.inf.uc3m.es, 4jorge@kr.inf.uc3m.es
ABSTRACT
The World Wide Web (WWW) is a huge repository of interlinked hypertext documents known as web pages. Users access these hypertext documents via the Internet. Since its inception in 1990, the WWW has grown many fold in size and now contains more than 50 billion publicly accessible web documents distributed all over the world on thousands of web servers, and it is still growing at an exponential rate. It is very difficult to search information in such a huge collection because the web pages are not organized like books on shelves in a library, nor are they completely catalogued at one central location. The search engine is the basic information retrieval tool used to access information from the WWW. In response to a search query provided by a user, search engines use their database to find the relevant documents and produce the result after ranking on the basis of relevance. In fact, a search engine builds its database with the help of WebCrawlers. To maximize the download rate and to retrieve the whole or a significant portion of the Web, search engines run multiple crawlers in parallel.
Overlapping of downloaded web documents, quality, network bandwidth and refreshing of web documents are the major challenging problems faced by existing parallel WebCrawlers that are addressed in this work. A novel Multi Threaded (MT) server based architecture for an incremental parallel web crawler has been designed that helps to reduce the overlapping, quality and network bandwidth problems. Additionally, web page change detection methods have been developed to refresh the web documents by detecting structural, presentation and content level changes in them. These change detection methods help to detect whether the version of a web page existing at the search engine side has changed from the one existing at the web server end. If it has changed, the WebCrawler replaces the existing version in the search engine database to keep the repository up-to-date.
Keywords: World Wide Web (WWW), Uniform Resource Locator (URL), Search engine, WebCrawler, Checksum, Change detection, Ranking algorithms.
1. INTRODUCTION

About 20% of the world's population uses the Web [28], a share that is growing day by day, and a large majority thereof uses web search engines to find information. Search engines consist of three major components: indexer, query processor and crawlers. The indexer processes pages, decides which of them to index and builds various data structures (inverted index, web graph, etc.) representing the pages. The query processor processes user queries and returns matching answers in an order determined by ranking algorithms. WebCrawlers are responsible for maintaining the indexed database of search engines, which users thus rely on indirectly. Every time one searches the Internet using a service such as Google, Alta Vista, Excite or Lycos, one is making use of an index that is based on the output of WebCrawlers. WebCrawlers, also known as spiders, robots or wanderers, are software programs that automatically traverse the web [4]. Search engines use them to find what is on the web. A crawler starts by parsing a specified web page and noting any hypertext links on that page that point to other web pages; it then recursively parses those pages for new links.
The size of the WWW is enormous and there is no known method to find its exact size, but it may be approximated by observing the indexes maintained by key search engines. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents [1]. Google, the most popular search engine, had 26 million web pages indexed in 1998, reached the one billion mark by 2000, and now indexes around 50 billion web pages [2]. The situation is similar for other search engines. So a major difficulty for any search engine is to create a repository of high quality web pages and keep it up-to-date. It is not possible to create such a huge database in time with a single WebCrawler, because it might take months or even longer, and in the meantime large fractions of the web pages would have changed and thus be of little use to end users. To minimize download time, search engines execute multiple crawlers simultaneously, known as parallel WebCrawlers [4, 8, 26].

The aim of this paper is to propose a design for a parallel WebCrawler and change detection techniques for refreshing web documents in the variable frequency scheme. The crawling scheme discussed in this paper for the parallel WebCrawler is such that important URLs/documents are crawled first, and thus the fraction of the web that is visited is more meaningful and up-to-date. The paper is organized as follows: Section 2 discusses related work on parallel WebCrawlers and page refreshment techniques. Section 3 discusses the proposed design architecture for the parallel WebCrawler. Section 4 discusses the proposed change detection algorithms and finally section 5 concludes the work along with some future directions, followed by references.
2. RELATED WORKS
Though most popular search engines nowadays use parallel/distributed crawling schemes, not much literature is available on them because the engines generally do not disclose the internal working of their systems.

Junghoo Cho and Hector Garcia-Molina [3] proposed an architecture for a parallel crawler and discussed the fundamental issues related to it. They concentrated mainly on three issues of parallel crawlers, namely overlap, quality and communication bandwidth. In [9] P. Boldi et al. have described UbiCrawler, a scalable fully distributed web crawler. The main features mentioned are platform independence, linear scalability, graceful degradation in the presence of faults, an effective assignment function based on consistent hashing for partitioning the domain to crawl, and complete decentralization of every task. In [10] the authors have discussed the design and implementation of a distributed web crawler. In [5] Junghoo Cho et al. have discussed the order in which a crawler should visit the URLs in order to obtain more important pages first. The paper defined several important metrics, ordering schemes and performance evaluation measures for this problem. Its results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without it.
The other major problem associated with any web crawler is the refreshing policy used to keep the indexes of web pages up to date with the copies maintained at the owners' end. Two policies are used for this purpose: the first is based on fixed frequency and the second on variable frequency [6]. In the fixed frequency scheme all web pages are revisited after a fixed interval (say once in 15 days) irrespective of how often they change, updating all pages in the WebCrawler's collection, whereas in the variable frequency scheme the revisit policy is based on how frequently web documents change. The more frequently changing documents are revisited within a shorter span of time, whereas less frequently changing web pages are revisited at longer intervals.
Another major problem associated with web crawlers is the dynamic nature of web documents. No methods are known till date that provide the actual change frequencies of web documents, because changes in web pages follow a Poisson process [7], but a few researchers [6, 11-14] have discussed their dynamic nature and change frequency. According to [27] it takes approximately 6 months for a new page to be indexed by a popular search engine. Junghoo Cho and H. Garcia-Molina [6] performed experiments to find the change frequencies of web pages, mainly from the .com, .net/.org, .edu and .gov domains. After observing the web pages from these four domains for 4 continuous months, it was found that web pages from the .com domain change at the highest frequency whereas .edu and .gov pages are largely static in nature. In [14] the authors have discussed
analytically as well as experimentally the effective page refresh policies for web crawlers. According to it, the freshness of a local database S with N elements at time t is given as F(S; t) = M/N, where M (< N) is the number of up-to-date elements at time t. The paper also proposed a metric, age, to calculate how old a database is.
Papers [12, 15-23, 25, 29] discuss methods for change detection in HTML and XML documents. In [12] researchers of AT&T and Lucent Technologies, Bell Laboratories, have discussed the details of the Internet difference search engine (AIDE) that finds and displays changes to pages on the World Wide Web. In [15] Ling Liu et al. have discussed WebCQ, a prototype system for large scale web information monitoring and delivery. It consists of four main components: a change detection robot that discovers and detects changes, a proxy cache service that reduces communication traffic to the original information servers, a personalized presentation tool that highlights changes detected by WebCQ sentinels, and a change notification service that delivers fresh information to the right users at the right times. Ntoulas [16] collected a historical database for the web by downloading 154 popular web sites (e.g., acm.org, hp.com and oreilly.com) every week from October 2002 until October 2003, for a total of 51 weeks. The experiments show that a significant fraction (around 50%) of web pages remained completely unchanged during the entire period of observation. To measure the degree of change, the shingles of each document were computed and the difference of shingles between different versions of the web documents was measured. Fetterly [23] performed a large crawl that downloaded about 151 million HTML pages and then attempted to fetch each of these pages ten more times over a span of ten weeks, from Dec. 2002 to Mar. 2003. For each version of the documents, checksums and shingles were computed to measure the degree of change. The degree of change was categorized into 6 groups: complete change, large change, medium change, small change, no text change, and no change. The experiments show that about 76% of all pages fall into the groups of no text change and no change. The percentage for the group of small change is around 16% while the percentage for the groups of complete change and large change is only 3%. These results are very supportive for studying the change behavior of web documents. The paper suggests that an incremental method may be very effective in updating web indexes, and that searching for new information appearing on the web by retrieving the changes will require a small amount of data processing compared to the huge size of the web.

3. PROPOSED ARCHITECTURE OF PARALLEL CRAWLER AND CHANGE DETECTION METHODS

The proposed crawler (see Figure 1) has a client-server based architecture consisting of the following main components:
• Multi Threaded (MT) server
• Client crawlers
• Change detection module
The Multi Threaded (MT) server is the main
coordinating component of the architecture. On its
own, it does not download any web document but
manages a connection pool with client machines
which actually download the web documents.
The client crawlers collectively refer to all the different instances of client machines interacting with each other through the server. The number of clients may vary depending on the availability of resources and the scale of the actual implementation.
The change detection module helps to identify whether the target page has changed; consequently, only the changed documents are stored in the repository in search/insert fashion. Thus the repository is kept up-to-date, with fresh and the latest information available at the search engine database end.
3.1 Multi Threaded (MT) Server
The Multi Threaded server is the main
coordinating component of the proposed architecture
(see Figure 2), directly involved in interaction with
client processes ensuring that there is no need for
direct communication among them.
The sub-components of MT server are:
• URL dispatcher
• Ranking module
• URL distributor
• URL allocator
• Indexer
• Repository
3.1.1 URL Dispatcher
The URL dispatcher is initiated by seed URLs. In
the beginning, both unsorted as well as sorted URL
queues [see Figure 2] are empty and the seed URL
received from user is stored in sorted URLs queue
and a message called distribute_URLs is sent to URL
10
Journal of Theoretical and Applied Information Technology
15 September 2012. Vol. 43 No.1
© 2005 - 2012 JATIT & LLS. All rights reserved.
ISSN: 1992-8645
www.jatit.org
distributor. URL distributor selects the seed URL
from the sorted queue and puts it in first priority list
(P_list 1) as shown in Figure 2. URL allocator picks
the URL from priority list and assigns it to client
crawler to download web document. After
downloading the document, the client crawler parses
it to extract the embedded URLs within it and stores
the web document and corresponding URLs in
document and URL buffer. Every URL, retrieved by
the URL dispatcher from document and URL buffer,
is verified from repository before putting it in the
unsorted queue, to know whether it has already been
downloaded or not. If it is found that the URL is
already downloaded and the corresponding
document exists in the repository then the retrieved
URL is discarded and not stored in unsorted queue.
By doing this, URL dispatcher ensures that the same
URL is not used multiple times to download its
corresponding document from WWW and thus it
helps to save network bandwidth. URL dispatcher
keeps all new URLs in the queue of unsorted URLs
in the order they are retrieved from the buffer and
sends rank_URL message to ranking module. This
process is repeated till the queue of sorted URLs
becomes empty.
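As an illustration of the duplicate check performed by the URL dispatcher before queuing a URL, the following minimal Python sketch models the repository as a simple set of already crawled URLs (the names downloaded, unsorted_queue and dispatch are illustrative and not part of the proposed system):

# Illustrative sketch of the URL dispatcher's duplicate check; the repository
# is modelled as a set of URLs whose documents have already been downloaded.
downloaded = set()
unsorted_queue = []        # unsorted URLs awaiting the ranking module

def dispatch(candidate_urls):
    """Keep only URLs that are not already present in the repository."""
    for url in candidate_urls:
        if url in downloaded:
            continue                     # discard: document already downloaded
        if url not in unsorted_queue:
            unsorted_queue.append(url)   # preserve arrival order for ranking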
3.1.2 Ranking Module

After receiving the Rank_URLs message from the dispatcher, the ranking module retrieves URLs from the unsorted queue and computes their priority. For computing the priority of the URLs to be crawled, it considers both their forward and back link counts, where the forward link count is the number of URLs present in the web page pointing to other web pages of the WWW, and the back link count is the number of URLs from the search engine's local repository pointing to this URL. The forward link count is computed at the time of parsing of the web pages. However, to estimate the back link count, the server refers to its existing database and checks how many pages of the current database refer to this page. Once the forward and back link counts are known, the priority (Pvalue) is computed using the following formula:

Pvalue = FwdLk - BkLk    (1)

where Pvalue = priority value, FwdLk = forward link count and BkLk = back link count.

URLs having a higher difference between FwdLk and BkLk are assigned a higher Pvalue. Initially, the forward link count holds higher weightage as there are no or very few URLs pointing to the current URLs, the database still being in a nascent stage, but as the database of indexed web pages grows, the weightage of the back link count gradually increases. This results in more URLs having a lower Pvalue and thus more URLs holding lower priority. This is important since it is not desired to assign high priority to a page which is already downloaded and to update it very frequently, as indicated by a high BkLk value, but we do want to prioritize the addition of new pages, which have a nil BkLk but a high FwdLk, as they lead to a higher number of pages. A similar ranking method is discussed in [5], in which only back link counts are considered. After calculating the priority, the ranking module sorts the list in descending order of priority, stores it in the sorted queue and sends a signal to the URL distributor.

This method is particularly useful as it also gives weightage to the current database and builds a quality database of indexed web pages even when the focus is to crawl the whole of the Web. It works efficiently both when the database is growing and when it is in the maturity stage. The method also works well for broken links, i.e. URLs having a zero forward link count: even if such a URL is referred to from pages in the database, its priority will always be negative, resulting in a low priority value.

The advantage of the above ranking method over others is that it does not require an image of the entire Web to know the relevance of a URL, as forward link counts are directly computed from the downloaded web pages whereas back link counts are obtained from the in-built repository.

3.1.3 URL Distributor

The URL distributor retrieves URLs from the sorted queue maintained by the ranking module. Sometimes a situation may arise wherein a highly referenced URL with a low forward link count may always find itself at the bottom of the sorted queue of URLs. To get rid of such situations, the URL list is divided into three almost equal parts on the basis of priority, in such a way that the top one-third of the URLs are sent to P_list1, the middle one-third to P_list2 and the bottom one-third to P_list3, from where the URL allocator assigns them to the respective client crawlers, as shown in Figure 2. For example, if the queue of unsorted URLs contains the URL list shown in Table 1, then after calculating the ranking of the URLs they are divided into three lists as shown in Table 2. From Table 2 one can see that the back link count is zero for all URLs, as no links from the local repository of the search engine point to them, the repository being in its initial state and almost empty.
Table 1: Unsorted URLs list received from client crawlers

URL                                               Forward link count   Back link count   Priority value (Pvalue)
http://www.jiit.ac.in                                     20                  0                  20
http://www.jiit.ac.in/jiit/files/RTI.htm                  18                  0                  18
http://www.jiit.ac.in/jiit/files/department.htm           11                  0                  11
http://www.jiit.ac.in/jiit/files/PROJ_SPONS.htm            4                  0                   4
http://www.jiit.ac.in/jiit/files/curri_obj.htm             3                  0                   3
http://www.google.co.in/                                  40                  0                  40
http://www.rediffmail.com                                 79                  0                  79
http://www.gmail.com                                      65                  0                  65
http://www.yahoo.com                                     155                  0                 155
http://www.ugc.ac.in/                                     60                  0                  60
http://www.ugc.ac.in/orgn/regional_offices.html           21                  0                  21
http://www.ugc.ac.in/policy/fac_dev.html                  13                  0                  13
http://www.ugc.ac.in/policy/payorder.html                 52                  0                  52
http://www.ugc.ac.in/policy/modelcurr.html                60                  0                  60
http://www.ugc.ac.in/inside/uni.html                      15                  0                  15
http://www.ugc.ac.in/contact/index.html                    5                  0                   5
http://www.ugc.ac.in/inside/fakealerts.html               16                  0                  16
http://www.ugc.ac.in/orgn/directory.html                  53                  0                  53

Table 2: URLs after sorting on the basis of Pvalue and assigned to the respective lists

URL                                               Forward link count   Back link count   Pvalue   P_list
http://www.yahoo.com                                     155                  0            155       1
http://www.rediffmail.com                                 79                  0             79       1
http://www.gmail.com                                      65                  0             65       1
http://www.ugc.ac.in/                                     60                  0             60       1
http://www.ugc.ac.in/policy/modelcurr.html                60                  0             60       1
http://www.ugc.ac.in/policy/payorder.html                 52                  0             52       1
http://www.ugc.ac.in/orgn/directory.html                  53                  0             53       2
http://www.google.co.in/                                  40                  0             40       2
http://www.ugc.ac.in/orgn/regional_offices.html           21                  0             21       2
http://www.jiit.ac.in                                     20                  0             20       2
http://www.jiit.ac.in/jiit/files/RTI.htm                  18                  0             18       2
http://www.ugc.ac.in/inside/fakealerts.html               16                  0             16       2
http://www.ugc.ac.in/inside/uni.html                      15                  0             15       3
http://www.ugc.ac.in/policy/fac_dev.html                  13                  0             13       3
http://www.jiit.ac.in/jiit/files/department.htm           11                  0             11       3
http://www.ugc.ac.in/contact/index.html                    5                  0              5       3
http://www.jiit.ac.in/jiit/files/PROJ_SPONS.htm            4                  0              4       3
http://www.jiit.ac.in/jiit/files/curri_obj.htm             3                  0              3       3
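To illustrate the ranking of equation (1) and the three-way split of section 3.1.3, the following minimal Python sketch computes Pvalue for a list of URLs and divides the sorted list into the three priority lists (function and variable names are illustrative, and the sample data is a small subset of Table 1):

def rank_and_split(urls):
    """urls: list of (url, forward_link_count, back_link_count) tuples."""
    ranked = sorted(
        ((url, fwd - bk) for url, fwd, bk in urls),   # Pvalue = FwdLk - BkLk
        key=lambda item: item[1],
        reverse=True,                                  # descending priority
    )
    third = (len(ranked) + 2) // 3                     # three almost equal parts
    p_list1 = ranked[:third]             # top one-third: highest priority
    p_list2 = ranked[third:2 * third]    # middle one-third
    p_list3 = ranked[2 * third:]         # bottom one-third
    return p_list1, p_list2, p_list3

# A few entries from Table 1 (back link counts are zero because the
# repository is still almost empty at this stage):
sample = [
    ("http://www.yahoo.com", 155, 0),
    ("http://www.rediffmail.com", 79, 0),
    ("http://www.jiit.ac.in", 20, 0),
    ("http://www.ugc.ac.in/contact/index.html", 5, 0),
]
p1, p2, p3 = rank_and_split(sample)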
3.1.4 URL Allocator

The basic function of the URL allocator is to select appropriate URLs from the sub-lists (P_list1, P_list2, P_list3) and assign them to client crawlers for downloading the respective web documents. According to the ranking algorithm discussed above, P_list1 contains the most relevant URLs, but P_list2 and P_list3 cannot be completely ignored. P_list2 contains URLs which either belong to web documents having both high forward and high back link counts, or have low values for both counts. Similarly, P_list3 consists mostly of URLs which have a higher back link count and a lower forward link count. Therefore, to avoid completely ignoring the URLs present in P_list2 and P_list3, the strategy used is that for every 4 URLs selected from P_list1, 2 URLs from P_list2 and 1 URL from P_list3 are assigned to client crawlers for downloading.
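The 4:2:1 selection policy of the URL allocator can be sketched as follows (a minimal illustration with hypothetical names; repeating the cycle once a list runs dry is an assumption of this sketch):

def allocate(p_list1, p_list2, p_list3):
    """Interleave URLs from the three priority lists in a 4:2:1 ratio."""
    batch = []
    quotas = ((p_list1, 4), (p_list2, 2), (p_list3, 1))
    while p_list1 or p_list2 or p_list3:
        for plist, quota in quotas:
            for _ in range(min(quota, len(plist))):
                batch.append(plist.pop(0))      # next URL handed to a client crawler
    return batch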
3.1.5 Indexer

The indexer module retrieves web documents and the corresponding URLs from the document and URL buffer. The documents are stored in the repository in search/insert fashion and a corresponding index is created for them. The index created for a document mainly consists of the keywords present in the document, the address of the document in the repository, the corresponding URL through which the web document was downloaded, checksum values and other information. Later, all this indexed information helps to produce appropriate results when users fire a search query on the search engine, as well as to detect whether the currently downloaded version is the same as or changed from the previous version existing in the repository.

3.1.6 Repository

It is the centrally indexed database of web pages, maintained by the server, which may later be used by search engines to answer the queries of end users. The incremental parallel crawler maintains an index of complete web documents in the repository along with other information. The efficiency with which the repository is managed affects the search results. As users need the most relevant results for a given search query, the repository and the corresponding information maintained for these documents play a major role, apart from the ranking algorithm, in producing the result.

Figure 2: Architecture of Multi Threaded server

3.2 Client Crawlers

Client crawlers collectively refer to all the different client machines interacting with the server. As mentioned earlier, there is no inter client crawler communication; all communication takes place between the MT server and an individual client crawler. In Figure 1 these client crawlers are represented as C-crawler 1, C-crawler 2 and so on. The detailed architecture of a client crawler is shown in Figure 3.

These client crawlers are involved in the actual downloading of the web documents. They depend on the MT server (URL allocator) for receiving the URLs for which they are supposed to download the web pages. After downloading the web documents for the assigned URLs, a client crawler parses and extracts all URLs present in them and puts them back into the document and URL buffer, from where they are extracted by the change detection module as well as the URL dispatcher. After preprocessing, the received URLs are assigned to client crawlers for further downloading. This whole process continues till no more URLs to be crawled are left.

The main components of a client crawler are:
• URL buffer
• Crawl worker

3.2.1 URL Buffer

It temporarily stores the URLs which are assigned to the client crawler by the URL allocator for downloading the corresponding web documents. It is implemented as a simple FIFO (first in first out) queue. The crawl worker retrieves URLs from this queue and starts the downloading process.
After downloading the web documents for the assigned URLs, the crawl worker parses their contents with the help of the parser for the following further applications:
• Each page has a certain number of links present in it. To maintain the index of back link counts, each link on that page is stored in the repository along with the source/parent URL in which it appears. The client crawler sends the pair of values (link, parent_URL) to the repository/database. When the same link reappears on some other page, only the name of the new parent URL needs to be added to the already existing entry of the link in the link index.
• The crawl worker also extracts all the links present on a web page to compute the forward link count, which it sends to the URL dispatcher. These counts are later used by the ranking module to compute the relevance of uncrawled URLs, based on which they are redistributed among client crawlers for further downloading, as discussed in sections 3.1.2 to 3.1.4.
• Another motive behind parsing the contents of web pages is to compute the page updating parameters, such as checksums, that are used by the change detection module to check whether the web page has changed or not.
3.2.2 Crawl Worker

The crawl worker is the major component of the client crawler in the proposed architecture of the incremental parallel web crawler. The working of the whole process of the crawl worker may be represented by the following algorithm:

Crawl_worker ()
{
Start
Repeat
    Pick up a URL from the URL buffer;
    Determine the IP address for the host name;
    Download the robots.txt file, which carries downloading permissions and also specifies the files to be excluded by the crawler;
    Determine the protocol of the underlying host, like http, ftp, gopher etc.;
    Based on the protocol of the host, download the document;
    Identify the document format, like .doc, .html or .pdf etc.;
    Parse the downloaded document and extract the links;
    Convert the URL links into their absolute URL addresses;
    Add the URLs and the downloaded document to the "document and URL buffer";
Until empty (URL buffer);
End
}
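A minimal, hedged Python sketch of the crawl worker loop above is given below; it handles only HTTP(S), uses the standard library, and the two buffers are plain lists standing in for the components of Figure 3 (all names are illustrative):

import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the targets of anchor tags, converted to absolute URLs."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl_worker(url_buffer, document_and_url_buffer):
    """Simplified crawl worker: fetch, parse and hand results back."""
    while url_buffer:
        url = url_buffer.pop(0)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(urljoin(url, "/robots.txt"))
        try:
            robots.read()                        # downloading permissions
            if not robots.can_fetch("*", url):
                continue                         # excluded by robots.txt
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                             # unreachable or broken URL is discarded
        extractor = LinkExtractor(url)
        extractor.feed(html)                     # parse the document and extract links
        document_and_url_buffer.append((url, html, extractor.links))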
After parsing, all downloaded web documents are stored in the document and URL buffer associated with the MT server. After keeping the documents in the document and URL buffer, the crawl worker sends an extract URLs & web page message to the change detection module and an extract URLs message to the MT server (URL dispatcher). After receiving the message, the change detection module and the URL dispatcher extract the web documents along with the URLs and other relevant information from the buffer.
The client crawler machines are robust enough to handle all types of web pages appearing on the web as well as the pages which are not allowed to be crawled [4]. They also automatically discard URLs that are referenced but no longer exist on the WWW.

Figure 3: Architecture of client crawler

3.3 Change Detection Module

The change detection module identifies whether two versions of a web document are the same or not and thus helps to decide whether the existing web page in the repository should be replaced with the changed one. Methods for detecting structural and content level changes in web documents have been proposed. There may be other types of changes also, like
behavioral and presentation changes. It may not always be required to test both types of changes at the same time. The latter, i.e. content level change detection, may be performed only when there are no changes detected by the structural change detection methods or the changes detected are minor. The details of the proposed methods for change detection are discussed in section 4.
3.4 Comparison of Proposed Architecture with Existing Architecture of Parallel WebCrawler

The architecture has been implemented as well as tested on about 2.5 million web pages from various domains. The results obtained thereof establish that the proposed incremental parallel web crawler does not have the problem of overlapping and that the downloaded pages are of high quality, thereby proving the efficiency of the ranking method developed. The performance of the proposed incremental parallel web crawler is compared with that of existing parallel crawlers [3]. The summary of the comparison is shown in Table 3. The proposed web crawler's performance was found to be comparable with [3].

Table 3: Comparison with existing parallel crawler architecture

Feature: Central coordinator
Existing parallel crawler: No central server; it works in a distributed environment.
Incremental parallel web crawler: It has a central coordinator in the form of the MT server, as it works on the client-server principle.

Feature: Overlapping
Existing parallel crawler: Has the overlapping problem, as an individual crawler does not have a global image of the downloaded web documents.
Incremental parallel web crawler: As the MT server assigns the URLs to be crawled to the client crawlers and has a global image of crawled/uncrawled URLs, overlapping is significantly reduced.

Feature: Quality of web pages
Existing parallel crawler: Being distributed in coordination, a crawler may not be aware of the collections of the others, which decreases the quality of the downloaded web pages.
Incremental parallel web crawler: URLs are provided by the server to the client crawlers on the basis of priority, so high priority pages are downloaded and thus high quality documents are available at the search engine side.

Feature: Priority calculation
Existing parallel crawler: The crawlers compute the priority of pages to be downloaded based on the local repository, which may differ from that computed on the global structure of web pages.
Incremental parallel web crawler: In our approach all web pages, after downloading, are sent back to the server along with the embedded URLs, so the priority is computed based on global information.

Feature: Scalability
Existing parallel crawler: The architecture is scalable as per requirements and resources.
Incremental parallel web crawler: It may also be scaled up depending upon resource constraints.

Feature: Network bandwidth
Existing parallel crawler: Due to the problem of overlapping, more network bandwidth is consumed in this architecture.
Incremental parallel web crawler: Due to the reduction of the overlapping problem, network bandwidth consumption is also reduced, at the cost of communication between the MT server and the client crawlers.

4. CHANGE DETECTION MODULE

In this section, novel mechanisms are proposed that determine whether a web document has changed from its previous version or not and by what amount. The hallmark of the proposed mechanisms is that changes may also be known at micro level, i.e. paragraph level. The following 4 types of changes may take place in a web document [24]:
1. Structural changes
2. Content level or semantic changes
3. Presentation or cosmetic changes
4. Behavioral changes

Sometimes the structure of a web page is changed by the addition/deletion of tags. Similarly, addition/deletion/modification of the link structure can also change the overall structure of the document (see Figure 4).

Content level or semantic changes refer to the situation where the page's contents are changed from the reader's point of view (see Figure 5).

Figure 4: Structural changes in versions of a web page
(a) Initial version:
<html><body> <p><b>----------------------------</b> </p>
<p><big>---------------------------------------</big></p>
<p><i>-----------------------------------------<ul><li>--------------------------</li>
<li> -------------------------</li></ul>
<u>--------------------------------</u>
</i></p></body></html>
(b) Changed version:
<html> <body>
<p><b><font ---------->------------</font> </b> </p>
<p><big>---------------------------------------</big></p>
<p><b>
---------------------------------------------------------------------------</b></p>
Figure 5: Content/semantic changes, (a) initial version and (b) changed version of a web page
(a)
<html> <head> <title> this is page title </title> </head>
<body> <center>Times of India News 19-Oct-2009 </center>
<ul><li>Delhi, Hurriyat talks can't succeed without us: Pak
<li>China wants better Indo-Pak ties, denies interference
<li>Ties with China not at India's expense: US
<li>Sachin slams 43rd Test ton; completes 30,000 runs</ul>
(b)
<html> <head> <title> this is page title </title> </head>
<body> <center>Times of India News 19-Oct-2009 </center>
<ul>
<li>India-Lanka 1st Test ends in a draw
<li>Maya asked to increase Rahul's security
<li>Ties with China not at India's expense: US
<li>Sachin slams 43rd Test ton; completes 30,000 runs
</ul>

Under the category of presentation or cosmetic changes, only the document's appearance is modified whereas the contents within the document remain intact. For example, by changing HTML tags, the appearance of a document can be changed without altering its contents (see Figure 6).

Figure 6: Presentation/cosmetic changes, (a) initial version and (b) changed version
(a)
<html> <body>
<p>India-Lanka 1st Test ends in a draw </p>
<p>Maya asked to increase Rahul's security </p>
<p> Ties with China not at India's expense: US
</body> </html>
(b)
<html><body>
<p style="background-color:#00FF00">
<u>India-Lanka 1st Test ends in a draw</u>
</p>
<p style="background-color: rgb(255,255,0)">
<u>Maya asked to increase Rahul's security</u></p>
<p style="background-color:yellow">
<u>Ties with China not at India's expense: US
</u></p>

Behavioral changes refer to modifications of the active components present in a document. For example, web pages may contain scripts, applets etc. as active components. When such hidden components change, the behavior of the document changes. However, it is difficult to catch such changes, especially when the code of these active components is hidden in other files.

So, in order to keep the repository of a search engine up-to-date, it is mandatory to detect the changes that take place in web documents. In the following sections, mechanisms are proposed that currently identify content and structural changes. The structural change detection methods are also able to detect presentation level changes, as these occur due to the insertion/modification/deletion of tags in the web documents.

4.1 Proposed Methods for Change Detection

This section is divided in two parts: the first part discusses the mechanisms that detect structural changes whereas the second part discusses the content level change detection mechanisms. It may be noted that content level change detection may be performed only when either there are no changes at the structural level or the changes at the structural level are minute, thereby saving computational time.

4.1.1 Methods for Detecting Structural Changes

The following two separate methods have been designed for the detection of structural changes occurring within a document:
• Document tree based structural change detection method, and
• Document fingerprint based structural change detection method.

4.1.1.1 Document Tree Based Structural Change Detection Method

This method works in two steps. In the first step a document tree is generated for the downloaded web page, while in the second step a level by level comparison between the trees is performed. The downloaded page is stored in the repository in search/insert fashion as given below:
Step 1: Search the downloaded document in the repository.
Step 2: If the document is found, then compare both versions of the document for structural changes using a level by level comparison of their respective document trees.
Step 3: Else store the document in the repository.

Generation of the tree is based on the nested structure of the tags present in the web page. After parsing the downloaded web page, the parser extracts all the tags present in it. The tags are then arranged in the order of their hierarchical relationship and finally the document tree is generated with the help of the tree generator. All the tags which are nested at the same level in the web page should be at the same level in the document tree too. Each node of the document tree, representing a tag, consists of several fields which keep various information about the tag. The structure of a node of the document tree is as given below:

Tag_name | Child | Level_no | No_of_siblings

Where:
• Tag_name: This field of the node structure stores the name of the tag.
• Child: This field contains information about the children of the node.
• Level_no: This field contains the level number at which the node appears in the constructed document tree.
• No_of_siblings: This field contains the total number of nodes present at that level.
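A minimal sketch of the level by level comparison is given below; it records only the number of tags appearing at each nesting depth, a simplification of the full node structure described above, and it assumes reasonably well-formed HTML (all names are illustrative):

from html.parser import HTMLParser

class LevelCounter(HTMLParser):
    """Counts how many tags (siblings) appear at each nesting level."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.counts = {}                  # level number -> number of tags

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.counts[self.depth] = self.counts.get(self.depth, 0) + 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

def level_structure(html):
    counter = LevelCounter()
    counter.feed(html)
    return counter.counts

def structurally_changed(old_html, new_html):
    """True when the per-level sibling counts of the two versions differ."""
    return level_structure(old_html) != level_structure(new_html)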
For example, consider the document tree given in Figure 8, generated for the initial version of a web page shown in Figure 7, whose tag structure is shown in Figure 11 (a). Let us assume that later some structural changes occurred in the web page, as shown in Figure 9, whose tag structure is shown in Figure 11 (b); the resultant document tree for the changed web page is shown in Figure 10. A level by level comparison of the two trees is carried out and the result obtained is tabulated in Table 4. The table contains a listing of the number of nodes/siblings at different levels. From Table 4 it can be concluded that the number of nodes has changed at levels 3, 4 and 5, indicating that the document has changed.

Figure 7: Initial version of web page
Figure 8: Document tree for initial version of web page
Figure 9: Changed version of web page
Figure 10: Document tree for changed version of web page

Figure 11: HTML tag structure of (a) initial version and (b) changed version of the web page
(a)
<Html> <Head><Title>-----------------</Title> </Head>
<Body> <Center><h1>-------------------</h1> </Center>
<Center><b>--------------------------</b> </Center>
<Center>-----------------------</Center>
<Center> <h3>----------------------</h3> </Center>
<p>--------------------------------------<br>---------------------------------<br>-----------------------------------------<Center><b>------------------</b>-----------------</Center>
</p> <h2>----------------------------------</h2>
</Body> </Html>
(b)
<Html> <Head><Title>------------------</Title> </Head>
<Body> <Center><h1>------------</h1> </Center>
<Center><b>-----------------------</b> </Center>
<Center>----------------------------<br>---------------</Center>
<Center><b>-----------------------</b>---------------</Center>
<Center><h3>--------------</h3></Center>

Table 4: Level structure using BFS for the initial and modified trees

Level_no   No. of siblings in initial web page   No. of siblings in changed web page
1                          1                                     1
2                          2                                     2
3                          7                                     8
4                          7                                     5
5                          1                                     0

For further details about the changes, such as how many tags have been added/deleted and at what level in the document tree, the remaining fields of the node structure may be used. The details generated for the document trees (Figure 8 and Figure 10) are shown in Table 5. After analyzing the document tree based method for structural change detection, the inferences tabulated in Table 6 are drawn.

Table 5: Node wise attribute details of the initial and the modified tree using level order traversal

Attribute: Level_no
For initial version: 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5
For changed version: 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4

Attribute: Tag_name
For initial version: Html, Head, Body, Title, Center, Center, Center, Center, P, H2, H1, B, Br, H3, Br, Br, Center, B
For changed version: Html, Head, Body, Title, Center, Center, Center, Center, Center, P, H2, H1, B, Br, B, H3

Attribute: Child
For initial version: 2, 1, 6, null, 1, 2, null, 1, 3, null, null, null, null, null, null, null, 1, null
For changed version: 2, 1, 7, null, 1, 2, null, 1, 1, null, null, null, null, null, null, null

Table 6: Inferences on structural changes

Change detected: No sibling added/deleted at any level
Inference drawn: No structural change between the two versions of the document

Change detected: Leaf siblings added/deleted
Inference drawn: Minor changes between the two versions of the document

Change detected: Siblings having children added/deleted
Inference drawn: Major change has occurred*

*This detection helps in locating the area of major change within documents.

The document tree based structural change detection method is efficient and guarantees to detect the structural changes perfectly if the constructed tree represents the true hierarchical relationship among the tags.

4.1.1.2 Document Fingerprint Method for Structural Change Detection

This method generates two separate fingerprints, in the form of strings, for each web document based on its structure. To generate the fingerprints, all opening tags present in the web page are arranged in the order of their appearance in the web document whereas all closing tags are discarded. The first fingerprint contains the characters appearing at the first position of each tag, in the order the tags appear in the web page, whereas the second fingerprint contains the characters appearing at the last position of each tag, in the same order. For single character tags, the same character is repeated in both fingerprints.

For example, if the initial version of a web page is as shown in Figure 12 and it later changes as shown in Figure 13, the tag structures of the two
versions are as shown in Figure 14 and Figure 15 respectively. Applying the above scheme, Table 7 is generated. Comparing the fingerprint1 of the two versions, it may be concluded that the web page has changed, as its fingerprint has changed. The comparison of fingerprint2 of the two web pages is required to add surety to the method, as the comparison of fingerprint1 may fail in the unlikely case of a tag starting with some character being replaced with another tag starting with the same character.
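A minimal sketch of the fingerprint generation described above (opening tags only; the first characters of the tag names form fingerprint1 and the last characters form fingerprint2; the regular expression based tag extraction is an assumption of this illustration):

import re

def fingerprints(html):
    """Return (fingerprint1, fingerprint2) built from the opening tags of a page."""
    # opening tags only; closing tags such as </p> are discarded
    tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html)
    fingerprint1 = "".join(tag[0].lower() for tag in tags)    # first character of each tag
    fingerprint2 = "".join(tag[-1].lower() for tag in tags)   # last character of each tag
    return fingerprint1, fingerprint2

# Example: for "<html><head><title>t</title></head><body><p><b>x</b></p></body></html>"
# the function returns ("hhtbpb", "ldeypb").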
Figure 12: Initial version of web page
Figure 13: Changed version of web page
Figure 14: Tag structure for initial version
Figure 15: Tag structure for changed version

Table 7: Fingerprints for versions of the web page using the document fingerprint method

Version   Fingerprint1                                        Fingerprint2
Initial   hhtbtttpittttttipbfpfbfbbbbfbuulpfbfaflpfafp        ldeyerdpgrddrddgpbtptrtrrrrtrlliptbtatiptatp
Changed   hhtbtttpittttttipbfuulpfbfaflpfbfaflpfbfaflpfbfaf   ldeyerdpgrddrddgpbtlliptbfatiptbtatipfbtatiptbtat

4.1.1.3 Comparison between Document Tree Based and Fingerprint Based Structural Change Detection Methods

Though both methods discussed for detecting structural changes perform efficiently and detect changes even at a minor level, their performance with respect to certain parameters, such as time/space complexity, differs; the comparison is given in Table 8.
Table 8: Comparison between document tree based and fingerprint based methods

1. Document tree based method: This method performs well and is able to detect structural changes even at the minute level.
   Document fingerprint based method: This method is also able to detect changes even at the minute level.

2. Document tree based method: Special attention is required to handle the presence of optional tags in the web document, as it becomes difficult to maintain the nested structure among tags.
   Document fingerprint based method: No effect of optional tags, as only opening tags are considered whereas all closing tags are discarded. The fingerprint, generated in the form of a string, contains information about all opening tags in their order of appearance in the document.

3. Document tree based method: The presence of misaligned tag structures in the web document may cause problems while establishing the true hierarchical relationship among the tags.
   Document fingerprint based method: This method also suffers from the presence of misaligned tag structures, as it may produce different fingerprints for the same structure.

4. Document tree based method: Time as well as space complexity of the document tree based method is higher. Creating and comparing two trees consumes more time than creating and comparing fingerprints. As each node of the tree contains many fields keeping various information about the tags, the document tree also requires more storage space.
   Document fingerprint based method: This method is better in terms of both time and space complexity as compared to the document tree based method. Fingerprints are in the form of character strings, which require less storage space.

5. Document tree based method: This method helps in locating the area of minor/major changes within a document.
   Document fingerprint based method: With the fingerprint based method, it is difficult to locate the area of major/minor change within a document.
4.1.2 Methods for Detecting Content Level Changes

The content level change detection mechanism may be carried out only after the methods discussed in section 4.1.1 for structural changes do not detect any changes. Content level changes cannot be captured by the methods discussed for structural level changes. The following two different methods are being proposed for identifying content level changes:
• Root Mean Square (RMS) based content level change detection method, and
• Checksum based content level change detection method.

4.1.2.1 Root Mean Square Based Content Level Change Detection Method

This method computes a Root Mean Square (RMS) value as a checksum for the entire web page contents as well as for the various paragraphs appearing in the page. While updating the page in the repository, a comparison between the checksums of both versions of the web page is performed and, in case of an anomaly, it is concluded that the web page at the web server end has been modified as compared to the local copy maintained in the repository at the search engine end. Thus the document needs to be updated. Paragraph checksums help to detect changes at micro level and help in identifying the locality of changes within the document.

The following formula has been developed to calculate the RMS value:

RMS = sqrt((a1^2 + a2^2 + ... + an^2) / n)    (2)

where a1, a2, ..., an are the ASCII codes of the symbols and n is the number of distinct symbols/characters, excluding the white space present in the web page.

Consider the two versions of a web page from the Times of India news site (http://timesofindia.indiatimes.com/), taken within an interval of three hours on 28-10-11, as shown in Figures 16 and 17 respectively. Their page checksums (RMS values) computed using the above method are shown in Table 9. From the table it can be concluded that the content of the two versions of the web page had changed. The web page was subsequently monitored many times from 28-10-11 to 29-10-11 and the data is recorded in Table 10. Due to space constraints it is not possible to include all the monitored web pages.
Figure 16: Initial version of the web page
Figure 17: Changed version of the web page

Table 9: Checksum values for two versions of the web page (per version: RMS value of the entire page, distinct symbol count of the page, then per-paragraph distinct symbol count and paragraph RMS)

Initial version: page RMS 161.176, 43 distinct symbols; P1: 33, 355.779; P2: 29, 346.451; P3: 27, 325.086; P4: 32, 261.167; P5: 26, 154.898; P6: 24, 168.383
Changed version: page RMS 131.122, 41 distinct symbols; P1: 33, 356.838; P2: 29, 317.343; P3: 25, 291.005; P4: 27, 168.39; P5: 25, 177.651

Table 10: Checksum table for different versions of the web page (per version: RMS of page, distinct symbol count of the page, then per-paragraph distinct symbol count and paragraph RMS)

V1: page RMS 161.176, 43 distinct symbols; P1: 33, 355.779; P2: 29, 346.451; P3: 27, 325.086; P4: 32, 261.167; P5: 26, 154.898; P6: 24, 168.383
V2: page RMS 131.122, 41 distinct symbols; P1: 33, 356.838; P2: 29, 317.343; P3: 25, 291.005; P4: 27, 168.39; P5: 25, 177.651
V3: page RMS 126.053, 41 distinct symbols; P1: 32, 381.304; P2: 31, 288.686; P3: 31, 266.965; P4: 27, 160.276; P5: 24, 163.306
V4: page RMS 129.4951, 40 distinct symbols; P1: 30, 364.99; P2: 27, 341.317; P3: 28, 260.205; P4: 27, 166.285; P5: 24, 162.152
V5: page RMS 121.0349, 35 distinct symbols; P1: 30, 364.55; P2: 27, 264.664; P3: 24, 272.749; P4: 26, 168.091; P5: 26, 140.293
V6: page RMS 116.56, 36 distinct symbols; P1: 30, 346.89; P2: 29, 284.38; P3: 26, 242.87; P4: 28, 142.487; P5: 25, 148.874
V7: page RMS 116.21, 36 distinct symbols; P1: 29, 337.37; P2: 29, 284.38; P3: 26, 242.97; P4: 28, 142.487; P5: 25, 148.87

Based on the data recorded in Table 10, the following inferences are drawn about the RMS method for content level change detection:
• It produces a unique checksum (RMS value) for the entire web page as well as for the paragraph level contents of a web document.
• The formula developed is capable of accurately detecting even minor changes such as the addition/deletion of a few words.
• Paragraph checksums help to identify changes at a smaller level, i.e. at paragraph level.
• It may be noted that ASCII values have been used in the formula because each symbol has a unique representation in the ASCII table, which leads to no ambiguity.

Though the above method performs efficiently and guarantees to detect content level changes among different versions of web pages, it assigns uniform weightage to the entire contents of the web document, whereas in real life changes in some contents carry more weightage than others. For example, changes in the main headings of a web document carry more weightage than any other content change. Keeping these points in mind, the modified, checksum based content level change detection method is proposed.

4.1.2.2 Checksum Based Content Level Change Detection Method

Similar to the previous method, the checksum based content level change detection method also produces a single checksum for the entire web page as well as one for each paragraph present in the web document. The paragraph level checksums help to know the changes at micro level, i.e. paragraph level. By comparing the checksums of different versions of a web page, it is known whether the web page content has changed or not.

The following formula has been developed to calculate the checksum:

Checksum = Σ (I_parameter * ASCII code * K-factor)    (3)

where I_parameter = importance parameter, ASCII code = ASCII code of each symbol (excluding white space) and K-factor = scaling factor.

In the implementation, the font size was considered as the I_parameter. The purpose of introducing the scaling factor is to control the size of the checksum value; in this work the scaling factor was taken as 0.0001.
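A minimal sketch of the weighted checksum of equation (3); pairing every character with an importance value such as its font size is an assumption of this illustration, since the pairing mechanism is not prescribed above:

K_FACTOR = 0.0001   # scaling factor used in the implementation described above

def weighted_checksum(chars_with_importance):
    """chars_with_importance: iterable of (character, importance) pairs,
    e.g. (letter, font size); white space is excluded from the sum."""
    return sum(importance * ord(ch) * K_FACTOR
               for ch, importance in chars_with_importance
               if not ch.isspace())

# Example: a heading rendered at font size 18 followed by body text at size 12.
page = [(ch, 18) for ch in "News headline"] + [(ch, 12) for ch in "story text"]
page_checksum = weighted_checksum(page)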
The above method was tested on a number of web documents, both online and offline. One such example is shown in Figure 18 and Figure 19 for two versions of a web document. The checksums for the entire page as well as for the different paragraphs present within the document, computed using the above method, are shown in Table 11. From the results shown in the table, it is clear that different checksums are produced if the versions of a web document are not the same. By monitoring the paragraph checksums, it is clear that the changes occurred only in the first paragraph (P1) whereas the rest are unchanged, and the same is reflected in the paragraph checksums recorded in the table.
Figure 18: Initial version of web page
Figure 19: Changed version of the web page

Table 11: Checksum values for the web page (per version: page checksum, distinct symbol count of the page, then per-paragraph distinct symbol count and checksum)

Initial version: page checksum 586.806, 36 distinct symbols; P1: 30, 339.439; P2: 29, 318.777; P3: 26, 312.316; P4: 28, 238.419; P5: 25, 256.380
Changed version: page checksum 585.671, 36 distinct symbols; P1: 29, 340.564; P2: 29, 318.777; P3: 26, 312.316; P4: 27, 238.419; P5: 24, 256.380

This method was also tested on the same set of web pages on which the previous (RMS based) method was tested, and the results produced are tabulated in Table 12. Monitoring the checksum results recorded in the table and looking at the corresponding contents of the web pages, it may be concluded that the method is efficient and guarantees to detect content level changes even if there are only minor changes among different versions of the web document.
Table 12: Checksum values for different versions of the web page (per version: page checksum, distinct symbol count of the page, then per-paragraph distinct symbol count and checksum)

V1: page checksum 625.616, 43 distinct symbols; P1: 33, 330.995; P2: 29, 339.490; P3: 27, 357.407; P4: 32, 295.606; P5: 26, 258.498; P6: 24, 280.597
V2: page checksum 586.280, 41 distinct symbols; P1: 33, 330.914; P2: 29, 339.495; P3: 25, 351.344; P4: 26, 262.969; P5: 25, 282.021
V3: page checksum 587.837, 41 distinct symbols; P1: 32, 353.926; P2: 31, 321.996; P3: 31, 309.317; P4: 26, 262.963; P5: 24, 282.439
V4: page checksum 588.829, 40 distinct symbols; P1: 30, 346.411; P2: 27, 366.823; P3: 28, 312.733; P4: 27, 263.958; P5: 23, 274.968
V5: page checksum 607.874, 35 distinct symbols; P1: 30, 346.317; P2: 27, 318.672; P3: 24, 345.231; P4: 26, 270.575; P5: 26, 247.605
V6: page checksum 586.806, 36 distinct symbols; P1: 30, 339.439; P2: 29, 318.777; P3: 26, 312.316; P4: 28, 238.419; P5: 25, 256.381
V7: page checksum 585.671, 36 distinct symbols; P1: 29, 340.564; P2: 29, 318.777; P3: 26, 312.316; P4: 27, 238.419; P5: 24, 256.381
E-ISSN: 1817-3195
Table 13: Comparison of results with well known message
digests algorithms
Name
Of
Metho
d
Both, RMS based and checksum based methods have
been compared with well known message digest
algorithms such as MD5 and SHA-X1, used to
calculate checksum for web pages search engine’s
web crawlers. One such result for an input set [31] is
shown in Table 13 and the comparison is made in
Table 14. On seeing the table, it can be observed that
with MD5 and SHA-X1 algorithm, the entire
checksum gets changed at large scale even though a
single character/symbol of the input gets modified,
whereas the checksum produced by the RMS and
checksum based methods, do not get changed at that
large scale on the modification of single
character/symbol in the input set. Even on the
modification of few white spaces, the entire message
digest gets changed in MD5 and SHA-X1 algorithm
whereas it does not affect the checksum produced by
RMS based and checksum based method as white
spaces are not considered while calculating the
checksum.
Table 13: Comparison of results with well known message digest algorithms
Initial input for every method: "The quick brown fox jumps over the lazy dog".
Changed input: "The quick brown fox jumps over the lazy cog" (for MD5 the changed input is "The quick brown fox jumps over the lazy dog.", i.e. a full stop is appended).

MD5 (128-bit digest)
  Initial: 9e107d9d372bb6826bd81d3542a419d6
  Changed: e4d909c290d0fb1ca068ffaddf22cbd0
SHA-1 (160-bit digest)
  Initial: 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
  Changed: de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3
SHA-224 (224-bit digest)
  Initial: 730e109bd7a8a32b1cb9d9a09aa2325d2430587ddbc0c38bad911525
  Changed: fee755f44a55f20fb3362cdc3c493615b3cb574ed95ce610ee5b1e9b
SHA-256 (256-bit digest)
  Initial: d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592
  Changed: e4c4d8f3bf76b692de791a173e05321150f7a345b46484fe427f6acc7ecc81be
SHA-512 (512-bit digest)
  Initial: 07e547d9586f6a73f73fbac0435ed76951218fb7d0c8d788a309d785436bbb642e93a252a954f23912547d1e8a3b5ed6e1bfd7097821233fa0538f3db854fee6
  Changed: 3eeee1d0e11733ef152a6c29503b3ae20c4f1f3cda4cb26f1bc1a41f91c7fe4ab3bd86494049e201c4bd5155f31ecb7a3c8606843c4cc8dfcab7da11c8ae5045
Root Mean Square based method (no fixed-length digest)
  Initial: 127.329
  Changed: 129.820
Checksum based method (no fixed-length digest)
  Initial: 43.356
  Changed: 43.344
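For the standard algorithms, the digests listed in Table 13 can be reproduced directly with Python's hashlib module; the snippet below is only an illustration of how such reference values may be generated. The RMS based and checksum based values come from the proposed methods and are not covered by hashlib.

```python
# Reproduce the MD5/SHA digests of Table 13 with the standard library.
import hashlib

initial = b"The quick brown fox jumps over the lazy dog"
changed = b"The quick brown fox jumps over the lazy cog"  # a single character differs

for name in ("md5", "sha1", "sha224", "sha256", "sha512"):
    h_initial = hashlib.new(name, initial).hexdigest()
    h_changed = hashlib.new(name, changed).hexdigest()
    print(f"{name}: {h_initial}")
    print(f"{name}: {h_changed}")   # completely different digest for a 1-char change
```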
Table 14: Performance comparison of the proposed methods with well known message digest methods

1. MD5/SHA-1, 224, 256, 512: These message digest algorithms produce fixed-length checksum codes which are comparatively large.
   Methods developed: They do not produce a fixed-length checksum, and the checksums produced are smaller.

2. MD5/SHA-1, 224, 256, 512: They consider the whole contents of the input; even white spaces are counted, so on insertion/deletion of white spaces the checksum changes entirely.
   Methods developed: Contents enclosed within tags, as well as white spaces, are not considered.

3. MD5/SHA-1, 224, 256, 512: There is a restriction on the input size; e.g. for MD5 the input should not be greater than 2^64 bits, and similarly for the others.
   Methods developed: There is no restriction on the input size.

4. MD5/SHA-1, 224, 256, 512: These methods are very sensitive; the complete checksum changes even when a single character of the input is changed, so no conclusion can be drawn about the amount of change from the checksum.
   Methods developed: Only small changes occur in the final checksum if the input changes at micro level, i.e. by a few characters, so the amount of change between two versions of a document may be estimated from the checksums.

5. MD5/SHA-1, 224, 256, 512: They are more suitable for cryptographic applications.
   Methods developed: They are not suitable for cryptographic applications but are suitable for change detection.

6. MD5/SHA-1, 224, 256, 512: They generate a single checksum for the whole web page, so nothing can be inferred about changes at micro level, i.e. paragraph level, within documents.
   Methods developed: They generate a checksum for the whole page as well as for the individual paragraphs present in the web page, so changes at micro (paragraph) level may be identified as well.

In brief, the methods proposed for structural as well as content level change detection may be summarized as follows:
• Though the document tree based structural change detection method efficiently identifies structural changes, its time and space complexity is high. If the nesting of tag structures is misaligned, it becomes very difficult to handle the situation while constructing the document tree. Some mechanism has been proposed [30] to handle misaligned tags, but its performance is still questionable.
• Optional as well as misaligned tags have no impact on the performance of the document fingerprint based method for structural change detection, as it considers the opening tags only and discards all closing tags (a minimal sketch of this idea is given below). Its performance is also better in terms of space and time complexity, since storing the fingerprint, which is a string, requires less space than storing a document tree, and generating and comparing fingerprints is much easier than doing so for trees.

The change detection module was integrated into the incremental parallel web crawler architecture as shown in Figure 1. It was observed that the methods discussed above for both structural and content level changes perform efficiently and guarantee to detect the changes. We also tested the methods offline on individual web pages and observed similar performance.
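A minimal sketch of the fingerprint idea summarized above is given below: only opening tags contribute to the fingerprint string, so missing or misaligned closing tags do not affect it. The separator and the plain string comparison are assumptions of this sketch, not the paper's exact implementation.

```python
# Fingerprint-based structural change detection (illustrative sketch).
from html.parser import HTMLParser

class TagFingerprint(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)              # opening tags only; closing tags are ignored

def fingerprint(html_text):
    parser = TagFingerprint()
    parser.feed(html_text)
    return "/".join(parser.tags)           # e.g. "html/body/p/b/p"

def structure_changed(old_html, new_html):
    return fingerprint(old_html) != fingerprint(new_html)

if __name__ == "__main__":
    old = "<html><body><p>Hi</p><p>There</p></body></html>"
    new = "<html><body><p>Hi</p><p>There</p><p>New para</p></body></html>"
    print(structure_changed(old, new))     # True: a <p> node was inserted
```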
5. CONCLUSION AND FUTURE WORK

5.1 Conclusion
The work presented in this paper is divided into two parts. In the first part a novel architecture for incremental parallel crawlers has been proposed, whereas in the second part methods have been developed that detect whether two versions of a web document differ, thereby helping to refresh the web documents and keep the repository at the Search engine side up-to-date.
The novel architecture proposed for the incremental parallel web crawler helps to solve the following challenging problems, which are still faced by almost every Search engine while running multiple crawlers in parallel to download web documents for its repository:
• Overlapping of web documents
• Quality of downloaded web documents
• Network bandwidth/traffic
5.1.1 Overlapping of Web Documents
The overlap problem occurs when multiple crawlers running in parallel download the same web document multiple times, because one web crawler may not be aware that another has already downloaded the page.
In the proposed architecture the entire downloading process of web documents is performed under the coordination of the Multi Threaded server, and therefore no URL is assigned simultaneously to more than one client crawler executing in parallel. With this approach the server has a global image of all the downloaded web documents, and thus the overlapping problem is reduced.
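The coordination idea can be illustrated with a small sketch in which the server keeps one global set of already-assigned URLs and hands each URL to exactly one client crawler. The class and method names below are hypothetical and not the paper's implementation.

```python
# Hypothetical sketch of the MT server's URL coordination (overlap avoidance).
import threading
from queue import Queue, Empty

class MTServerDispatcher:
    def __init__(self, seed_urls):
        self._pending = Queue()
        self._assigned = set()            # global image of URLs already handed out
        self._lock = threading.Lock()
        for url in seed_urls:
            self._pending.put(url)

    def next_url(self):
        """Return a URL to a requesting client crawler; each URL is handed out once."""
        while True:
            try:
                url = self._pending.get_nowait()
            except Empty:
                return None               # nothing left to crawl for now
            with self._lock:
                if url not in self._assigned:
                    self._assigned.add(url)
                    return url            # no other client crawler will receive it

    def add_extracted(self, urls):
        """Client crawlers report newly extracted URLs back to the server."""
        with self._lock:
            new = [u for u in urls if u not in self._assigned]
        for u in new:
            self._pending.put(u)
```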
5.1.2 Quality of Downloaded Web Documents
The quality of downloaded documents can be ensured only when web pages of high relevance are downloaded by the crawlers. Therefore, to download such relevant web pages at the earliest, the multiple crawlers running in parallel must have a global image of the collectively downloaded web pages.
In the proposed architecture, the ranking algorithm developed computes the relevance of the URLs to be downloaded on the basis of the global image of the collectively downloaded web documents. It computes relevance from the forward link count as well as the back link count, whereas the existing method is based only on the back link count. The advantage of the proposed ranking method over the existing one is that it does not require an image of the entire Web to know the relevance of a URL, as the "forward link count" is computed directly from the downloaded web pages, whereas the "back link count" is obtained from the in-built repository.
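As an illustration, a candidate URL's relevance could be estimated as a weighted combination of its forward link count (computed from the pages downloaded so far) and its back link count (looked up in the repository). The equal weights used below are an assumption of this sketch, not the paper's exact formula.

```python
# Illustrative URL ranking from forward and back link counts (weights assumed).
from collections import Counter

def rank_urls(candidate_urls, downloaded_outlinks, repository_backlinks,
              w_forward=0.5, w_back=0.5):
    # forward link count: how often each candidate is linked from pages already
    # downloaded in the current crawl
    forward = Counter(link for links in downloaded_outlinks.values() for link in links)
    scores = {
        url: w_forward * forward.get(url, 0) + w_back * repository_backlinks.get(url, 0)
        for url in candidate_urls
    }
    # highest score first: these URLs are dispatched to client crawlers earliest
    return sorted(candidate_urls, key=lambda u: scores[u], reverse=True)

if __name__ == "__main__":
    downloaded_outlinks = {"a.html": ["b.html", "c.html"], "d.html": ["b.html"]}
    repository_backlinks = {"b.html": 3, "c.html": 1}
    print(rank_urls(["b.html", "c.html"], downloaded_outlinks, repository_backlinks))
```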
5.1.3 Network Bandwidth/Traffic
In order to maintain quality, the crawling process is carried out using either of the following approaches:
• crawlers are generously allowed to communicate among themselves, or
• they are not allowed to communicate among themselves at all.
In the first approach network traffic increases because the crawlers communicate among themselves frequently to reduce the overlap problem, whereas in the second approach, if they are not allowed to communicate at all, the same web document may be downloaded multiple times, thereby consuming network bandwidth. Thus, both approaches put an extra burden on network traffic.
The proposed architecture helps to reduce overlapping, and because all communications take place through the MT Server there is no need for direct communication among client crawlers. Both of these facilities help in reducing network traffic significantly.
5.1.4 Change Detection Methods for Refreshing Web Documents
In the second part of the work, change detection methods have been developed to refresh web documents by detecting structural, presentation and content level changes. The structural change detection methods also help in detecting presentation changes, as the presentation of a web document gets modified through insertion/deletion/modification of its tag structure.
Two different schemes, document tree based and document fingerprint based, have been developed for structural change detection. The document tree based scheme works efficiently and guarantees to detect structural changes. Apart from providing details about the structural changes within documents, it also helps in locating the area of major/minor change; for instance, it reports how many nodes have been inserted or deleted and at what level (a level-wise counting sketch is given at the end of this subsection). In the document fingerprint based scheme, two fingerprints in the form of strings are generated from the tag structures of the two versions of the web page. The time and space complexity of the document tree based scheme is higher than that of the fingerprint based scheme: generating and comparing fingerprints in the form of strings requires less time than generating and comparing two trees, and the space required to store the node details (a node consists of multiple fields) in the document tree is high in comparison to the space required to store the fingerprints. Though the fingerprint based scheme is efficient in terms of space and time complexity and also guarantees to detect structural changes even at micro level, it cannot report the details of the changes, for which the document tree based scheme is better.
Similar to the structural change detection schemes, two different schemes, Root Mean Square based and Checksum based, have been developed for content level change detection. Both schemes are based on the ASCII values of symbols, and both generate a checksum for the entire web page as well as for its individual paragraphs. The checksum for the entire web page gives a picture of content level changes at page level, whereas through the paragraph checksums changes at micro level, i.e. paragraph level, can be detected. Though both schemes are efficient and guarantee to detect changes even at micro level, the former scheme considers changes in the contents uniformly, whereas the latter scheme assigns different weightage to the different contents present in the web page based on their font size.
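The level-wise reporting of the document tree based scheme can be sketched as follows: count the element nodes at each depth of the two versions and report how many nodes appear to have been inserted or deleted at each level. This flat per-level count is a simplification of a full tree comparison, offered only as an illustration.

```python
# Illustrative level-wise node counting for structural change reporting.
from collections import Counter
from html.parser import HTMLParser

class LevelCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.counts = Counter()            # depth -> number of element nodes

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.counts[self.depth] += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

def level_counts(html_text):
    parser = LevelCounter()
    parser.feed(html_text)
    return parser.counts

def report_level_changes(old_html, new_html):
    old, new = level_counts(old_html), level_counts(new_html)
    for level in sorted(set(old) | set(new)):
        diff = new[level] - old[level]
        if diff:
            action = "inserted" if diff > 0 else "deleted"
            print(f"level {level}: {abs(diff)} node(s) {action}")
```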
5.2 Future Work
The work presented in this paper can be extended along the following lines with respect to the incremental parallel web crawler:
• Behavioral changes have not been discussed and are left for future work.
• Changes in the link of an image hyperlink can be detected through the structural change detection methods discussed, but if the image itself is replaced or modified, that cannot be detected using these methods.
• The architecture, along with the change detection methods, can be synchronized with the frequency of change of web documents to get better results, as web documents from different domains change at different intervals.
REFERENCES:
[1] Brin, Sergey, and Page, Lawrence, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", In Proceedings of the Seventh World-Wide Web Conference, 1998.
[2] Maurice de Kunder, "Size of the World Wide Web", available at: http://www.worldwidewebsize.com (accessed May 10, 2012).
[3] Cho, Junghoo, and Garcia-Molina, Hector, "Parallel Crawlers", Proceedings of the 11th International Conference on World Wide Web (WWW '02), Honolulu, Hawaii, USA, ACM Press, pp. 124-135, 2002.
[4] Robots exclusion protocol, available at: http://info.webcrawler.com/mak/projects/robots/exclusion.html
[5] Cho, Junghoo, Garcia-Molina, Hector, and Page, Lawrence, "Efficient crawling through URL ordering", Proceedings of the 7th World-Wide Web Conference, pp. 161-172, 1998.
[6] Cho, Junghoo, and Garcia-Molina, Hector, "Effective Page Refresh Policies for Web Crawlers", ACM Transactions on Database Systems, Vol. 28, Issue 4, pp. 390-426, December 2003.
[7] Jones, George, "15 World-Widening Years", InformationWeek, Sept 18, 2006 issue.
[8] Berners-Lee, Tim, "The World Wide Web: Past, Present and Future", MIT, USA, Aug 1996, available at: http://www.w3.org/People/Berners-Lee/1996/ppf.html
[9] Berners-Lee, Tim, and Cailliau, R., "WorldWideWeb: Proposal for a Hypertext Project", CERN, October 1990, available at: http://www.w3.org/Proposal.html
[10] Berners-Lee, Tim, Cailliau, R., Groff, J-F, and Pollermann, B. (CERN), "World-Wide Web: The Information Universe", Electronic Networking: Research, Applications and Policy, Vol. 2, No. 1, Meckler Publishing, Westport, CT, USA, 1992.
[11] Berners-Lee, Tim, and Cailliau, R., "World Wide Web", invited talk at the conference Computing in High Energy Physics 92, France, Sept 1992.
[12] Berners-Lee, Tim, Cailliau, R., Groff, J-F, and Pollermann, B. (CERN), "World-Wide Web: An Information Infrastructure for High-Energy Physics", Proceedings of Artificial Intelligence and Software Engineering for High Energy Physics, La Londe, France, World Scientific, Singapore, January 1992.
[13] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., "Searching the Web", ACM Transactions on Internet Technology, Vol. 1, No. 1, pp. 2-43, August 2001.
[14] Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D., "Approximating aggregate queries about web pages via random walks", In Proceedings of the 26th International Conference on Very Large Data Bases, 2000.
[15] Bharat, K., and Broder, A., "Mirror, mirror on the web: A study of host pairs with replicated content", In Proceedings of the Eighth International Conference on the World-Wide Web, 1999.
[16] Bharat, K., Broder, A., Henzinger, M., Kumar, P., and Venkatasubramanian, S., "The connectivity server: fast access to linkage information on the Web", Computer Networks and ISDN Systems, 30(1-7), pp. 469-477, 1998.
[17] Lawrence, S., and Giles, C., "Searching the World Wide Web", Science, 280, pp. 98-100, 1998.
[18] Engelbart, Douglas C., "Augmenting Human Intellect: A Conceptual Framework", Summary Report AFOSR-3223, October 1962.
[19] Smith, L.S., and Hurson, A.R., "A Search Engine Selection Methodology", Proceedings of the International Conference on Information Technology: Computers and Communications (ITCC'03), pp. 122-129, April 2003.
[20] Sharma, A.K., Gupta, J.P., and Agarwal, D.P., "Augmented Hypertext Documents suitable for parallel crawlers", Proceedings of WITSA-2003, a National Workshop on Information Technology Services and Applications, New Delhi, Feb 2003.
[21] Asadi, Saeid, and Jamali, Hamid R., "Shifts in Search Engine Development: A Review of Past, Present and Future Trends in Research on Search Engines", Webology, Vol. 1, No. 2, December 2004.
[22] Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry, "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report, Stanford University InfoLab, 1999.
[23] Cho, Junghoo, and Garcia-Molina, Hector, "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th International Conference on Very Large Data Bases, pp. 200-209, 2000.
[24] Francisco-Revilla, L., Shipman, F., Furuta, R., Karadkar, U., and Arora, A., "Managing Change on the Web", In Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 67-76, 2001.
[25] De Bra, P., Houben, G.-J., Kornatzky, Y., and Post, R., "Information retrieval in distributed hypertexts", Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[26] Hersovici, M., Jacovi, M., Maarek, Y., Pelleg, D., Shtalheim, M., and Ur, Sigalit, "The Shark-Search Algorithm - an application: tailored web site mapping", Computer Networks and ISDN Systems, Special Issue on the 7th WWW Conference, Brisbane, Australia, 30(1-7), 1998.
[27] Salton, G., and McGill, M.J., "Introduction to Modern Information Retrieval", Computer Series, McGraw-Hill, New York, NY, 1983.
[28] Pinkerton, B., "Finding what people want: Experiences with the WebCrawler", Proceedings of the First World Wide Web Conference, Geneva, Switzerland, 1994.
[29] Heydon, A., and Najork, M., "Mercator: A scalable, extensible Web crawler", World Wide Web, 2(4), pp. 219-229, 1999.
[30] Wang, Ziyang, "Incremental Web Search: Tracking Changes in the Web", PhD dissertation, 2007.
[31] "MD5, SHA-1, SHA-224, SHA-256, etc.", available at: http://en.wikipedia.org/wiki/MD5
Figure 1: Architecture of Incremental Parallel WebCrawler (Multi Threaded server with URL dispatcher, URL distributor/allocator, ranking module, change detection module, indexer and repository, coordinating client crawlers C-crawler_1, C-crawler_2, …, C-crawler_n that crawl web pages from the WWW)