SIMHAR - Smart Distributed Web Crawler for the Hidden Web Using Sim+Hash and Redis Server
Sawroop Kaur, G. Geetha
ABSTRACT Developing a distributed web crawler poses major engineering challenges, all of which are ultimately associated with scale. To maintain the search engine's corpus at a reasonable state of freshness, the crawler must be distributed over multiple computers. In distributed crawling, crawling agents are tasked with fetching and downloading web pages. The number and the heterogeneous structure of web pages are increasing rapidly, which makes performance a serious challenge for web crawler systems. In this paper, a distributed web crawler for the hidden web is proposed and implemented. It integrates the Scrapy framework with the Redis server. Crawling is split into three stages: adaption, relevant source selection, and underlying content extraction. The crawler accurately detects and submits searchable forms. Duplication detection is based on a hybrid technique that combines the hash maps of Redis with Sim+Hash. The Redis server also acts as a data store for the massive amount of web data, so that the growth of hidden web databases is handled while ensuring scalability.
INDEX TERMS Distributed crawlers, Duplication detection, Hidden web, Web crawler
I. INTRODUCTION
A web crawler is automated software that browses the World Wide Web in an organized manner. By applying distributed computing techniques to web crawling, efficiency and effectiveness are improved in terms of time, cost, load balancing, and search quality. In distributed crawling, multiple agents work together to crawl URLs, which necessitates a more complex approach than simple information curation. Mercator [1], Heritrix [2], Nutch [3], Scrapy, JSpider, HTTrack [4], and YaCy [5] are some of the distributed web crawlers in use. The hidden web is the part of the web that is masked behind HTML forms and is not generally accessed by web crawlers. For this purpose, we need a crawler that can find all webpages with searchable forms and can also fill and submit the forms automatically, without any manual intervention. EFFC [6], adaptive crawlers [7], and FFC [8] are some of the web crawlers that can handle the hidden web. To access the hidden web, there are two approaches, namely virtual integration and web surfacing.

II. LITERATURE REVIEW
The literature review is organized into three parts: the first part reviews distributed web crawlers, followed by focused and hidden web crawlers.
Zhou et al. [9] have designed a distributed vertical crawler using a crawling-template-based periodic strategy. The domain of crawling is internet forums, and performance has been measured in terms of the number of URLs processed. Results have shown that distributed crawling gathers a greater number of URLs than a single vertical crawler. Weizheng Gao et al. [10] have designed a geographically distributed web crawler and tested various crawling strategies, of which the URL-based and extended-anchor-text-based strategies have given favorable performance. Jiankun Yu et al. [11] have presented a cluster-based distributed crawler implemented as a data server. This crawler is shopping-product based, and so is the feature extraction; the web server is presented with processed data. Scalability is provided using a Hadoop platform, and HBase is used to store huge amounts of data. The assumption for load balancing is that all nodes finish their crawling task at the same time. Performance of the crawler is compared with the Nutch crawler: with 8 crawling nodes, between 3500 and 4000 pages are crawled per minute. Feng Ye et al. [12] have implemented a distributed crawler based on Apache Flink. On the cluster, Redis and other databases are deployed to store the web pages that are crawled, and Scrapy is selected as the underlying crawling framework. Duplication detection is employed by combining a Bloom filter with Redis. Performance is measured in terms of crawled pages and execution time; the crawler managed to crawl 20000 pages in seven hours, and its CPU
utilization rate even at the fourth hour is less than 35%, compared to a single crawler. Duplication detection is compared with a Bloom filter, linked list, hashmap, and treemap, with the Bloom filter giving promising results. The number of fetched pages increases to 7000 when the system uses the Mesos/Marathon platform.
UniCrawl, a geographically distributed web crawler worked upon by Do Le Quoc et al. [13], has yielded 5000 new URLs in 50 crawling rounds, with throughput between 10^6 and 10^7 over 6000 seconds. M. E. ElAraby et al. [14] have developed a dynamic web crawler as a service. Each stage of this architecture works as a separate service and deals with its own load, so scalability is also handled per individual stage and the whole system does not need to be scaled. Along with being dynamic, this architecture is customizable and provides standalone services using elastic computing; the system has used the Amazon RDS service. Performance is compared on a fetched-pages-versus-time graph. This crawler can fetch more than 250 pages in less than 400000 seconds; using 5 virtual machines, 300 pages are crawled in 153.04 seconds, and with the same configuration the number of discovered URLs is 8452. This system has also worked on discovering new domains from newly discovered URLs. A comparison is made between the response time of a multithreaded crawler and of virtual machines: for 300 pages, the response time of the multithreaded crawler is 142132.4 and that of virtual machines on cloud computing is 512159.8.
According to Gunawan et al. [15], infinite threads curtail the performance of a web crawler. Their system divides crawling based on the heuristic that large sites are crawled before smaller sites. Results are compared for CPU and memory utilisation; for 2000 threads, CPU utilisation is 70% at 550 Mbps bandwidth. Choosing a suitable approach to divide the web is the main issue in parallel crawlers. Achsan and Wibowo [15] have worked on the politeness property. Bosnjak et al. [16] proposed a continuous and fault-tolerant web crawler called TwitterEcho. This crawler continuously extracts data from Twitter-like communities. Performance is measured in terms of classification accuracy, with 99.4% being the highest classification accuracy for non-Portuguese sites. Suryansh Raj et al. [16] have developed a platform-independent distributed crawler that can handle AJAX-based applications. They also support breadth-first search for complete coverage. Performance is compared for up to 64 active threads crawling a two-page application and a medium-sized application. Hongsheng Xu et al. [17] have implemented a distributed crawler based on Hadoop and P2P. All the files are stored in and shared from the distributed file system. Performance is measured as time to crawl versus nodes.

DG - Distributed general
DF - Distributed focused
CT - Crawling time
DS - Downloading speed
MT - Maximum threads
CPU-U - CPU utilisation
T - Throughput

Table 1 shows the comparison of existing distributed web crawlers based on these performance measures.

HIDDEN WEB CRAWLING: To crawl the data hidden behind web forms, the following steps are performed.

1) AUTOMATED HIDDEN WEB ENTRY POINT DISCOVERY
A deep web site can be discovered in two ways: either using heuristics or using machine learning. Madhavan et al. [18] used heuristics to discover the form tag and other features of forms, including the presence of a number of text boxes. Another way is to use heuristics to discard forms with short inputs, as implemented by Bergholz et al. [19], while Cope et al. [20] and Barbosa et al. [21] have applied machine-learning algorithms to classify forms in order to find entry points to the hidden web.

2) FORM MODELLING
After entry to the deep web, the next step is form modelling, which includes identification of the type of classification. Forms can also be classified based on pre-query or post-query studies. In the post-query case, the response page is a source of classification; the features of each form are also a source of classification.

3) QUERY SELECTION
Kashyap et al. [22] have used the concept of a static hierarchy for query selection; BioNav has been used for the hierarchies. The performance measure is based on the overall cost of the queries, and the aim is to retrieve a greater number of records. Chen et al. [23] have worked on both the content and the structure of the form for queries as well as databases. The quality of the query is measured in terms of difficulty over the database. This model estimates the number of queries required to retrieve the whole content of the hidden web site, and performance is measured in terms of correlation of average precision. Madhavan et al. [24] have shown that the load on the system increases with the number of submissions, as some queries generate duplicate results; therefore, the query selection technique should aim to "minimize the number of queries and maximize the accurate response". Barbosa and Freire [25] proposed a model for unsupervised keyword selection. This model starts with keywords extracted from the form page: first, the most frequent words are calculated, and submission is repeated until maximum results are obtained. Performance is measured in terms of
the effectiveness of the technique with and without using a wrapper. Ntoulas et al. [26] proposed a technique in which form submission is done with user-provided keywords. It also extracts keywords from the response pages, and the keywords with higher informativeness, calculated as their accumulated frequency, are selected.

4) CRAWLING PATH LEARNING
Searchable forms can be reached using a path learning process: a relevant page with the correct response can be generated by following the pages in order. Based on path learning, crawlers can be categorized into blind crawlers, with aimless but perpetual download capacity, and focused crawlers [28], [29], which work on a fabricated path to reach the desired web page. The proposed crawler is based on focused crawlers. The overall performance check of such crawlers is based on their access to hidden websites and measures the finding capacity and accuracy for the desired forms. Hidden web crawling has neither a standard dataset nor a comparison framework and testing environment to compare the features of the techniques, so the reported research is compared using the different techniques used in hidden web crawling.

Focused crawling aims to retrieve the maximum number of relevant pages. Focused web crawlers have the same basic components as a general web crawler:

1. Fetcher or downloader, which fetches the web page and retrieves its contents.
2. Frontier, which stores the URLs of unvisited websites along with the visited ones, for extraction of further information.

In addition to these components, a focused web crawler has a topic-specific crawling model, a relevance estimation module, and a ranking module. A focused crawler first collects some URLs as seed sites; from these URLs, the crawler begins its crawling process and gives results in the form of crawled webpages.
From the above literature, we conclude that distributed crawlers for the hidden web are limited, and they face performance issues in terms of scalability, duplication, and the inability to support frequent changes in the underlying technology of web pages. To address these challenges, a focused distributed web crawler is developed that can handle duplication and scalability. Duplication detection is based on a hybrid technique using the hash maps of Redis and Sim+Hash. The Redis server also acts as a data store for the massive amount of web data, so that the growth of hidden web databases is handled while ensuring scalability.
Hidden web crawling is equally challenging. The performance of a crawler is highly influenced by its architecture and crawling techniques. From the literature review, it is found that distribution can be implemented to overcome drawbacks such as scalability, duplication, and the inability to support frequent changes in the underlying technology of web pages. Crawlers working in a methodical approach can effectively cover specific topics. As per the literature, there is no focused distributed web crawler designed to uncover hidden data.

III. THE PROPOSED ARCHITECTURE
The proposed architecture works in three stages.

1. URL adaption and classification: the frontier is initialized in this phase, followed by parameter learning, ranking and domain classification.

The path of the URL is learned to reach the exact location of the form. The special symbol related to the path is the forward slash (/), and the path of the URL is found after the hostname. Anchors are helpful for internal navigation within a URL, and we need to find the internal links as well.

A. WEIGHT CALCULATION OF TERMS
Based on the feature vector construction, we have to compute the weight of a term corresponding to its occurrence in the URL (U), the anchor (A), the text around the anchor (T) and the path (P). The term frequency of term Ti in U, A, T and P is defined as:

tf_i = α×tf_i(U) + β×tf_i(A) + γ×tf_i(T) + δ×tf_i(P)    (1)

where α, β, γ, and δ are the weight coefficients, and Ig is the information gain of the terms.
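A minimal sketch of Eq. (1) in Python is given below. The term_frequency helper and the coefficient values are illustrative assumptions; the paper does not report the actual coefficients.

```python
import re

def term_frequency(term, text):
    """Fraction of tokens in `text` equal to `term` (a simple tf measure)."""
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0

def weighted_tf(term, url, anchor, anchor_text, path,
                alpha=0.4, beta=0.3, gamma=0.2, delta=0.1):
    """Combine a term's frequency in URL, anchor, surrounding text and path
    as in Eq. (1); the coefficients here are assumed example values."""
    return (alpha * term_frequency(term, url) +
            beta  * term_frequency(term, anchor) +
            gamma * term_frequency(term, anchor_text) +
            delta * term_frequency(term, path))

# Example: weight of the term "flights" for a hypothetical candidate link
w = weighted_tf("flights", "http://example.com/flights/search",
                "book flights", "find cheap flight tickets online", "/flights/search")
```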
8. Stop if the crawler has reached the threshold of 0.8, i.e., 80 new URLs, and 0.1, i.e., 100 new forms, at depth one. Follow steps 1-8 for depth 2 and depth 3.

C. RANKING
The aim of ranking in hidden web crawling is to extract the top n documents for the queries, at the least possible cost. We have adopted the ranking formula from [31], but our reward function (€) is based on the number of out-links, site similarity and term weighting. Let SF be the frequency of out-links:

SF = Σ I_i    (4)

where I_i = 0 if the site has not appeared and I_i = 1 if it has appeared.

€ = w_ij + S + SF    (5)

r_j = (1 - w)·δ_j + w·€ / c_j    (6)

Here w is the weight balancing € and c_j, and δ_j is the number of new documents. The computation of r_j shows the similarity between € and the returned documents: if the value of € is closer to 0, the returned value is more similar to an already seen document, i.e., the new URL is similar to an already discovered URL. c_j is a function of network communication and bandwidth consumption.

D. DOMAIN CLASSIFICATION
Domain classification is based on the topical relevance of the site and its home page. As a new URL is received, its homepage content is parsed and a feature vector is constructed. The resulting vector is fed to the classifier to check relevancy. The crawler gets the URLs and a request is sent to a server to fetch the page. The crawler first checks for the presence of a search interface. The forms are of two types, defined as follows:

• Searchable form: "A web form is called a searchable form if it is capable of submitting a query to an online database which, in turn, returns the results of the query."

• Non-searchable form: "Forms such as login, registration and mailing-list subscription forms are called non-searchable forms. These forms do not represent database queries."

Rule 1: If the crawler does not find any <form> tag, consider this a non-searchable form.

Rule 2: If the crawler finds the <form> tag, it extracts the attribute type. If the attribute type is not in the repository, the page is non-searchable.

Rule 3: If the crawler finds the <form> tag and the extracted attribute type matches the repository, but the number of attributes is less than 3, consider the page non-searchable.

Rule 4: If the number of attributes is greater than 3 but the submit button is not found, consider the page non-searchable.

Rule 5: If there exists a <form> tag, the attributes are similar to the repository, and the submit button is present, but the button marker is not present, consider the page non-searchable.

Rule 6: If there exists a <form> tag, the attributes are similar to the repository, and both the submit button and the button marker are present, it is a searchable form.

Rule 7: If there exists a <form> tag but the crawler finds a login form, it is non-searchable.

Rule 8: If there exists a <form> tag but the crawler finds a registration form, consider the page non-searchable.

Rule 9: If there exists a <form> tag but the crawler finds a subscribe form, consider the page non-searchable.

Rule 10: If there exists a <form> tag but the crawler finds a mailing-list subscription form, consider the page non-searchable.

Figure 2 shows the diagrammatic view of the above-mentioned rules.
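The sketch below shows one way the rules above could be applied to a parsed page. The repository of attribute types, the simplified treatment of the submit button and button marker, and the keyword hints for login/registration/subscription forms are our own assumptions, not the paper's implementation.

```python
from bs4 import BeautifulSoup

# Assumed repository of known searchable-form control types (illustrative).
REPOSITORY_TYPES = {"text", "search", "submit", "radio", "checkbox", "select"}
NON_SEARCHABLE_HINTS = ("login", "register", "registration", "subscribe", "mailing")

def is_searchable(html):
    """Approximate Rules 1-10 to decide if a page holds a searchable form."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:                                             # Rule 1
        return False
    text = form.get_text(" ", strip=True).lower()
    attrs = " ".join(str(v) for v in form.attrs.values()).lower()
    if any(h in text or h in attrs for h in NON_SEARCHABLE_HINTS):  # Rules 7-10
        return False
    controls = form.find_all(["input", "select", "textarea"])
    types = {c.get("type", c.name).lower() for c in controls}
    if not types & REPOSITORY_TYPES:                             # Rule 2
        return False
    if len(controls) < 3:                                        # Rule 3
        return False
    has_submit = any(c.get("type", "").lower() == "submit" for c in controls) \
                 or form.find("button") is not None              # Rules 4-6 (simplified)
    return has_submit
```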
E. FORM STRUCTURE EXTRACTION
After a webpage with a form is found, the content of the form is extracted. Search forms have controls that a human can easily fill and submit; if a crawler has to fill the forms automatically, it must have a set of resources for automatically filling and submitting the forms with suitable values. A task-specific database called the repository is initialized at launch. This database contains the set of values for filling the forms, created by parsing the form as shown in Table II, and a form element table is created with the control element type, label and domain values. The crawler adaptively learns the filling values associated with forms. When the first run of the crawler is completed, the parsed values are analyzed to collect data. Form submission is of two types, POST form submission and GET form submission, and this crawler works on both types. After the form is submitted, the crawler gets the response status, which is either a valid page or a no-page-found code. The following steps are performed during form parsing:

• Using the Requests library of Python, an HTTP GET request is sent to the URL of a webpage.
• The response to the HTTP request is the HTML content of the webpage.
• Data is fetched and parsed using Beautiful Soup.
• HTML tags and their attributes are analysed.
• Data is output to a CSV file.

F. FORM AND RESPONSE ANALYSIS
Forms have multiple control elements, which can be of any of the following types:

• Text: this area of a form can be edited with multiple lines of words.
• Input: this editable area has the following attribute types: text, submit, checkbox and radio button.
• Select: select offers options such as a drop-down list box or a multi-choice list box.

TABLE II
CONTENTS OF REPOSITORY

Control Element (Visible Form Fields) | Label | Domain | Type of Domain | Size | Status
Submit | Search | Submit | Infinite | More than 3 kb | VR
Radio | Flight trip | Round trip, One-way trip | Bounded | More than 3 kb | VR
Select | From, to | Name of place (e.g., Delhi to America) | Bounded | More than 3 kb | VR

Two heuristics are implemented based on the visible fields: if the number of visible fields is one or two, forms are classified using query probing, and label extraction is used otherwise. As explained in [32], form submission involves problems such as 404 error pages, duplicate information, and cases where all information is retrieved in a single submission while otherwise multiple submissions are required. The crawler has also faced these problems.
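As a rough illustration of the fetch-and-parse flow described in subsections E and F, the sketch below sends an HTTP GET request with the Requests library, enumerates the form controls with Beautiful Soup, and writes them to a CSV file. The CSV column layout and the example URL are assumptions made for illustration only.

```python
import csv
import requests
from bs4 import BeautifulSoup

def extract_form_controls(url, out_csv="form_controls.csv"):
    """Fetch a page, parse its forms with Beautiful Soup and dump the
    control elements (type, name, submission method) into a CSV file."""
    response = requests.get(url, timeout=10)          # HTTP GET request
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for form in soup.find_all("form"):
        method = form.get("method", "get").lower()    # GET or POST submission
        for control in form.find_all(["input", "select", "textarea"]):
            rows.append({
                "form_method": method,
                "control": control.name,
                "type": control.get("type", ""),
                "name": control.get("name", ""),
            })
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["form_method", "control", "type", "name"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Hypothetical usage:
# controls = extract_form_controls("http://example.com/search")
```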
I. STOPPING CRITERIA
Exhaustive crawling is a waste of resources, so this system implements the following stopping criteria (a minimal sketch of this logic is given after the list).

• Maximum depth of crawl: the crawler stops following links when a depth of three is reached; it is shown in [23] that most deep web pages are found up to depth 3. At each depth, the maximum number of pages to be crawled is 100.
• At any depth, the maximum number of forms to be found is 100 or fewer.
• If the crawler is at depth 1, has crawled 50 pages, and no searchable form has been found, it moves directly to the next depth. The same rule is followed at depth 2: if at depth 2, 50 pages are crawled and no searchable form is found, the crawler fetches a new link from the URL queue.
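The stopping rules can be summarised as the sketch below; the constant and function names are our own phrasing of the criteria above, not code from the system.

```python
MAX_DEPTH = 3            # crawl no deeper than depth 3
MAX_PAGES_PER_DEPTH = 100
MAX_FORMS_PER_DEPTH = 100
SKIP_AHEAD_PAGES = 50    # move on early if no searchable form is found by then

def should_stop(depth, pages_crawled, forms_found):
    """True when crawling at this depth should stop."""
    if depth > MAX_DEPTH:
        return True
    if pages_crawled >= MAX_PAGES_PER_DEPTH:
        return True
    if forms_found >= MAX_FORMS_PER_DEPTH:
        return True
    return False

def should_skip_to_next_depth(depth, pages_crawled, forms_found):
    """Depth-1 and depth-2 rule: 50 pages crawled with no searchable form."""
    return depth < MAX_DEPTH and pages_crawled >= SKIP_AHEAD_PAGES and forms_found == 0
```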
Web pages are popped out of the frontier in a first-in-first-out way; we have considered this as our baseline assumption. Crawling in breadth-first fashion is implemented as URL- and server-based, and it is shown in [34] that this yields promising results. The following steps are performed in job scheduling.

Step 1. To start crawling, Scrapy sends a schedule-request message to the crawler.

Step 2. As the crawler receives the request, it starts crawling. From the Redis URL queue, a URL is selected and sent as a request to the scheduler.

Step 3. The scheduler receives the URL request, forwards it to Redis (the request queue), and then contacts Scrapy again (request scheduled).

Steps 4, 5. Now the associated webpage is to be downloaded: the request is popped from the top of the request queue, and the downloader, on receiving the request, downloads the requested page.

Steps 6, 7. After getting the contents of a page, the downloader submits the page to the crawler.

Steps 8, 9. The crawler parses the webpage, collects the new URLs, and sends the new list of URLs to the Redis pipeline; the Redis pipeline sends these URLs to the Redis queue. Another advantage of

IV. CONFIGURATION
The system hardware environment includes an Intel® Core™ C5-7200 CPU @ 2.50 GHz (2.70 GHz), 12.0 GB of installed RAM, and Redis 3.0.509; the crawler is implemented in Python. The internet speed during the experiment was 50-100 Mbps.

V. EVALUATION
The crawler first has to check whether the page belongs to the hidden web or not, by following the rules in Figure 2. After the seed-database URLs are checked for the <form> tag, the crawler has to pull the contents of the webpage using the URL. The Requests library helps make use of HTTP within the Python program, and Beautiful Soup can extract any type of data from a webpage. After the HTML markup is removed, the page is saved for further processing. Beautiful Soup is combined with urllib3 to work with web pages; another way is to download a copy of the webpage and then use it locally. Beautiful Soup has a feature called "prettify", with which the unnecessary tags can be dropped. We have selected 6 domains from the dataset; this dataset contains more than 260000 associated URLs.
Initially, the DMOZ dataset is used. The performance of the classifier is measured using a confusion matrix: rows of the confusion matrix denote the actual class, while columns indicate the classes predicted by the SVM and kNN classifiers. We have computed the accuracy for each class, and the average over the classes denotes the performance of the classifier. The performance metrics are precision, recall and F1. Precision is the portion of classified webpages that are relevant to the class; it indicates how correctly the system rejects web pages that are not relevant. Recall is how correctly the classifier can find relevant documents. In Redis, multiple jobs are separated using unique Redis keys, so jobs are not mixed.
Table III shows the description of status codes; web forms cannot be submitted for these codes. Table IV shows the values of precision, recall and accuracy, using a support vector machine, for the status codes mentioned in Table III; this shows how accurately the system detects the forms that cannot be submitted. Table V shows the confusion matrix for correctly submitted forms. Table VI and Table VII show the
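As a brief illustration of how the per-class metrics described above can be derived from such a confusion matrix (a sketch with made-up numbers, not the paper's reported results):

```python
import numpy as np

def per_class_metrics(cm):
    """Compute precision, recall and F1 for each class of a confusion matrix
    whose rows are actual classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # column sums: all pages predicted as the class
    recall = tp / cm.sum(axis=1)      # row sums: all pages actually in the class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical 2x2 matrix (searchable vs non-searchable forms):
p, r, f = per_class_metrics([[90, 10],
                             [5, 95]])
```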