

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number

SIMHAR - Smart Distributed Web Crawler for the Hidden Web Using SIM+Hash and Redis Server

Sawroop Kaur¹ and G. Geetha²
¹Department of Research and Development, Lovely Professional University, Phagwara, Punjab, India
²Dean and Associate Professor, Department of Research and Development, Lovely Professional University, Phagwara, Punjab, India

Corresponding author: G. Geetha (e-mail: gitaskumar@yahoo.in).

ABSTRACT Developing a distributed web crawler poses major engineering challenges, all of which are ultimately associated with scale. To maintain the corpus of a search engine at a reasonable state of freshness, the crawler must be distributed over multiple computers. In distributed crawling, crawling agents are given the task of fetching and downloading web pages. The number and heterogeneous structure of web pages are increasing rapidly, which makes performance a serious challenge for web crawler systems. In this paper, a distributed web crawler for the hidden web is proposed and implemented. It integrates the Scrapy framework with the Redis server. Crawling is split into three stages: adaptation, relevant source selection, and underlying content extraction. The crawler accurately detects and submits searchable forms. Duplication detection is based on a hybrid technique that combines Redis hash maps with Sim+Hash. The Redis server also acts as a data store for the massive amount of web data, so that the growth of hidden web databases is handled and scalability is ensured.

INDEX TERMS Distributed crawlers, Duplication detection, Hidden web, Web crawler

I. INTRODUCTION
A web crawler is an automated software system that browses the World Wide Web in an organized manner. By applying distributed computing techniques to web crawling, efficiency and effectiveness are improved in terms of time, cost, load balancing, and search quality. In distributed crawling, multiple agents work together to crawl URLs, which necessitates a more complex approach than simple information curation. Mercator [1], Heritrix [2], Nutch [3], Scrapy, JSpider, HTTrack [4], and YaCy [5] are some of the distributed web crawlers in use. The hidden web is the part of the web that is masked behind HTML forms and is not generally accessed by web crawlers. For this purpose, we need a crawler that can find webpages containing searchable forms and can also fill and submit those forms automatically, without any manual intervention. EFFC [6], adaptive crawlers [7], and FFC [8] are some of the web crawlers that can handle the hidden web. There are two approaches to accessing the hidden web, namely virtual integration and web surfacing.

II. LITERATURE REVIEW
The literature review is organized into three parts: the first part reviews distributed web crawlers, followed by hidden web crawlers and focused crawlers.

Zhou et al. [9] have designed a distributed vertical crawler using a crawling-period-based strategy. The domain of crawling is internet forums, and performance has been measured in terms of the number of URLs processed. Results have shown that distributed crawling gathered a greater number of URLs than a single vertical crawler. Weizheng Gao et al. [10] have designed a geographically distributed web crawler and tested various crawling strategies, out of which the URL-based and extended anchor-text-based strategies gave the most favorable performance. Jiankun Yu et al. [11] have presented a cluster-based distributed crawler implemented as a data server. This crawler is oriented toward shopping products, as is its feature extraction, and the web server is presented with processed data. Scalability is provided using a Hadoop platform, and HBase is used to store the huge volume of data. Load balancing is assumed to be achieved when all nodes finish their crawling tasks at the same time. The performance of the crawler is compared with the Nutch crawler: with 8 crawling nodes, between 3500 and 4000 pages are crawled per minute. Feng Ye et al. [12] have implemented a distributed crawler based on Apache Flink. On the cluster, Redis and other databases are deployed to store the crawled web pages, and Scrapy is selected as the underlying crawling framework. Duplication detection is performed by combining a Bloom filter with Redis. Performance is measured in terms of crawled pages and execution time; the crawler managed to crawl 20000 pages in seven hours.


The CPU utilization rate, even at the fourth hour, is less than 35%, compared to a single crawler. Their duplication detection is compared against a Bloom filter, linked list, hash map and tree map, with the Bloom filter giving the most promising results. The number of fetched pages increases to 7000 when the system uses the Mesos/Marathon platform.

UniCrawl, a geographically distributed web crawler worked on by Do Le Quoc et al. [13], yielded 5000 new URLs in 50 crawling rounds, with a throughput between 10^6 and 10^7 for 6000 seconds. M. E. ElAraby et al. [14] have developed a dynamic web crawler as a service. Each stage of this architecture works as a separate service and deals with its own load, so scalability is also handled per stage and the whole system does not need to be scaled. Along with being dynamic, this architecture is customizable and provides standalone services using elastic computing; the system uses the Amazon RDS service. Performance is compared using a fetched-pages versus time graph. This crawler can fetch more than 250 pages in less than 400000 seconds, and using 5 virtual machines, 300 pages are crawled in 153.04 seconds. With the same configuration, the number of discovered URLs is 8452. This system has also worked on discovering new domains from newly discovered URLs. A comparison is made between the response time of a multithreaded crawler and that of virtual machines on cloud computing: for 300 pages, the response time of the multithreaded crawler is 142132.4, while for the virtual machines it is 512159.8.

According to Gunawan et al. [15], an unbounded number of threads curtails the performance of a web crawler. Their system divides crawling based on the heuristic that large sites are crawled before smaller sites. Results are compared for CPU and memory utilisation; for 2000 threads, CPU utilisation is 70% at 550 Mbps bandwidth. Choosing a suitable approach to divide the Web is the main issue in parallel crawlers. Achsan and Wibowo [15] have worked on the politeness property. Bosnjak et al. [16] proposed a continuous and fault-tolerant web crawler called TwitterEcho, which continuously extracts data from Twitter-like communities. Performance is measured in terms of classification accuracy, with 99.4% being the highest classification accuracy, obtained for non-Portuguese sites. Suryansh Raj et al. [16] have developed a platform-independent distributed crawler that can handle AJAX-based applications; they also support breadth-first search for complete coverage. Performance is compared for up to 64 active threads crawling a two-page application and a medium-sized application. Hongsheng Xu et al. [17] have implemented a distributed crawler based on Hadoop and P2P. All the files are stored in and shared from the distributed file system, and performance is measured as time to crawl versus number of nodes.

The following abbreviations are used in Table 1: DG - distributed general, DF - distributed focused, CT - crawling time, DS - downloading speed, MT - maximum threads, CPU-U - CPU utilisation, T - throughput. Table 1 shows the comparison of existing distributed web crawlers based on these performance measures.

A. HIDDEN WEB CRAWLING
To crawl the data hidden behind web forms, the following steps are performed.

1) AUTOMATED HIDDEN WEB ENTRY POINT DISCOVERY
Deep web sites can be discovered in two ways: using either heuristics or machine learning. Madhavan et al. [18] used heuristics to discover the form tag and other features of forms, including the presence of a number of text boxes. Another way is to use heuristics to discard forms with short inputs, as implemented by Bergholz et al. [19]. Cope et al. [20] and Barbosa et al. [21] have applied machine-learning algorithms to classify forms in order to find entry points to the hidden web.

2) FORM MODELLING
After entry to the deep web, the next step is form modelling, which includes identification of the form type. Forms can also be classified based on pre-query or post-query studies. In the post-query case, the response page is a source of classification; the features of each form are also a source of classification.

3) QUERY SELECTION
Kashyap et al. [22] have used the concept of a static hierarchy for query selection; BioNav has been used for the hierarchies. The performance measure is based on the overall cost of the queries, with the aim of retrieving a greater number of records. Chen et al. [23] have worked on both the content and the structure of the form for queries as well as databases. The quality of a query is measured in terms of its difficulty over the database, and the model estimates the number of queries required to retrieve the whole content of a hidden web site. Performance is measured in terms of the correlation of average precision. Madhavan et al. [24] have shown that the load on the system increases with the number of submissions, since some queries generate duplicate results. The query selection technique should therefore aim to minimize the number of queries while maximizing the accurate responses. Barbosa and Freire [25] proposed a model for unsupervised keyword selection. This model starts with keywords extracted from the form page; the most frequent words are calculated first, and submission is repeated until maximum results are obtained. Performance is measured in terms of the effectiveness of the technique with and without using a wrapper.


Reference | DG | DF | DV | Max no. of nodes | CT | Pages/URL | DS | MT | Classification | CPU-U | T
[11] | - | - | ✓ | 3 | - | 26136 | 361.50 | - | - | - | -
[12] | - | - | - | - | 60 secs | 4000 | - | - | - | - | -
[13] | ✓ | - | - | - | 7 hrs | 7000 | - | - | - | 35% | -
[14] | ✓ | - | - | - | 50 rounds | 6000 | - | - | - | - | 10^6 - 10^7
[15] | ✓ | - | - | 5 VM | 153.04 secs | 8452 | - | - | - | - | -
[16] | - | ✓ | - | - | 4 hours | - | - | 2000 at 550 Mbps | - | 70 | -
[17] | - | ✓ | ✓ | - | - | - | - | - | 99.4 | - | -

TABLE 1. Comparison of reported distributed web crawlers.

Ntoulas et al. [26] proposed a technique in which submission is done with user-provided keywords; it also extracts keywords from the response pages. The keywords with higher informativeness, calculated as their accumulated frequency, are selected.

4) CRAWLING PATH LEARNING
Searchable forms can be reached using a path-learning process: a relevant page with the correct response can be generated by following pages in a particular order. Based on path learning, crawlers can be categorized as blind crawlers, which have an aimless but perpetual download capacity, and focused crawlers [28], [29], which work on a fabricated path to reach the desired web page. The proposed crawler is based on focused crawling. The overall performance check of such crawlers is based on their access to hidden websites and on measuring the finding capacity and accuracy for the desired forms. Hidden web crawling has neither a standard dataset nor a comparison framework and testing environment for comparing the features of different techniques, so the reported research is compared using the different techniques used in hidden web crawling.

B. FOCUSED CRAWLERS
Focused web crawlers play an important role in creating and maintaining subject-specific web collections. Applications of focused crawlers include search engines, digital libraries, specialized information extraction and text classification, producing high-quality result pages while minimizing time, space and network bandwidth. The goal of a focused crawler is to retrieve the maximum number of relevant pages. Focused web crawlers, like general web crawlers, have the same core components:

1. A fetcher or downloader, which fetches the web page and retrieves its contents.
2. A frontier, which stores the URLs of unvisited websites along with the visited ones, for extraction of further information.

In addition to these components, a focused web crawler has a topic-specific crawling model and a relevance estimation and ranking module. A focused crawler first collects some URLs as seed sites; from these URLs, the crawler begins its crawling process and gives results in the form of crawled webpages.

From the above literature, we conclude that distributed crawlers for the hidden web are limited, and they face performance issues in terms of scalability and duplication and are unable to support frequent changes in the underlying technology of web pages. To address these challenges, a focused distributed web crawler is developed that can handle duplication and scalability. Duplication detection is based on a hybrid technique using Redis hash maps and Sim+Hash. The Redis server also acts as a data store for the massive amount of web data, so that the growth of hidden web databases is handled and scalability is ensured.

A plethora of information is available on methods to retrieve information; however, there is no refined information about the working and reliability of distributed web crawlers and their role in bringing hidden web data to light. Some emerging research in this area uses MapReduce to compute term frequency and similar measures. As opposed to general web crawling, hidden web crawling requires a complex approach to parse, process and extract information from hidden websites.

Similarly, the process of distribution in hidden web crawling is equally challenging. The performance of a crawler is highly influenced by the architecture and techniques of crawling. From the literature review, it is found that distribution can be implemented to overcome drawbacks such as scalability, duplication, and the inability to support frequent changes in the underlying technology of web pages. Crawlers working in a methodical approach can effectively cover specific topics. As per the literature, there is no focused distributed web crawler designed to uncover hidden data.

III. THE PROPOSED ARCHITECTURE
The proposed architecture works in three stages:

1. URL adaptation and classification: the frontier is initialized in this phase, followed by parameter learning, ranking and domain classification.
2. Relevant source selection: when the frontier encounters a URL, all the links are extracted into the link frontier and the fetched-link frontier.
3. Underlying content extraction: in the third stage, the form structure is extracted to fill and submit the forms.

This system implements the frontier as a queue from which URLs are taken out for further processing. The frontier starts from the seed URLs. We have implemented three queues as frontiers: the frontier for seed URLs consists of URLs from the directory; the frontier for links consists of URLs extracted from the seed URLs; and the frontier for fetched links consists of URLs extracted from those links. The frontier depletes very easily, so when the frontier for seed URLs runs short of URLs, the frontier for links is used. A webpage can have multiple hyperlinks, but not all of them are relevant. The aim of the web crawler is to fetch the maximum number of deep websites while minimizing the visited URLs. Figure 1 shows the three phases of the crawler; all the stages are interconnected with each other. As a URL is extracted from the frontier, the next step is pre-processing of the URL.

A. PRE-PROCESSING OF URLS
The baseline components of the URL are extracted (host, extension, documents, path, etc.). From all the components, the URL, path, anchor and text around the anchor are fed to the feature vector. The system uses the Python NLTK library for stemming, stop-word removal and tokenization, and the segmented words are then fed to the feature vector.

The feature space for a hidden website is defined as:
FD = [URL, anchor, text around anchor]

The feature space for links of a hidden website is defined as:
FL = [path, anchor, text]

The path of the URL is learned to reach the exact location of the form; the special symbol related to the path is the forward slash (/), and the path of the URL is found after the hostname. Anchors are helpful for internal navigation within a URL, and internal links need to be found as well.
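As an illustration of this pre-processing step, the sketch below extracts the baseline URL components with urllib and builds the FD and FL term lists with NLTK stemming, stop-word removal and tokenization. The function names, parameters and token-cleaning choices are illustrative assumptions rather than the authors' exact implementation.

```python
from urllib.parse import urlparse

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer model
nltk.download("stopwords", quiet=True)  # English stop-word list

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))


def _terms(text):
    """Tokenize, drop stop words and non-alphabetic tokens, then stem."""
    return [stemmer.stem(t) for t in word_tokenize(text.lower())
            if t.isalpha() and t not in stop_words]


def url_features(url, anchor_text="", surrounding_text=""):
    """Build the FD = [URL, anchor, text] and FL = [path, anchor, text] term lists."""
    parts = urlparse(url)  # host, path, query and other baseline components
    host_terms = _terms(parts.netloc.replace(".", " "))
    path_terms = _terms(parts.path.replace("/", " ").replace("-", " "))
    anchor_terms = _terms(anchor_text)
    text_terms = _terms(surrounding_text)
    return {
        "FD": host_terms + path_terms + anchor_terms + text_terms,
        "FL": path_terms + anchor_terms + text_terms,
    }


# Hypothetical usage with an invented URL and anchor text:
print(url_features("https://www.example-flights.com/book-flight/search",
                   anchor_text="cheap flights",
                   surrounding_text="search flights to Delhi"))
```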
A. WEIGHT CALCULATION OF TERMS
Based on the feature vector construction, we compute the weight of a term corresponding to its occurrence in the URL (U), anchor (A), text around the anchor (T) and path (P). The combined term frequency of term Ti over U, A, T and P is defined as:

tf_i = α·tf_i(U) + β·tf_i(A) + γ·tf_i(T) + δ·tf_i(P)    (1)

where α, β, γ and δ are the weight coefficients. With Ig denoting the information gain of a term, the weight of term i in document j is:

w_ij = (tf_i × idf_j × Ig) / sqrt( Σ_N (tf_i × idf_j)² )    (2)

It is shown in [12] that tf-idf alone yields an inappropriate distribution of the feature vector. In their approach, this information is combined with a segmentation of the page into four major sections; in our approach, the weights are based on URLs and their associated terms. After term weighting, similarity is computed. The similarity (S) between an already discovered URL and a newly discovered URL, which is required in the ranking step, is computed as:

S = sim(U, Unew) + sim(A, Anew) + sim(T, Tnew)    (3)

Similarity has a different meaning at each step of web crawling. The crawler has to find similar URLs so that it can prevent retrieval of similar data; it also needs to find similar content for the top-k queries, as well as text with semantic similarity.

After pre-processing, the system has a list (k) of more than 50k keywords. Using the similarity model (V), the system reads the reference file. In the next step, the elements are removed from list (k) one by one and the similarity is computed between the two lists. For example, "flight" is a word in list (k) and it has a close match in (V); if the cosine similarity is 1, an exact match has been found. Close-match results are used for queries during repository generation for form submission.
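The following sketch shows how Eqs. (1) and (3) can be computed in code. The weight coefficients α, β, γ, δ are placeholders (the paper does not report their values), and the idf and information-gain factors of Eq. (2) are omitted for brevity; the cosine measure is the same one used to decide when a keyword in list (k) exactly matches an entry in (V).

```python
import math
from collections import Counter

# Assumed weight coefficients for URL, anchor, text and path occurrences (Eq. 1).
ALPHA, BETA, GAMMA, DELTA = 0.4, 0.3, 0.2, 0.1


def weighted_tf(term, url_terms, anchor_terms, text_terms, path_terms):
    """Eq. (1): combined term frequency of one term over U, A, T and P."""
    return (ALPHA * url_terms.count(term) + BETA * anchor_terms.count(term)
            + GAMMA * text_terms.count(term) + DELTA * path_terms.count(term))


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity of two term lists; 1.0 indicates an exact match."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def url_similarity(old, new):
    """Eq. (3): S = sim(U, Unew) + sim(A, Anew) + sim(T, Tnew)."""
    return (cosine_similarity(old["U"], new["U"])
            + cosine_similarity(old["A"], new["A"])
            + cosine_similarity(old["T"], new["T"]))
```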
The crawler removes duplicate URLs from the frontier using Redis hash maps. Websites have complex relationships with each other, and the same URL can be found on multiple websites; downloading the same URL multiple times is a waste of resources. The Redis database therefore includes a de-duplication set: a unique fingerprint is calculated for each request first, the fingerprints are put in this set, and all repeated requests are removed here. Simhash [30] is combined with Redis to further improve the results.
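A minimal sketch of this hybrid de-duplication step is given below: a Simhash-style fingerprint is computed over the page tokens, a hash is computed for the request URL, and both are recorded in Redis sets so that repeated requests can be dropped. The key names are assumptions, the Simhash routine is a simplified stand-in for [30], and only exact fingerprint matches are checked here (near-duplicate detection would additionally compare Hamming distances).

```python
import hashlib

import redis  # redis-py client; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379, db=0)


def simhash(tokens, bits=64):
    """Simplified 64-bit SimHash fingerprint over a list of tokens."""
    v = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)


def is_duplicate(url, page_tokens):
    """Return True if the URL or page content was already seen; record it otherwise."""
    url_fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
    url_seen = r.sadd("simhar:url_fingerprints", url_fp) == 0        # sadd returns 0 if already present
    content_seen = r.sadd("simhar:content_fingerprints", simhash(page_tokens)) == 0
    return url_seen or content_seen
```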


FIGURE 1. Architecture of SIMHAR as a single-entity focused crawler.

B. LEARNING
Feature construction is explained in Section A. The crawler is adaptive in nature: results from the first run are used in successive runs. The following steps are performed in the learning algorithm.

1. When a new website (X) is encountered, extract [U, A, T].
2. For each URL, order the frontier using the similarity model with respect to [U, A, T].
3. Extract the links from X.
4. Save the extracted links in the link queue. The link queue is ordered using the similarity model with respect to [P, A, T].
5. Check for searchable forms by following the rejection rules shown in Figure 2.
6. If the form is searchable, extract the path, anchor and text.
7. Update the information in the parameter learning module in stage 1 and the link ranking in stage 2, so that the new features are reflected in these two modules.

8. Stop if the crawler has reached the threshold of 0.8 (i.e., 80 new URLs) and 0.1 (i.e., 100 new forms) at depth one. Follow steps 1-8 for depth 2 and depth 3.

C. RANKING
The aim of ranking in hidden web crawling is to extract the top n documents for the queries, at the least expected cost. We have adopted the ranking formula from [31], but our reward function (€) is based on the number of out-links, site similarity and term weighting. Let SF be the frequency of out-links:

SF = Σ I_i    (4)

where I_i = 0 if the site has not appeared and I_i = 1 if it has appeared.

€ = w_ij + S + SF    (5)

r_j = (1 - w)·δ_j + w·€ / c_j    (6)

Here w is the weight balancing € and c_j, and δ_j is the number of new documents. The computation of r_j reflects the similarity between € and the returned documents: if the value of € is closer to 0, the returned value is more similar to an already seen document, i.e., the new URL is similar to an already discovered URL. c_j is a function of network communication and bandwidth consumption.
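A direct transcription of Eqs. (4)-(6) is sketched below; the balancing weight w = 0.5 is an assumed default, since the paper does not report its value.

```python
def reward(w_ij, site_similarity, outlink_flags):
    """Eqs. (4)-(5): SF sums the indicators I_i over the out-links, and the
    reward combines the term weight w_ij, the site similarity S and SF."""
    sf = sum(outlink_flags)  # Eq. (4): each I_i is 1 if the site has appeared, else 0
    return w_ij + site_similarity + sf


def rank_score(reward_value, new_documents, cost, w=0.5):
    """Eq. (6): r_j = (1 - w) * delta_j + w * reward / c_j."""
    return (1 - w) * new_documents + w * reward_value / cost
```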
D. DOMAIN CLASSIFICATION
Domain classification is based on the topical relevance of the site and its home page. When a new URL is received, its homepage content is parsed and a feature vector is constructed; the resulting vector is fed to the classifier to check relevancy. The crawler gets the URLs, and a request is sent to the server to fetch the page. The crawler first checks for the presence of a search interface. Forms are of two types, defined as follows:

• Searchable form: a web form is called a searchable form if it is capable of submitting a query to an online database which, in turn, returns the results of the query.
• Non-searchable form: forms such as login, registration and mailing-list subscription forms are called non-searchable forms. These forms do not represent database queries.

A crawler encounters many types of web pages, but not every web page is searchable, so the following rules are designed to help the crawler decide whether an encountered form is searchable or not. After these rules are applied, the crawler has a set of URLs that contain a <form> tag as well as the property of being searchable. When provided with suitable values, these forms will retrieve the data from the associated database.

Rule 1: If the crawler does not find any <form> tag, consider the page non-searchable.
Rule 2: If the crawler finds a <form> tag, extract the attribute type. If the attribute type is not in the repository, the page is non-searchable.
Rule 3: If the crawler finds a <form> tag and the extracted attribute type matches the repository, but the number of attributes is less than 3, the page is non-searchable.
Rule 4: If the number of attributes is greater than 3 but no submit button is found, the page is non-searchable.
Rule 5: If a <form> tag exists, the attributes are similar to the repository, and a submit button is present, but the button marker is not present, the page is non-searchable.
Rule 6: If a <form> tag exists, the attributes are similar to the repository, and both the submit button and the button marker are present, it is a searchable form.
Rule 7: If a <form> tag exists but the crawler finds a login form, the page is non-searchable.
Rule 8: If a <form> tag exists but the crawler finds a registration form, the page is non-searchable.
Rule 9: If a <form> tag exists but the crawler finds a subscription form, the page is non-searchable.
Rule 10: If a <form> tag exists but the crawler finds a mailing-list subscription, the page is non-searchable.

Figure 2 shows a diagrammatic view of the above rules.
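The sketch below approximates Rules 1-10 with Beautiful Soup. The repository of known attribute types is passed in as a plain set, the keyword list covers the login/registration/subscription rules, and the "button marker" check of Rules 5-6 is not modelled, so this is an illustrative reduction of the rule set rather than the authors' exact rejection framework.

```python
from bs4 import BeautifulSoup

NON_SEARCHABLE_HINTS = ("login", "sign in", "register", "registration",
                        "subscribe", "mailing list")


def is_searchable_form(html, known_attribute_types):
    """Apply an approximation of rejection Rules 1-10 to one page."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:                                             # Rule 1
        return False
    form_text = form.get_text(" ", strip=True).lower()
    if any(hint in form_text for hint in NON_SEARCHABLE_HINTS):  # Rules 7-10
        return False
    controls = form.find_all(["input", "select", "textarea"])
    types = {c.get("type", c.name) for c in controls}
    if not types & set(known_attribute_types):                   # Rule 2
        return False
    if len(controls) < 3:                                        # Rule 3
        return False
    has_submit = (form.find("button") is not None
                  or any(c.get("type") in ("submit", "image") for c in controls))
    if not has_submit:                                           # Rule 4
        return False
    return True                               # Rules 5-6 (button marker not modelled here)
```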


FIGURE 2. Rejection framework for forms.

E. FORM STRUCTURE EXTRACTION
After a webpage with a form is found, the content of the form is extracted. Search forms have controls that a human can easily fill in and submit; if a crawler has to fill the forms automatically, it must have a set of resources to fill and submit the forms with suitable values. A task-specific database called the repository is initialized at launch. This database contains the set of values for filling the forms, created by parsing the forms as shown in Table II, and a form-element table is created with the control element type, label and domain values. The crawler adaptively learns filling values for the associated forms; when the first run of the crawler is completed, the parsed values are analyzed to collect data. Form submission is of two types, POST form submission and GET form submission, and this crawler works with both. After the form is submitted, the crawler gets the response status, which is either a valid page or a page-not-found code. The following steps are performed during form parsing:

• Using the Requests library of Python, an HTTP GET request is sent to the URL of a webpage.
• The response to the HTTP request is the HTML content of the webpage.
• The data is fetched and parsed using Beautiful Soup.
• HTML tags and their attributes are analysed.
• The data is output to a CSV file.
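The parsing steps listed above can be sketched as follows; the CSV column names and the choice of attributes recorded for each control are assumptions made for illustration.

```python
import csv

import requests
from bs4 import BeautifulSoup


def parse_form_page(url, out_csv="form_elements.csv"):
    """Fetch a page over HTTP GET, extract its form controls and write them to CSV."""
    response = requests.get(url, timeout=10)             # HTTP GET request to the page URL
    soup = BeautifulSoup(response.text, "html.parser")   # parse the returned HTML content
    rows = []
    for form in soup.find_all("form"):
        for control in form.find_all(["input", "select", "textarea", "button"]):
            rows.append({
                "control": control.name,
                "type": control.get("type", ""),
                "name": control.get("name", ""),
                "value": control.get("value", ""),
            })
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["control", "type", "name", "value"])
        writer.writeheader()
        writer.writerows(rows)                           # output the analysed data to CSV
    return rows
```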
F. FORM AND RESPONSE ANALYSIS
Forms have multiple control elements, which can be of any of the following types:

• Text: this area of a form can be edited with multiple lines of words.
• Input: this editable area has the following attribute types: text, submit, checkbox and radio button.
• Select: select offers two options, a drop-down list box and a multi-choice list box.

TABLE II. Contents of the repository.

Control Element | Label (Visible Form Fields) | Domain | Type of Domain | Size | Status
Submit | Search | Submit | Infinite | More than 3 kb | VR
Radio | Flight trip | Round trip, One-way trip | Bounded | More than 3 kb | VR
Select | From, to | Name of place (e.g., Delhi to America) | Bounded | More than 3 kb | VR

Two heuristics are implemented based on visible fields: if the number of visible fields is one or two, forms are classified using query probing, and label extraction is used otherwise. As explained in [32], form submission involves problems such as 404 error pages and duplicate information, and sometimes all the information is retrieved in a single submission while in other cases multiple submissions are required. The proposed crawler has also faced these problems.


G. QUERY PROBING
The aim of query probing is to develop a set of queries for each class; on submitting these queries, the crawler will retrieve the same documents for that category. Our approach is similar to [33], but we have implemented hierarchical classification, and the system can expand the number of classes as it crawls more URLs. Currently, the classes are based on the seed dataset.

H. FORM SUBMISSION
Two techniques are associated with form submission:

HTTP POST: In the post-query technique, forms are submitted with (name, value) tuples. These pairs are sent encoded in the body of the request. Query probing is implemented with the POST method.

HTTP GET: In the get-query technique, form submission takes place by giving the (name, value) pairs in the URL. The pre-query technique is implemented with the GET method. In the GET method, the URL uses three symbols: a question mark (?), an equals sign (=) and an ampersand (&). The (?) separates the encoded (name, value) pairs from the base URL and action path, while (=) and (&) separate the field names and field values.
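The two submission styles map directly onto the Requests library, as sketched below; the endpoint URL and field names are hypothetical.

```python
import requests

SEARCH_URL = "https://example.com/flights/search"   # hypothetical search form action

# HTTP GET: (name, value) pairs are appended to the URL after '?', joined by '&'
get_response = requests.get(SEARCH_URL, params={"from": "Delhi", "to": "London"})
print(get_response.url)           # e.g. https://example.com/flights/search?from=Delhi&to=London

# HTTP POST: the same pairs are form-encoded in the body of the request instead
post_response = requests.post(SEARCH_URL, data={"from": "Delhi", "to": "London"})
print(post_response.status_code)  # a valid response page returns 200; 404 means page not found
```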

I. STOPPING CRITERIA
Exhaustive crawling is a waste of resources, so this system implements the following stopping criteria.

• Maximum depth of crawl: the crawler stops following links when a depth of three is reached. It is shown in [23] that most deep web pages are found by depth 3. At each depth, the maximum number of pages to be crawled is 100.
• At any depth, the maximum number of forms to be found is 100 or fewer.
• If the crawler is at depth 1 and has crawled 50 pages but no searchable form has been found, it moves directly to the next depth. The same rule is followed at depth 2: if, at depth 2, 50 pages are crawled and no searchable form is found, the crawler fetches a new link from the URL frontier.

J. ASSUMPTIONS AND THRESHOLDS
• The size of the frontier should not decrease below 100 URLs at a time; as the number decreases, the crawler takes URLs from the link frontier and the fetched-link frontier.
• The learning threshold is 80 new sites and 100 new searchable forms.
• URLs are picked out of the frontier in first-in, first-out order.
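The stopping criteria and thresholds above can be captured in a few lines, as in the sketch below; the in-memory deques stand in for the three frontiers, and all constants come directly from the limits listed in Sections I and J.

```python
from collections import deque

MAX_DEPTH = 3                 # deep web pages are mostly found by depth 3 [23]
MAX_PAGES_PER_DEPTH = 100
MAX_FORMS_PER_DEPTH = 100
MIN_FRONTIER_SIZE = 100       # refill threshold for the seed-URL frontier

seed_frontier = deque()          # URLs taken from the seed directory
link_frontier = deque()          # URLs extracted from seed pages
fetched_link_frontier = deque()  # URLs extracted from link pages


def next_url():
    """Pop URLs in first-in, first-out order, refilling the seed frontier
    from the other two frontiers when it falls below the threshold."""
    for backup in (link_frontier, fetched_link_frontier):
        while backup and len(seed_frontier) < MIN_FRONTIER_SIZE:
            seed_frontier.append(backup.popleft())
    return seed_frontier.popleft() if seed_frontier else None


def should_stop(depth, pages_at_depth, forms_at_depth):
    """Stopping criteria: depth limit plus per-depth page and form limits."""
    return (depth > MAX_DEPTH
            or pages_at_depth >= MAX_PAGES_PER_DEPTH
            or forms_at_depth >= MAX_FORMS_PER_DEPTH)
```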
K. DISTRIBUTION
The architecture above (Figure 1) describes the working of a single entity of the focused crawler for the hidden web. The Redis server is implemented as shared storage for URLs. Redis stores information in memory, unlike disk-based databases, which is why information access is faster. The proposed crawler is developed in Python. Scrapy is an application framework that helps to extract web pages and structured data. For distributed crawling, Scrapy and Redis are integrated to run more than one server. The crawler is implemented with breadth-first search per host. Data can be extracted either by using the API of a website or by accessing the webpage and extracting the information. To create a tree structure of the HTML data, the html5lib parser library is used, and Beautiful Soup is used to navigate the parse tree; it can pull any type of data. Figure 3 explains the distribution using multiple Redis instances to make the system fault-tolerant. The links captured from the web must also be shared with all the crawlers.

FIGURE 3. Distribution of SIMHAR based on Redis server.
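The paper integrates Scrapy with Redis but does not name a specific extension; one common way to wire them together is the scrapy-redis package, sketched below under that assumption. The scheduler and duplicate filter then live in the shared Redis instance, so any number of crawler processes can cooperate on the same queue.

```python
# settings.py (excerpt): route scheduling and request de-duplication through Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the shared queue between runs
REDIS_URL = "redis://localhost:6379"     # assumed location of the Redis server

# spiders/simhar_spider.py: a spider that reads its seed URLs from a Redis list
from scrapy_redis.spiders import RedisSpider


class SimharSpider(RedisSpider):
    name = "simhar"
    redis_key = "simhar:start_urls"      # other nodes LPUSH seed URLs onto this key

    def parse(self, response):
        # extract links and hand them back to the shared scheduler
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```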


L. JOB SCHEDULING
Because Scrapy has no mechanism for link sharing, even though it has a scheduler, the URLs are shared from the memory of the crawler. Figure 4 shows the step-wise implementation of job scheduling; the task of the job scheduler is to prevent overloading the websites.

Web pages are popped out of the frontier in a first-in, first-out manner; we have considered this as our baseline assumption. Crawling in breadth-first fashion is implemented as URL- and server-based, which is shown in [34] to yield promising results. The following steps are performed in job scheduling.

Step 1. To start crawling, Scrapy sends a schedule-request message to a crawler.
Step 2. As the crawler receives the request, it starts crawling: a URL is selected from the Redis URL queue and sent as a request to the scheduler.
Step 3. The scheduler receives the URL request and forwards it to the Redis request queue, after which contact (request scheduled) is made with Scrapy again.
Steps 4, 5. The associated webpage now has to be downloaded: the request is popped from the top of the request queue, and the downloader, on receiving the request, downloads the requested page.
Steps 6, 7. After getting the contents of the page, the downloader submits the page to the crawler.
Steps 8, 9. The crawler parses the webpage, collects the new URLs and sends the new list of URLs to the Redis pipeline, which sends these URLs to the Redis queue.

FIGURE 4. Job scheduling in the proposed crawler.
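Steps 1-9 can be prototyped with plain redis-py lists acting as the URL and request queues, as in the sketch below; the queue key names are assumptions, and extract_links stands for the crawler's own link-extraction routine, supplied by the caller.

```python
import redis
import requests

r = redis.Redis(host="localhost", port=6379, db=0)

URL_QUEUE = "simhar:url_queue"          # URLs waiting to be scheduled (steps 2-3)
REQUEST_QUEUE = "simhar:request_queue"  # requests scheduled for download (steps 4-5)


def schedule_one():
    """Steps 2-3: take a URL from the Redis URL queue and schedule it as a request."""
    item = r.brpop(URL_QUEUE, timeout=5)    # blocking pop returns (key, value) or None
    if item:
        r.lpush(REQUEST_QUEUE, item[1])


def download_and_parse(extract_links):
    """Steps 4-9: pop a scheduled request, download the page and feed new URLs back."""
    item = r.brpop(REQUEST_QUEUE, timeout=5)
    if not item:
        return
    url = item[1].decode("utf-8")
    page = requests.get(url, timeout=10).text   # the downloader fetches the requested page
    new_urls = extract_links(page)              # the crawler parses the page (caller-supplied)
    if new_urls:
        r.lpush(URL_QUEUE, *new_urls)           # the pipeline pushes new URLs to the Redis queue
```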

IV. CONFIGURATION
The system hardware environment includes an Intel® Core™ C5-7200 CPU @ 2.50 GHz (2.70 GHz), 12.0 GB of installed RAM, and Redis 3.0.509; the crawler is implemented in Python. The internet speed during the experiment was 50-100 Mbps.

V. EVALUATION
The crawler first has to check whether a page belongs to the hidden web or not, by following the rules in Figure 2. After the seed database URLs are checked for a <form> tag, the crawler has to pull the contents of the webpage using the URL. The Requests library helps make use of HTTP within the Python program, and Beautiful Soup can extract any type of data from a webpage. After the HTML markup is removed, the page is saved for further processing. Beautiful Soup is combined with urllib3 to work with web pages; another way is to download a copy of the webpage and then use it locally. Beautiful Soup has a feature called "prettify", with which all the unnecessary tags can be dropped. We have selected 6 domains from the dataset, which contains more than 260000 associated URLs.

Initially, the DMOZ dataset is used. The performance of the classifier is measured using a confusion matrix: the rows of the confusion matrix denote the actual class, while the columns indicate the classes predicted by the SVM and KNN classifiers. We have computed the accuracy for each class, and the average over the classes denotes the performance of the classifier. The performance metrics are precision, recall and F1. Precision is the portion of classified webpages that are relevant to the class; it indicates how correctly the system rejects web pages that are not relevant. Recall indicates how correctly the classifier can find relevant documents. In Redis, multiple jobs are separated using unique keys, so jobs are not mixed.

Table III shows the description of the status codes for which web forms cannot be submitted. Table IV shows the values of precision, recall and accuracy, using a support vector machine, for the status codes mentioned in Table III; this shows how accurately the system has detected the forms which cannot be submitted. Table V shows the confusion matrix for correctly submitted forms.


Tables VI and VII show the submission accuracy results using the support vector machine and k-nearest neighbor algorithms; the results show that SVM performed better than KNN. Figure 5 shows the percentage of forms submitted per class. This percentage can vary with the number of URLs, and the size of the database decreases as many of the URLs that do not fulfill the criteria for searchable forms are rejected. Figure 6 shows the comparison of similarity detection using cosine similarity, Simhash and the hybrid Redis+Simhash technique; from the figure it is evident that Redis+Simhash gives better results.
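The paper does not name a specific machine-learning library; assuming scikit-learn, the per-class precision, recall, F1 and confusion matrices reported in Tables IV-VII can be produced along the lines of the sketch below, where the vectorizer settings, split ratio and classifier hyperparameters are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def evaluate(pages, labels):
    """pages: text of the response pages; labels: their class (Flight, Book, ...)."""
    X = TfidfVectorizer(stop_words="english").fit_transform(pages)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42, stratify=labels)
    for name, clf in (("SVM", SVC(kernel="linear")),
                      ("KNN", KNeighborsClassifier(n_neighbors=5))):
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        print(confusion_matrix(y_test, y_pred))       # rows: actual class, columns: predicted
        print(classification_report(y_test, y_pred))  # per-class precision, recall and F1
```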
TABLE III. Status codes and their description.

Status code | Description
200 | Asynchronous response
400 | Bad request error
404 | Page not found
413 | Payload too large (request entity too large)
414 | URI too long
500 | Internal server error
503 | Service unavailable error

TABLE IV. Precision, recall and F1 score for status codes, using SVM.

Status code | Precision | Recall | F1 score | Support
200 | 0.65 | 0.95 | 0.79 | 875
400 | 0.00 | 0.00 | 0.00 | 72
404 | 0.56 | 0.20 | 0.30 | 168
413 | 0.00 | 0.00 | 0.00 | 60
414 | 0.00 | 0.00 | 0.00 | 16
500 | 0.00 | 0.00 | 0.00 | 13
503 | 0.97 | 0.85 | 0.91 | 1150
Accuracy | | | 0.89 | 4761
Macro avg | 0.40 | 0.38 | 0.37 | 4761
Weighted avg | 0.88 | 0.89 | 0.88 | 4761

Table IV shows the performance of the system in terms of classification of web forms that cannot be submitted. The system then has to submit the searchable forms; for this, a confusion matrix is generated and the classification performance is checked using SVM and KNN.

TABLE V. Confusion matrix for correctly submitted web pages (rows: actual class; columns: predicted class).

Actual class | Flight | Book | Hotel | Product | Music | Auto | Total
Flight | 2369 | 0 | 0 | 0 | 0 | 0 | 2369
Book | 0 | 779 | 3 | 26 | 0 | 67 | 875
Hotel | 0 | 56 | 23 | 0 | 0 | 14 | 93
Product | 0 | 105 | 0 | 55 | 0 | 8 | 168
Music | 0 | 46 | 0 | 0 | 0 | 0 | 46
Auto | 0 | 11 | 0 | 2 | 0 | 7 | 20

TABLE VI. Precision, recall and F1 score for correctly submitted forms, using SVM.

Class | Precision | Recall | F1 score
Flight | 1.0 | 1.0 | 1.0
Book | 0.99 | 0.97 | 0.98
Hotel | 0.97 | 0.88 | 0.39
Product | 0.25 | 0.68 | 0.44
Music | 0.33 | 0.0 | 0.0
Auto | 0.39 | 0.73 | 0.12

TABLE VII. Precision, recall and F1 score for correctly submitted forms, using KNN.

Class | Precision | Recall | F1 score
Flight | 0.70 | 0.89 | 0.79
Book | 0.79 | 0.25 | 0.38
Hotel | 0.66 | 0.33 | 0.44
Product | 0.00 | 0.00 | 0.00
Music | 0.00 | 0.00 | 0.00
Auto | 0.92 | 0.91 | 0.91


FIGURE 5. Percentage of correctly submitted forms per class.

FIGURE 6. Comparison of similarity detection of SIMHAR with cosine similarity and Simhash.

VI. DISCUSSION

A. MAIN CONTRIBUTION
After carefully reviewing the literature, we found that focused hidden web crawling using distribution is still an unexplored area. The crawler named SIMHAR is proposed and implemented based on this research gap. The results have shown that the crawler detects and submits forms efficiently. Similarity detection in SIMHAR is a hybrid technique based on Simhash and the Redis server, and the results are promising when compared with existing technologies. The following points discuss the other contributions:

1. The values of forms submitted per class (Figure 5) can vary with the number of available URLs, because not all URLs fall into the category of hidden web sites. Even if the dataset is huge, these values depend on the searchable forms.
2. This crawler is a successful implementation of focused crawling in the hidden web; the crawler is then put into distribution mode.
3. Similarity is implemented in two ways: cosine similarity is first modified by adding information gain, and it is then used for finding similar terms for the top-k queries.
4. The system is scalable, as it can handle the size of growing web databases, and it is scalable in terms of processing and storage as well.
5. The crawler inherits Redis security and fault tolerance: if one Redis server stops responding, the other takes over, as shown in Figure 3. The Redis port can be opened for specific clients only; in our future work, we will build on this by implementing the crawler for client-based web harvesting.
6. The crawler is efficient in computing similarity with the Redis and Simhash combination.
7. The crawler works with both pre-query and post-query approaches.
8. We have implemented stopping criteria with which the crawler's resources will never deplete, as the limits are fixed for forms as well as newly found URLs.

VII. CONCLUSION
In this paper, we have proposed a Redis-based distributed web crawler for the hidden web called SIMHAR. This crawler is practical, i.e., its retrieved pages can be used for indexing and harvesting. The evaluation is detailed within a controlled environment using DMOZ, jasmine and amazon as the seed directories. We expect that the system, when expanded to full scale, can perform even better. Future work will introduce deep learning and enhanced rejection criteria, so that the goal of minimizing visits while maximizing the number of hidden websites could produce even more promising results.

REFERENCES
[1] A. Heydon and M. Najork, "Mercator: A scalable, extensible Web crawler," World Wide Web, vol. 2, no. 4, pp. 219–229, 1999.
[2] G. Mohr, M. Stack, I. Rnitovic, D. Avery, and M. Kimpton, "An Introduction to Heritrix," 4th International Web Archiving Workshop, pp. 109–115, 2004.
[3] M. Cafarella and D. Cutting, "Building Nutch: Open Source Search," Queue, vol. 2, no. 2, pp. 54–61, 2004.
[4] M. Yadav and N. Goyal, "Comparison of Open Source Crawlers - A Review," International Journal of Scientific & Engineering Research, vol. 6, no. 9, pp. 1544–1551, 2015.
[5] M. Herrmann, K. Ning, C. Diaz, and B. Preneel, "Description of the YaCy Distributed Web Search Engine," 2014.
[6] Y. Li, Y. Wang, and J. Du, "E-FFC: An enhanced form-focused crawler for domain-specific deep web databases," Journal of Intelligent Information Systems, vol. 40, no. 1, pp. 159–184, 2013.

[7] L. Barbosa and J. Freire, "An adaptive crawler for locating hidden-web entry points," Proceedings of the 16th International Conference on World Wide Web - WWW '07, p. 441, 2007.
[8] L. Barbosa and J. Freire, "Searching for Hidden-Web Databases," Proceedings of WebDB, vol. 5, pp. 1–6, 2005.
[9] B. Zhou, B. Xiao, Z. Lin, and C. Zhang, "A distributed vertical crawler using crawling-period based strategy," Proceedings of the 2010 2nd International Conference on Future Computer and Communication, ICFCC 2010, vol. 1, pp. 306–311, 2010.
[10] W. Gao, H. C. Lee, and Y. Miao, "Geographically focused collaborative crawling," Proceedings of the 15th International Conference on World Wide Web - WWW '06, p. 287, 2006.
[11] J. Yu, M. Li, and D. Zhang, "A Distributed Web Crawler Model based on Cloud Computing," no. 66, pp. 276–279, 2016.
[12] F. Ye, Z. Jing, Q. Huang, and C. Hu, "The Research and Implementation of a Distributed Crawler System Based on Apache Flink," vol. 3, pp. 90–98.
[13] H. M. Moftah and S. M. Abuelenin, "Elastic Web Crawler Service-Oriented Architecture Over Cloud Computing," 2018.
[14] D. Gunawan, Amalia, and A. Najwan, "Improving Data Collection on Article Clustering by Using Distributed Focused Crawler," Journal of Computing and Applied Informatics, vol. 1, no. 1, pp. 39–50, 2017.
[15] H. T. Y. Achsan and W. C. Wibowo, "A fast distributed focused-web crawling," Procedia Engineering, vol. 69, pp. 492–499, 2014.
[16] M. Bošnjak, E. Oliveira, J. Martins, E. M. Rodrigues, and L. Sarmento, "TwitterEcho - A Distributed Focused Crawler to Support Open Research with Twitter Data," WWW - MSND Workshop, pp. 1233–1239, 2012.
[17] H. Xu, K. Li, and G. Fan, "An Improved Strategy of Distributed Network Crawler Based on Hadoop and P2P," vol. 2, pp. 849–855, 2019.
[18] J. Madhavan, D. Ko, and A. Rasmussen, "Google's Deep-Web Crawl," pp. 1241–1252.
[19] A. Bergholz and B. Chidlovskii, "Crawling for domain-specific hidden web resources," Proceedings - 4th International Conference on Web Information Systems Engineering, WISE 2003, pp. 125–133, 2003.
[20] J. Cope, N. Craswell, and D. Hawking, "Automated Discovery of Search Interfaces on the Web," Fourteenth Australasian Database Conference (ADC2003), vol. 17, pp. 181–189, 2003.
[21] L. Barbosa and J. Freire, "Combining classifiers to identify online databases," Proceedings of the 16th International Conference on World Wide Web - WWW '07, p. 431, 2007.
[22] A. Kashyap, V. Hristidis, M. Petropoulos, and S. Tavoulari, "Effective navigation of query results based on concept hierarchies," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, pp. 540–553, 2011.
[23] K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang, "Structured databases on the web," ACM SIGMOD Record, vol. 33, no. 3, p. 61, 2004.
[24] J. Madhavan et al., "Structured data meets the web: A few observations," IEEE Data Eng. Bull., vol. 29, no. 4, pp. 19–26, 2006.
[25] L. Barbosa and J. Freire, "Siphoning Hidden-Web Data through Keyword-Based Interfaces," vol. 1, no. 1, pp. 133–144, 2010.
[26] A. Ntoulas, "Keyword Queries," Framework, pp. 100–109, 2005.
[27] J. Caverlee, L. Liu, and D. Buttler, "Probe, cluster, and discover: Focused extraction of QA-pagelets from the deep web," ICDE, no. 1, pp. 103–115, 2004.
[28] T. Furche et al., "DIADEM: Thousands of Websites to a Single Database," Proceedings of the VLDB Endowment, vol. 7, no. 14, 2014.
[29] P. Liakos, A. Ntoulas, A. Labrinidis, and A. Delis, "Focused crawling for the hidden web," World Wide Web, vol. 19, no. 4, pp. 605–631, 2016.
[30] C. Sadowski and G. Levin, "SimHash: Hash-based Similarity Detection," technical report, pp. 1–10, 2007.
[31] G. Valkanas and A. Ntoulas, "Rank-Aware Crawling of Hidden Web sites," WebDB, pp. 1–6, 2011.
[32] S. Liddle, D. Embley, D. Scott, and S. H. Yau, "Extracting Data Behind Web Forms," Lecture Notes in Computer Science, no. 2784, pp. 402–413, 2003.
[33] P. G. Ipeirotis, L. Gravano, and M. Sahami, "Automatic classification of text databases through query probing," Lecture Notes in Computer Science, vol. 1997, pp. 245–255, 2001.
[34] C. Castillo, A. Nelli, and A. Panconesi, "A memory-efficient strategy for exploring the Web," Proceedings - 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), pp. 680–686, 2007.

Sawroop Kaur received her M.Tech degree from Lovely Professional University, Punjab, India. She is currently pursuing a PhD in computer science at Lovely Professional University.

G. Geetha is a Professor in Computer Science and heads the Division of Research and Development at Lovely Professional University, Punjab. She works in the area of cybersecurity. She has published several research papers and has graduated six PhD students.
