Dark Web Crawling Using Focused and Classified Algorithm
Corresponding Author:
Putri Rahmasari Yunelfi
Department of Computer Engineering
School of Electrical Engineering, Telkom University
Bandung, Indonesia
Email: putriyunelfi@student.telkomuniversity.ac.id
1. INTRODUCTION
The dark web is a hidden network whose information is intentionally concealed from the surface web. Accessing its resources requires specific configuration, methods, and assistive software such as TOR. In this layer of the web, considerable effort is made to maintain user anonymity [1]. Activity on the dark web falls into two categories, legal and illegal [2]. One example of abuse is websites that openly offer confidential company, government, and other sensitive information for misuse. Such behavior can have a negative impact on the companies, governments, or individuals concerned. In addition, common illegal activities on the dark web include buying and selling goods prohibited by the state, such as drugs, organs, and alcohol.
One solution that can be used to counter the use of the dark web for illegal activities is the web crawler. A web crawler, now commonly known simply as a crawler, is software that can surf the web automatically and extract data from the internet [3]. A good crawler has several properties: it is extensible and able to expand its bandwidth; robust, working on both static and dynamic web pages; scalable to new data and protocols; quality-aware, judging content from the selected index; able to differentiate and remove duplicate content submitted in multiple passes; able to exclude content that is prohibited from crawling; and able to remove spam from blacklisted, low-priority URLs [4]. Web crawlers, or spiders, can automatically index web pages, and a crawler database is used to store the HTML documents [5]. In practice, a crawler can be implemented as a socket-communication-based crawler, an HTTP-protocol-based crawler, a PhantomJS-based interfaceless (headless) browser crawler, or a Selenium-based interface browser crawler [6]. Web crawler methods are widely available today, especially for conventional websites. However, the dissemination of sensitive data to the public does not happen only on conventional websites, and few crawler methods are available for the dark web, considering that accessing the dark web is quite difficult and requires a different approach from conventional sites.
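To make the general crawl loop concrete, the following minimal Python sketch fetches a page, extracts its links, and queues unvisited URLs. The libraries used (requests, BeautifulSoup) and the breadth-first strategy are illustrative choices on our part, not a description of any of the cited crawlers.

```python
# Minimal breadth-first crawler sketch (illustrative only).
import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin

def crawl(seed_url, max_pages=50):
    queue = deque([seed_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])  # resolve relative links
            if link not in visited:
                queue.append(link)
    return visited
```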
The crawling method used on the dark web is different because the dark web is, fundamentally, a network beneath the normal internet protected by layers of encryption. Because of this multi-layer encryption, not everyone can access the dark web. Beyond the encryption, the dark web is also popular because its anonymity system makes it practically impossible to identify the perpetrators of illegal transactions. Dark web crawling is done through deep crawling of specific web pages on the dark web using a TOR proxy [7]. One crawling method suited to exploring the dark web is focused crawling, as it can find links on potentially relevant websites while avoiding irrelevant areas of a site [8][9]. Focused crawling is a method in which a crawler automatically crawls toward relevant web pages. Targeted exploration is steered by the harvest rate, computed from precision and recall, and incorporates priority, learning, evaluation, and training strategies so that pages in specific areas are collected before being transferred to local repositories [3][5][10][11]. By crawling the dark web in a focused manner, one can collect relevant URLs on particular topics from the dark web, classifying them according to the keywords being searched.
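As an illustration of the priority strategy in focused crawling, the sketch below orders the frontier by a relevance score and only expands pages that pass a threshold. The keyword-counting score, the threshold value, and the fetch/extract_links helpers are simplifying assumptions on our part, standing in for the harvest-rate and learning strategies of [3][5][10][11].

```python
# Focused-crawling sketch: links are expanded in order of estimated relevance.
import heapq

def relevance_score(page_text, keywords):
    """Toy relevance measure: fraction of keywords present in the page text."""
    text = page_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

def focused_crawl(seed_url, keywords, fetch, extract_links,
                  threshold=0.5, max_pages=100):
    # fetch(url) -> page text or None; extract_links(url, text) -> iterable
    # of URLs. Both are assumed helpers supplied by the caller.
    frontier = [(-1.0, seed_url)]  # max-heap via negated scores
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text = fetch(url)
        if text is None:
            continue
        score = relevance_score(text, keywords)
        if score >= threshold:
            relevant.append(url)  # keep on-topic URLs
            for link in extract_links(url, text):
                if link not in visited:
                    # children inherit the parent's score as a priority estimate
                    heapq.heappush(frontier, (-score, link))
        # below threshold: do not expand, avoiding irrelevant regions
    return relevant
```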
3. METHOD
This system uses the focused crawling method to explore sites. The crawler searches for relevant data on onion sites as quickly and efficiently as possible. It works in the following steps: User, Site Locating, In-site Exploring, Duster, and Pre-query.
1. User: In this step, the user provides a link as input in the search query. Based on this input, the crawler searches for content or topics relevant to the user's needs.
2. Site Locating: During the site locating phase, the crawler finds the most relevant results for a particular topic, including results in different website formats. This phase consists of three stages: site collection, site ranking, and site classification. It collects all websites, both visited and unvisited, using a reverse search method. After all previously unvisited sites are collected, each is ranked according to relevance and then classified. If a website is relevant, the crawling process starts; otherwise, the website is ignored and a new website is recorded.
3. In-site Exploring: The in-site exploring phase uses two stages, link ranking and form classification. Link ranking assigns a priority to each crawled link in order to reach its forms. Form classification then collects and classifies the contents of the forms to obtain accurate results.
4. Duster: The crawling results will contain duplicate links. Duster is used to detect and remove them: it is a technique for detecting and eliminating DUST, that is, different URLs with similar text, from search engines. Duster also normalizes the collected data and prevents waste of resources.
5. Pre-Query: A pre-query approach is used to improve the efficiency of the crawling engine for word-by-word search. Pre-query manages history: every time the user issues a search query, the database is checked first, and if results exist there, they are returned to the user. If the request is not in the database, the data is searched according to the user's request and the results are stored in the database. A simplified sketch of the Duster and pre-query stages is given after this list.
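The following sketch illustrates, under simplifying assumptions, how the Duster and pre-query stages could fit together: URL normalization stands in for DUST detection, and an in-memory dictionary stands in for the query-history database.

```python
# Sketch of the Duster + pre-query stages (simplified assumptions).
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Duster-style normalization: collapse trivially different URLs
    (case, trailing slash, fragments) to one canonical form."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def deduplicate(urls):
    """Drop DUST: different URLs with similar text pointing to one page."""
    seen, unique = set(), []
    for url in urls:
        canon = normalize(url)
        if canon not in seen:
            seen.add(canon)
            unique.append(url)
    return unique

query_cache = {}  # pre-query history: query -> previously stored results

def pre_query(query, run_crawl):
    """Serve repeated queries from history; crawl and store otherwise."""
    if query in query_cache:
        return query_cache[query]
    results = deduplicate(run_crawl(query))
    query_cache[query] = results
    return results
```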
A program designed to access web pages in this way is a form of web crawling. Search engines for the surface web use the crawl method to reach the pages they want to access. Fundamentally, general crawling on the surface web searches for pages by taking keywords from the information on the page one wants to open; the search engine then looks for URLs related to those keywords and displays the collection of relevant URLs.
Web crawlers basically work on the dark web the same way search engines crawl the surface web. Crawling on the dark web usually uses focused crawling to classify the content of web pages, again by entering keywords from the information on the page to be accessed. However, crawling the dark web requires a special network, TOR, to reach dark web pages, which are encrypted in layers. Crawling the dark web through TOR yields a collection of dark web page URLs that match the keywords.
4. RESULTS AND DISCUSSION
This paper shows the differences between the existing system and the proposed system by comparing the accuracy of the two. The experiment was carried out three times using three different input links. The results for each system are given in Table 1.
Experiment 1:
Url input: http://nanochanqzaytwlydykbg5nxkgyjxk3zsrctxuoxdmbx5jbh2ydyprid.onion/
Keyword: chan
Table 1. Existing System and Proposed System Result Comparison, URL Experiment 1 (results for each system used: Existing and Proposed)
The table shows that the existing crawling method finds more URLs than our system. However, the total number of relevant URLs found by our system is higher, so when precision is compared, the existing system reaches only 61.26%, far below our system's 85.32%. The difference in total URLs found between the existing system and ours arises from a difference of focus during crawling: instead of first collecting as many URLs as possible, our system collects URLs and calculates their relevance sequentially. The outcome can also differ between the two systems because of the internet connection or how a website's server reacts to the crawling process. Some relevant URL results obtained by our system in experiment 1 can be seen in Table 2.
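For clarity, the precision percentages quoted throughout follow the standard definition:

$$\text{precision} = \frac{\text{relevant URLs found}}{\text{total URLs found}} \times 100\%$$

For instance, our system's 93 relevant URLs at 85.32% precision imply roughly 109 URLs found in total, since 93/109 is approximately 0.8532.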
Table 2 lists some of the relevant URLs found in experiment 1, with input link http://nanochanqzaytwlydykbg5nxkgyjxk3zsrctxuoxdmbx5jbh2ydyprid.onion/ and search keyword "chan". It shows a sample of the 93 relevant URLs found by our system.
Experiment 2:
Url input: http://nv3x2jozywh63fkohn5mwp2d73vasusjixn3im3ueof52fmbjsigw6ad.onion/
Keyword: book
Table 3. Existing System and Proposed System Result Comparison, URL Experiment 2 (results for each system used: Existing and Proposed)
In Table 3, the crawling results are roughly the same as in experiment 1: the existing system again finds more total URLs than our system, but in terms of total relevant URLs our system does better, which is reflected in the precision, 76.92% for our system versus 57.28% for the existing system. The reason for the difference in total URLs found is the same as explained for experiment 1.
Table 4 lists some of the relevant URLs found in experiment 2, with input link http://nv3x2jozywh63fkohn5mwp2d73vasusjixn3im3ueof52fmbjsigw6ad.onion/ and search keyword "book". It shows a sample of the 60 relevant URLs found by our system.
Experiment 3:
Url input: http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/
Keyword: news
Table 5. Existing System and Proposed System Result Comparison, URL Experiment 3 (results for each system used: Existing and Proposed)
Table 5 presents the results of experiment 3. As in experiments 1 and 2, the existing system found more URLs than ours, but our system still found more relevant URLs in total. This is reflected in the precision: 87.05% for our system against 68.42% for the existing system.
Table 6 lists some of the relevant URLs found in experiment 3, with input link http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ and search keyword "news". It shows a sample of the 148 relevant URLs found by our system.
5. CONCLUSION
This crawler is able to perform its function on the dark web and achieves high precision. Given a query, the system starts work by not only collecting every available link but also classifying the results to find the most relevant ones and ranking the collected links against the query. The system then uses Duster to remove duplicates from the collected links, so that the precision percentage represents the true capability of this crawler. This paper conducted three experiments on dark web links with different topics. In the first experiment, with the keyword "chan", the existing system reaches a precision of only 61.26%, much lower than our system's 85.32%. In the second experiment, with the keyword "book", our system again performs better, with a precision of 76.92%, far above the existing system's 57.28%. In the third experiment, with the keyword "news", our system has a precision of 87.05% and the existing system 68.42%. Our system thus achieves the best precision in every case. However, when using a focused crawler, the right keywords must be entered; if the keywords do not describe the desired object, the results obtained will not be optimal.
REFERENCES
[1] S. M. M. Monterrubio, J. E. A. Naranjo, L. I. B. Lopez, and A. L. V. Caraguay, “Black widow crawler for TOR network to search for criminal
patterns,” Proc. - 2021 2nd Int. Conf. Inf. Syst. Softw. Technol. ICI2ST 2021, pp. 108–113, 2021, doi: 10.1109/ICI2ST51859.2021.00023.
[2] B. AlKhatib and R. Basheer, “Crawling the dark web: a conceptual perspective, challenges and implementation,” J. Digit. Inf. Manag., vol.
17, no. 2, p. 51, 2019, doi: 10.6025/jdim/2019/17/2/51-60.
[3] S. R. Mani Sekhar, G. M. Siddesh, S. S. Manvi, and K. G. Srinivasa, “Optimized focused web crawler with natural language processing based
relevance measure in bioinformatics web sources,” Cybern. Inf. Technol., vol. 19, no. 2, pp. 146–158, 2019, doi: 10.2478/cait-2019-0021.
[4] V. Shrivastava, "A methodical study of web crawler," J. Eng. Res. Appl., vol. 8, no. 11, pp. 1–8, 2018, doi: 10.9790/9622-0811010108.
[5] K. Velkumar and P. Thendral, “Web crawler and web crawler algorithms: a perspective,” Int. J. Eng. Adv. Technol., vol. 9, no. 5, pp. 203–
205, 2020, doi: 10.35940/ijeat.e9362.069520.
[6] T. Fang, T. Han, C. Zhang, and Y. J. Yao, “Research and construction of the online pesticide information center and discovery platform based
on web crawler,” Procedia Comput. Sci., vol. 166, pp. 9–14, 2020, doi: 10.1016/j.procs.2020.02.004.
[7] P. Koloveas, T. Chantzios, C. Tryfonopoulos, and S. Skiadopoulos, “A crawler architecture for harvesting the clear, social, and dark web for
IoT-related cyber-threat intelligence,” Proc. - 2019 IEEE World Congr. Serv. Serv. 2019, vol. 2642–939X, no. i, pp. 3–8, 2019, doi:
10.1109/SERVICES.2019.00016.
[8] M. Kumar, A. Bindal, R. Gautam, and R. Bhatia, “Keyword query based focused web crawler,” Procedia Comput. Sci., vol. 125, pp. 584–590,
2018, doi: 10.1016/j.procs.2017.12.075.
[9] A. Khazaie, N. B. Seghouani, and F. Bugiotti, “Smart crawling: a new approach toward focus crawling from Twitter,” 2021. [Online].
Available: http://arxiv.org/abs/2110.06022
[10] P. Mishra and A. Khurana, “Accuracy crawler: an accurate crawler for deep web data extraction,” 2018 Int. Conf. Control. Power, Commun.
Comput. Technol. ICCPCCT 2018, pp. 25–29, 2018, doi: 10.1109/ICCPCCT.2018.8574286.
[11] D. S. Santoso and R. V. H. Ginardi, “Kompresi multilevel pada metaheuristic focused web crawler,” JUTI J. Ilm. Teknol. Inf., vol. 17, no. 1,
p. 52, 2019, doi: 10.12962/j24068535.v17i1.a785.
[12] Y. Yang, G. Zhu, H. Yu, and L. Yang, “Crawling and analysis of dark network data,” ACM Int. Conf. Proceeding Ser., pp. 116–120, 2020,
doi: 10.1145/3379247.3379272.
[13] I. N. Husada, E. H. Fernando, H. Sagala, A. E. Budiman, and H. Toba, “Ekstraksi dan analisis produk di marketplace secara otomatis dengan
memanfaatkan teknologi web crawling,” J. Tek. Inform. dan Sist. Inf., vol. 5, no. 3, pp. 350–359, 2020, doi: 10.28932/jutisi.v5i3.1977.
[14] A. Gupta and G. N. Campus, Web Crawling Model and Architecture, May, 2021.