Dark Web Crawler Research Paper
Abstract— Due to the widespread use of powerful encryption algorithms and advanced anonymity routing, the field of cybercrime investigation has changed greatly, posing difficult obstacles for law enforcement agencies (LEAs). Consequently, LEAs are increasingly relying on unencrypted web information on anonymous communication networks (ACNs) as potential sources of leads and evidence for their investigations. LEAs have access to a significant tool for gathering and storing potentially important data for investigative purposes: automated web content harvesting from servers. Although web crawling has been studied since the early days of the internet, relatively little research has been done on web crawling on the "dark web" or ACNs such as IPFS, Freenet, Tor, I2P, and others.

This work offers a thorough systematic literature review (SLR) with the goal of investigating the characteristics and prevalence of dark web crawlers. After removing irrelevant entries, a refined set of 34 peer-reviewed publications about crawling and the dark web remained from an original pool of 58 articles. According to the review, most dark web crawlers are written in Python and frequently use Selenium or Scrapy as their main web scraping libraries.

The lessons learned from the SLR were applied to the creation of an advanced Tor-based web crawling model that was incorporated into an already-existing software toolbox designed for ACN-focused research. A series of thorough experiments was then conducted to analyze the model's performance and show that it was effective at extracting web content from both conventional and dark web sources. This work provides more than just a review; it advances our knowledge of ACN-based web crawlers and provides a reliable model for digital forensics applications, including the crawling and scraping of both clear and dark web domains. The study also emphasizes the important ramifications of retrieving and archiving content from the dark web, emphasizing how crucial it is for generating leads for investigations and offering crucial supporting evidence.

To sum up, this study highlights how important it is to keep researching dark web crawling techniques and how they might be used to improve cybercrime investigations. It also highlights promising directions for future study in this quickly developing sector, emphasizing how crucial it is to use cutting-edge technologies to effectively fight cybercrime in a digital environment that is becoming more complicated.

I. INTRODUCTION

Due to the high level of secrecy and restricted traceability provided by sophisticated encryption and anonymity protocols, cybercrime investigations on the Internet are becoming more and more difficult, especially inside elusive dark networks. The extensive protection of data traveling across these networks presents a considerable challenge to law enforcement agencies (LEAs) in their efforts to obtain evidence, requiring a substantial investment of time, labor, experience, and technology.

There are several software programs that allow access to the roughly six dark networks that are currently operational. Modern encryption and network traffic routing algorithms that leave few traces are among the features these products have in common, notwithstanding their differences. Because there are few traffic traces and it is not feasible to decrypt data in these networks, LEAs have to look for evidence in other ways. Tor, the biggest and most well-known anonymous communication network (ACN), is made up of a network of computers, some of which are web servers that are part of the so-called "dark web." Other ACNs, such as I2P, Freenet, IPFS, and Lokinet, also have servers that make up dark webs unique to their networks. For individuals residing in non-democratic nations, journalists wanting complete anonymity, and whistleblowers, Tor has become an indispensable resource. Tor's anonymity, however, is indiscriminate, helping both criminals and whistleblowers and creating a challenging environment for digital policing. Using current encryption methods, Tor's network encrypts data in many layers between servers, using a different key for each tier. With more than 8,000 servers, or relays, around the world, Tor distributes transmission pathways among numerous relays to create encrypted layers that resemble an onion and protect privacy. Considering the shortcomings of network traffic analysis and decryption inside ACNs, online content collection shows itself to be a useful workaround. One effective method of extracting unencrypted data from Tor websites without requiring a lot of manual labor is web crawling, also known as automated online content collection. Web crawling is widely utilized on the open web for commercial and archival purposes, but it has also been important in criminal investigations, with screenshots of illegal websites frequently serving as proof. The fleeting nature of Tor servers emphasizes how crucial it is to consistently record and store online information in order to preserve data integrity and ensure legal admissibility. Even though there are many different web crawler programs available, ACNs have not received as much attention in research as the clear web.

Because there is a dearth of research on dark web crawlers, it is critical to investigate this area in order to comprehend the particulars of dark web crawling in ACNs. This knowledge can help with the creation of useful tools for practitioners
and researchers who crawl webpages on ACNs.

This work not only adds to the body of knowledge but also describes the development and assessment of a Tor-based crawler, utilizing the knowledge gained from the literature review to improve an already-existing dark web research toolbox.

Within the scholarly community, the investigation of dark web crawlers is still a largely unexplored field. Through further exploration of this topic, and by defining the distinct features of dark web crawlers, particularly in light of the peculiarities of anonymous communication networks (ACNs) in contrast to the surface web, a thorough synopsis of the state of dark web crawler technology can be formulated. The purpose of this analysis is to improve knowledge about the design and workings of dark web crawlers. For academics and professionals working on the creation and implementation of efficient instruments for locating and examining content on the dark web, these insights may prove to be quite beneficial. In addition, this work presents and evaluates a Tor-based crawler, demonstrating its usefulness. The architecture of this crawler incorporates knowledge from the preceding literature review.

II. LITERATURE SURVEY

A. DISPOSITION

The work, which consists of two main research contributions, a systematic literature analysis and an experiment-based web crawler implementation, is organized into seven chapters and a bibliography. The foundational overview is given in the first chapter, which also explores the nuances of web crawling, website content, and anonymous communication networks, highlighting their importance in digital investigations. Chapter two expands on the scientific foundations that give rise to the research challenge and the development of the research questions, building on the introduction.

B. THE WORLD WIDE WEB

Before HTTP became the standard protocol for transferring information, other internet protocols, such as Gopher, fought for supremacy in the early 1990s, when the World Wide Web was beginning to take shape. The World Wide Web was widely adopted by 1994–1995 thanks to the graphical features and open nature of HTML, which finally overtook Gopher, an earlier text-based protocol centered on network file exchange. Running on top of the Internet Protocol (IP) and the Transmission Control Protocol (TCP), HTTP evolved into the industry standard for data transmission, including photos, videos, and HTML pages. The fundamental process of fetching an HTML page via HTTP has remained unchanged since its inception, with a client (typically a web browser) sending a GET request to a web server and receiving a corresponding response. A successful retrieval returns HTTP status code 200, while a failed attempt yields an error code such as 404, in line with the HTTP standard, which defines various other response codes. Despite revisions from HTTP versions 0.9 to 1.1, 2.0, and the latest HTTP/3, backward compatibility ensures uniformity across requests and responses, maintaining consistency with HTTP version 1.0. Essentially, web crawling automates the procedure of sending GET requests to websites, following embedded URLs, and saving the results that are obtained. Web crawlers can be standalone software entities or browser-based programs designed to communicate over HTTP. These days, a wide range of HTTP communication libraries for various programming languages, such as Lisp, Go, and Haskell, make it easier to create HTTP-based clients and servers, which speeds up the process of developing web crawling programs.
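To make the request and response cycle described above concrete, the short Python sketch below sends a single GET request, checks the returned status code, and collects the embedded links that a crawler would queue for later visits. It is a minimal illustration rather than one of the crawlers reviewed later; the target URL is a placeholder and the third-party requests library is assumed to be available.

    # Minimal sketch of the GET/parse/queue cycle described above (illustrative only).
    import requests                      # common third-party HTTP client
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect href values from <a> tags while parsing an HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch_links(url):
        """Send one GET request; return the status code and the absolute embedded URLs."""
        response = requests.get(url, timeout=30)
        parser = LinkCollector()
        parser.feed(response.text)
        return response.status_code, [urljoin(url, href) for href in parser.links]

    if __name__ == "__main__":
        status, links = fetch_links("https://example.com/")   # placeholder URL
        print(status, len(links), "embedded URLs found")

A full crawler simply repeats this step, feeding the collected URLs back into a queue and saving each response to storage.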
C. An Adaptive Crawler

In "An Adaptive Crawler for Locating Hidden Web Entry Points," Luciano Barbosa and Juliana Freire show that adaptive crawling strategies can be exceptionally efficient at locating the entry points of hidden web sources. These techniques keep the retrieved pages focused on the topic by giving priority to the links that are most relevant to the subject. This method maximises the application of learned information, allowing the identification of links that display hitherto unidentified patterns. As a result, the approach shows resilience and the capacity to correct for biases introduced throughout the learning process.

Mangesh Manke, Kamlesh Kumar Singh, Vinay Tak, and Amit Kharade's research article presents an advanced integrated crawling system designed specifically for exploring the deep web. The researchers of this extensive investigation introduce a novel adaptive crawler that utilizes offline and online learning mechanisms to train link classifiers. As a result, the crawler effectively gathers hidden web entry points.
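As a rough illustration of the idea behind such link classifiers (and not the actual Barbosa and Freire or Manke et al. implementations), the following Python sketch scores candidate links with a toy keyword-based classifier and keeps them in a priority frontier so that the most promising link is always crawled next. The keyword list and scoring function are invented for the example.

    # Illustrative sketch of a focused frontier driven by a link classifier.
    # The keyword scorer stands in for the learned classifiers discussed above.
    import heapq

    RELEVANT_TERMS = ("search", "query", "database", "form")   # assumed topic cues

    def score_link(url, anchor_text):
        """Toy link classifier: higher score = more likely to lead to an entry point."""
        text = (url + " " + anchor_text).lower()
        return sum(term in text for term in RELEVANT_TERMS)

    class FocusedFrontier:
        """Priority queue that always yields the most promising unvisited link."""
        def __init__(self):
            self._heap = []
            self._seen = set()
        def push(self, url, anchor_text=""):
            if url not in self._seen:
                self._seen.add(url)
                # heapq is a min-heap, so negate the score for best-first order.
                heapq.heappush(self._heap, (-score_link(url, anchor_text), url))
        def pop(self):
            return heapq.heappop(self._heap)[1]
        def __bool__(self):
            return bool(self._heap)

    frontier = FocusedFrontier()
    frontier.push("https://example.com/advanced-search", "search the database")
    frontier.push("https://example.com/about", "about us")
    print(frontier.pop())   # the search-related link is visited first

In the adaptive systems described above, the hand-written scorer would be replaced by a classifier trained offline and updated online as new pages are retrieved.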
D. WEBSITE ACQUISITION TOOLS

Both open-source and closed-source tools are available for forensic website acquisition; they are designed to archive and preserve web material in line with forensic science guidelines. Although some of these tools can be used for basic web crawling, and some can even be used in dark web contexts, their main purpose is not to serve as fully fledged web crawlers.

One notable feature of OSIRT, an intuitive web browser designed specifically for investigators, is its ability to support both ordinary and dark websites (such as Tor). OSIRT is widely used by law enforcement agencies in the United Kingdom. It helps to preserve the integrity of evidence in investigative processes by providing functions such as video recording, screenshot capture, and audit log generation.

Police departments throughout the world use FAW, a proprietary internet forensic collection tool that looks like a web browser, to collect web content from popular websites, social media platforms, and the dark web (Tor). Notably, FAW's collection of forensic tools includes the ability to crawl websites.

Another proprietary tool, called Hunchly, is available as an add-on for web browsers and is intended to suit the demanding needs of law enforcement personnel conducting
investigations. It supports the acquisition of content from both the clear and dark web (Tor).

E. WEB CRAWLERS

A number of privacy risks, such as the possibility of request leaks and other serious privacy breaches, might result from the mishandling or incorrect configuration of the Tor network. When the Tor network is not configured appropriately, users are exposed to grave privacy threats, creating opportunities for possible leaks.

Request leaks pose a serious risk since they can unintentionally reveal private information about a user's online activities outside of the Tor network. Misconfigured settings or bugs in the Tor client or connected applications may be the cause of this leakage. The main goal of Tor is to anonymize users and shield their identities and activities from monitoring or interception; request leaks undermine this goal.

Inadequate Tor network settings can also put users at risk of other privacy issues:

IP address exposure: Users' anonymity may be jeopardized if Tor is not configured correctly and their actual IP address is made public. This exposure may be the result of Tor Browser leaks or incorrectly set up proxies.

Traffic analysis: Insufficient setup can expose users to traffic analysis, the process by which attackers track and examine encrypted data flows over the network. The identification of users or their online activities may result from this analysis.

DNS leaks: When DNS requests are made outside of the Tor network, they reveal the websites that are being browsed. This might happen as a result of improper configuration; inadequate use of Tor-specific DNS resolution techniques or incorrectly configured settings can result in DNS leaks.

Exit node risks: Incorrect setups may affect the security and choice of Tor exit nodes. Users' traffic may be intercepted or monitored by malevolent actors if insecure or compromised exit nodes are used.

Following recommended methods when configuring Tor is crucial to reducing these dangers and guaranteeing strong privacy protection within the network. These best practices include: using the most recent version of the official Tor software; setting up programs to use Tor proxies correctly; enabling the Tor Browser's recommended privacy settings; avoiding third-party plugins and custom changes that can jeopardize anonymity; and analyzing and adjusting configuration parameters on a regular basis in response to security alerts and Tor network updates. Users can reduce the risk of request leaks and other privacy vulnerabilities by adhering to these instructions and keeping a close eye on Tor network parameters, protecting the integrity and efficacy of Tor's anonymization capabilities.
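One practical consequence of the leak risks described above is that a crawler must route both its HTTP traffic and its DNS resolution through Tor. The Python sketch below is a minimal, illustrative configuration that assumes a local Tor client listening on the default SOCKS port 9050 and the requests library with SOCKS support installed; the socks5h scheme makes the proxy resolve hostnames, which keeps DNS lookups inside Tor and avoids local DNS leaks.

    # Illustrative sketch: route a request through a local Tor SOCKS proxy.
    # Assumes a Tor client is running locally with its SOCKS listener on port 9050
    # and that requests[socks] (PySocks) is installed.
    import requests

    TOR_PROXIES = {
        # "socks5h" (not "socks5") asks the proxy to resolve hostnames,
        # so DNS lookups also go through Tor and do not leak locally.
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    def fetch_over_tor(url):
        """Fetch a clear-web or .onion URL through the Tor SOCKS proxy."""
        response = requests.get(url, proxies=TOR_PROXIES, timeout=60)
        return response.status_code, response.text

    if __name__ == "__main__":
        # check.torproject.org reports whether the request arrived via Tor.
        status, body = fetch_over_tor("https://check.torproject.org/")
        print(status, len(body), "bytes retrieved")

Onion addresses are fetched the same way; the only difference is that the hostname can only be resolved by the Tor proxy itself.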
F. Google's Deep Web Crawler

Jayant Madhavan, David Ko, Ju-wei Chiu, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy worked together on a project that would transform how deep-web material is found and used. They encountered and overcame the numerous obstacles that come with uncovering and using the deep web's enormous reservoirs and, as a result, created a clever solution. This technology is a ground-breaking advancement made possible by a sophisticated and flexible algorithm.

Fundamentally, this algorithm drives the effective navigation of the complex space of possible input combinations. It navigates the complex terrain of the deep web with systematic accuracy and forethought. Through methodical analysis and deliberate selection, the system finds and isolates the input combinations that are the key to unlocking valuable URLs. These URLs are then thoroughly examined and made ready for inclusion in the web search index.

The work by Madhavan, Ko, Chiu, Ganapathy, Rasmussen, and Halevy is evidence of human inventiveness and cooperation. Through their combined efforts, they have not only surmounted technical obstacles but also shed light on novel avenues for investigating and capitalizing on the concealed riches of the deep web. Their ground-breaking system is expected to have a significant and wide-ranging impact on the digital world as it develops and matures.
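The surfacing idea sketched above can be illustrated with a toy example; this is not Google's system or its actual algorithm. Candidate values are tried in a search form's input field, each combination is submitted as a GET request, and only the query URLs whose result pages appear informative are kept for indexing. The form URL, field name, candidate values, and quality heuristic are all invented for the illustration.

    # Toy illustration of deep-web surfacing: probe a search form with candidate
    # inputs and keep the query URLs that return informative result pages.
    # This is NOT the system described above; names and thresholds are invented.
    from urllib.parse import urlencode
    import requests

    FORM_ACTION = "https://example.com/search"   # placeholder form action URL
    FIELD_NAME = "q"                             # placeholder input field name
    CANDIDATE_VALUES = ["books", "music", "tools"]

    def looks_informative(html):
        """Crude stand-in for a result-quality check: long pages with many links."""
        return len(html) > 5000 and html.count("<a ") > 10

    def surface_urls():
        """Return the probe URLs whose result pages pass the quality heuristic."""
        kept = []
        for value in CANDIDATE_VALUES:
            url = FORM_ACTION + "?" + urlencode({FIELD_NAME: value})
            response = requests.get(url, timeout=30)
            if response.status_code == 200 and looks_informative(response.text):
                kept.append(url)     # candidate for inclusion in a search index
        return kept

    if __name__ == "__main__":
        for url in surface_urls():
            print("keep:", url)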
G. THE TOR WEBSITE AND NETWORK

Anonymous communication networks (ACNs), or dark networks, use the same transport protocols (TCP/IP) as the clear web, but they add their own anonymizing protocols. The clear web uses TCP/IP to send HTTP requests directly; Tor, for example, uses its onion routing (OR) protocol, and I2P uses garlic routing (GR), both of which are wrapped in TCP/IP. Within ACNs, these protocols make it possible for HTTP to be transmitted and used for serving web pages.

The idea of Onion Services, formerly known as Hidden Services, is essential to the Tor network. These services host websites that are only reachable through known URLs and are a component of the "dark web." In contrast to the ordinary internet, it is not possible to scan a range of IP addresses to find Onion Services on the Tor network. Furthermore, while theoretically feasible, trying to guess Onion Service URLs pseudorandomly is not a viable strategy. Tor websites, also called "onionsites," are identical to regular web pages in appearance and structure. They are made up of text, graphics, HTML, CSS, JavaScript, and other elements that are delivered over
HTTP. Onionsites' material, however, usually reflects the special qualities of existing on an anonymous communication network. Onionsites put anonymity, privacy, and secrecy above usability and speed. They also frequently refrain from using JavaScript because of the possibility that it could reveal a Tor user's true identity. Onionsites stand out from their clear web counterparts thanks to their focus on security and anonymity, which influences their design and content decisions within the Tor network.
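Because Onion Services cannot be enumerated by scanning IP ranges, dark web crawlers typically discover new onionsites by harvesting .onion addresses from pages they have already fetched. The Python sketch below shows one simple, illustrative way to do this with a regular expression for version 3 onion addresses (56 base32 characters followed by ".onion"); the sample text is invented.

    # Illustrative sketch: harvest v3 onion addresses from already-fetched text.
    # v3 onion addresses consist of 56 base32 characters (a-z, 2-7) plus ".onion".
    import re

    ONION_V3 = re.compile(r"\b[a-z2-7]{56}\.onion\b")

    def extract_onion_addresses(text):
        """Return the unique v3 .onion addresses found in a page's text."""
        return sorted(set(ONION_V3.findall(text.lower())))

    if __name__ == "__main__":
        sample = "Visit http://" + "a" * 56 + ".onion/index.html for more."  # invented
        print(extract_onion_addresses(sample))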
III. RESEARCH MOTIVATION

Although there is a wealth of literature on clear web crawlers, there is not a single thorough review that concentrates on dark web crawlers. Unlike the ordinary Internet and the conventional clear web, anonymous communication networks function under different protocols and setups, requiring programming and configuration specific to their own features.

As was noted in the preceding sections, both public and commercial organizations have created a variety of dark web crawlers, but a thorough scientific analysis of these instruments is noticeably absent. Conducting a thorough assessment of the dark web crawlers that are now in use and have been reported in scholarly publications is one of the main goals of this research project. This evaluation aims to improve our knowledge of the changing field of dark web crawling technologies by offering insight into the strengths and weaknesses of different crawlers. Another important goal is implementing the most popular dark web crawler found in the academic literature and customizing it to fit neatly into an already-existing toolset that currently lacks a reliable and all-inclusive crawler solution. The performance evaluation that follows will provide important information about this integrated crawler's effectiveness and suitability for use in investigative and analytical settings.

By tackling these goals, the study intends to close important information gaps about dark web crawling techniques and make a significant contribution to the creation and improvement of instruments for examining and navigating the complex world of anonymous communication networks. This thorough method emphasizes how important it is to have reliable and specialized crawling strategies that are suited to the particular opportunities and problems that the dark web ecosystem presents.

IV. CRAWLING PATHS LEARNING

Investigating the Deep Web necessitates a multimodal strategy that goes beyond the traditional techniques of surface-level web crawling. Deep Web crawlers are frequently tasked with traversing layers of content to uncover subsets of information pertinent to particular users or processes, as opposed to the straightforward tasks of traditional crawlers, which consist of filling out forms and retrieving result pages.

Central to deep web crawling methodologies are crawling paths, which comprise sequences of pages that are crucial for accessing the intended content. These pathways cover not just page navigation but all of the complex interactions that are needed at every stage, such as form submissions, user event simulations, and link traversals. Although certain approaches integrate form interactions directly into crawling paths, others trigger the procedure from result pages acquired after form submissions. Because Deep Web information is so diverse, different pages have different levels of relevance, which has led to the creation of different crawling strategies.

The most basic method is represented by blind crawlers, which gather as many pages as they can from a website. They commence their expedition from a seed page and methodically follow each link it furnishes until every page that is accessible has been downloaded. Thus, all of the URLs that are reachable within the website's domain are included in the crawling paths that they take. Conversely, focused crawlers take a more discriminating approach, concentrating on links that are likely to lead to pages with relevant content related to a given topic. Such crawlers utilise advanced classification methods to evaluate the pertinence of downloaded pages and follow the links that are considered pertinent.

Ad-hoc crawlers, in turn, set topical alignment aside in favour of the unique requirements and preferences of each user. They adjust the crawling experience to each user's specific preferences and needs by carefully selecting links that connect to pages that are judged relevant.

Focused and ad-hoc crawlers require more complex path-generating techniques than blind crawlers, which usually use simple algorithms to queue URLs for fast traversal. Crawlers may also be classified as recorders, supervised learners, or unsupervised learners, according to the employed methodology and the necessary level of oversight. These classifications reflect the distinct strategies the crawlers use for path generation and content discovery.

Essentially, a deep comprehension of crawling patterns and the nuances of content retrieval is necessary to navigate the complexity of the Deep Web. By employing sophisticated crawling techniques that are customized to achieve particular goals, crawlers can discover concealed treasures of information that are beyond the reach of conventional web crawlers.
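The contrast between blind and focused crawling can be made concrete with a short sketch. The Python example below implements only the blind strategy described above: starting from a seed page, it follows every link inside the seed's domain until no unvisited pages remain or a page budget is reached. The seed URL and page budget are placeholders; a focused variant would replace the FIFO queue with the scored frontier sketched earlier.

    # Illustrative blind crawler: breadth-first traversal of one website's domain.
    # The seed URL and the page budget are placeholders for the example.
    from collections import deque
    from urllib.parse import urljoin, urlparse
    import re
    import requests

    HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)

    def blind_crawl(seed, max_pages=50):
        """Download every reachable page within the seed's domain, in FIFO order."""
        domain = urlparse(seed).netloc
        queue, visited, pages = deque([seed]), {seed}, {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                response = requests.get(url, timeout=30)
            except requests.RequestException:
                continue                       # unreachable pages are skipped
            pages[url] = response.text         # store the result for later analysis
            for href in HREF.findall(response.text):
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == domain and absolute not in visited:
                    visited.add(absolute)
                    queue.append(absolute)     # blind strategy: queue every in-domain link
        return pages

    if __name__ == "__main__":
        results = blind_crawl("https://example.com/")   # placeholder seed
        print(len(results), "pages downloaded")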
V. SYSTEMATIC LITERATURE REVIEW

Planning the Systematic Literature Review (SLR) is the first phase, (1) planning. This includes developing strategies for data capture and dissemination, defining the research topic, setting criteria for study selection and quality assessment, and summarising the research background.

More in-depth work is done in the second phase, (2) conducting the literature review. It consists of four main tasks: (1) choosing studies, (2) evaluating research quality, (3) extracting data, and (4) synthesising data. The document's later sections go into great depth about each of these tasks.

The definition of the dissemination mechanisms, report formatting, and report evaluation are all part of the
third phase, (3) reporting. Given the nature of the document, which is intended to be peer-reviewed, this phase is essential. This guarantees that the research will be subjected to a thorough assessment and made available for public use.

A. RESEARCH QUESTIONS

After following the instructions for each activity, the following tangible results were obtained: research questions unique to the SLR (i.e., this section of the article), which differ from the research questions for the full article:

1) What types of crawlers and/or scrapers have been utilised to gather data from the Tor network in scientific publications?

2) How are traffic routes constructed by crawlers and/or scrapers that gather information from the Tor network?

3) Which frameworks and programming languages are most frequently used to create crawlers and/or scrapers on the Tor network?

B. STUDY SELECTION STRATEGY

The search query TITLE-ABS-KEY ((dark AND web AND crawler) OR (dark AND web AND scraper) OR (tor AND crawler) OR (tor AND scraper)) AND LANGUAGE (english) yielded 59 items in total that were retrieved from the database. In this case, "TITLE-ABS-KEY" refers to searches that concentrate on the metadata elements included in the articles' titles, abstracts, and keywords. The "LANGUAGE" filter signifies that only English-language items were retrieved; results in other languages were filtered out.

Following identification, the articles were downloaded and locally saved with all of their metadata (authors, DOI, title, abstract, and keywords) intact. Keeping these items locally instead of depending on web services made data processing simpler. Publication of the source code of the script used to find and choose these articles also contributes to the transparency of the study approach. This procedure not only helps to replicate the study but also guarantees a better comprehension of the search and selection procedure.

C. INCLUSION AND EXCLUSION CRITERIA

It is imperative to define precise criteria for the inclusion and exclusion of studies in order to discover research pertinent to the original questions, as suggested by Kitchenham [45]. Only English-language articles that were relevant to the predetermined search criteria were included in the initial database search phase. Further inclusion and exclusion criteria for the papers in this systematic literature review, which mainly focuses on content crawling and scraping on the Tor network, are described in this section. It was agreed, therefore, that articles that did not specifically address the Tor network would be included if they alluded to or explored possible effects on the network. This strategy was used to reduce the possibility of leaving out research that was only slightly or somewhat relevant.

Criteria for inclusion: articles that concentrate on information collection, monitoring, crawling, and scraping on the Tor network; studies in which data was gathered from the Tor network using a crawler or scraper.

Criteria for exclusion: articles that discuss crawling and scraping without mentioning the Tor network; research that does not involve downloading content from remote Tor servers; publications that have not undergone peer review, such as articles published outside of journals, conference proceedings, or workshop proceedings.

A manual evaluation was possible because the search query produced a manageable number of articles. Initially, the abstracts of all 59 papers were reviewed and either included or excluded based on the aforementioned criteria. After comparing the titles, it was discovered that papers [20] and [18] had the same title, the former being a fourteen-page journal article and the latter a shorter conference proceeding of nine pages. Since the conference piece was considered a truncated version of the fuller journal publication, it was disregarded.

Similarly, the journal article was chosen over the conference version for articles [39] and [40], which were conference and journal papers, respectively, because of its greater length and level of detail. This preference for journal articles also applies to the following pair: the conference article [72] was excluded in favour of the more thorough journal publication because it had the same title and DOI as the journal article [72].

There were only 56 papers left in the systematic literature review for additional analysis after these duplicate entries between conference and journal articles were eliminated. The rigorous selection procedure makes sure that more in-depth, peer-reviewed sources are prioritised, which improves the calibre and dependability of the review's conclusions.
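The published selection script is not reproduced here, but the screening logic described above can be illustrated with a small sketch. It assumes a hypothetical CSV export of the Scopus metadata named scopus_export.csv with "title", "abstract", "type", and "pages" columns (all names are assumptions); it keeps records matching the inclusion cues and, for duplicate titles, prefers the longer journal version over the conference version.

    # Illustrative screening sketch, not the published selection script.
    # Assumes a hypothetical CSV export ("scopus_export.csv") with the columns
    # "title", "abstract", "type" and "pages".
    import csv

    REQUIRED_TERMS = ("tor", "dark web")          # inclusion cue: Tor / dark web focus
    CRAWL_TERMS = ("crawl", "scrap")              # must mention crawling or scraping

    def relevant(record):
        """Abstract-level screening against the inclusion/exclusion criteria."""
        text = (record["title"] + " " + record["abstract"]).lower()
        return any(t in text for t in REQUIRED_TERMS) and any(t in text for t in CRAWL_TERMS)

    def deduplicate(records):
        """For records sharing a title, keep the longer (journal) version."""
        best = {}
        for rec in records:
            key = rec["title"].strip().lower()
            if key not in best or int(rec["pages"]) > int(best[key]["pages"]):
                best[key] = rec
        return list(best.values())

    if __name__ == "__main__":
        with open("scopus_export.csv", newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        selected = deduplicate([row for row in rows if relevant(row)])
        print(len(selected), "articles kept for full-text review")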
D. DATA EXTRACTION STRATEGY

Forty-one documents were found to be appropriate for additional analysis following the quality evaluation. A metadata link pointing to the complete text was supplied with every article that was downloaded from the Scopus database. Each of these papers was downloaded separately in order to carefully extract the data.

Table 2 provides a complete list of the selected relevant articles. For each article, this table displays the matching ACN-based web crawler or scraper along with the research instrument that was used. Furthermore, a link to the crawler/scraper's open-source code is supplied where it is available, which improves transparency and permits the study to be replicated.

However, seven publications were deemed irrelevant during the data extraction stage and were thus removed from the review. Among the exclusions were articles about crawling that were not expressly about the dark web.
VI. CHALLENGES
Web crawling, although seemingly uncomplicated in concept, presents an array of intricacies and difficulties that go beyond its fundamental principle. The overall task appears straightforward: it begins with pre-specified seed URLs, then iteratively crawls through the network of links, extracting embedded hyperlinks, downloading every page under the specified addresses, and so on. In practice, however, there are many challenges in the way of this operational execution. In addition to those intrinsic to the vastness and dynamism of the internet as a whole, the dark web environment possesses particular qualities and subtleties that further compound these difficulties.
A. SCALABILITY
One of the most significant problems with web crawling
efforts is scalability. Attaining optimal productivity in crawl-
ing operations is significantly challenging due to the web’s
immense scale and nonstop evolution. The web is growing
at an exponential rate, and new ways of crawling are needed
to keep up with this growth. One potential solution to this
problem is distributed crawling, which involves running the
crawler over several devices. Distributed crawling expands
coverage of the web landscape by increasing efficiency and
scalability through the partitioning of the URL space and
allocation of specific subsets of URLs to individual devices.
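A common way to realize the partitioning described above is to hash each URL's host and assign it to one of the crawler devices, so that every machine owns a disjoint subset of the URL space. The Python sketch below illustrates this with an assumed pool of three workers; the worker count and URLs are placeholders.

    # Illustrative URL-space partitioning for distributed crawling.
    # Hashing the host keeps all pages of one site on the same worker,
    # which also makes per-site politeness limits easier to enforce.
    import hashlib
    from urllib.parse import urlparse

    NUM_WORKERS = 3   # assumed size of the crawler pool

    def assign_worker(url, num_workers=NUM_WORKERS):
        """Map a URL to a worker id by hashing its host name."""
        host = urlparse(url).netloc.encode("utf-8")
        digest = hashlib.sha256(host).hexdigest()
        return int(digest, 16) % num_workers

    if __name__ == "__main__":
        for url in ("https://example.com/a", "https://example.com/b", "https://example.org/"):
            print(assign_worker(url), url)   # same host -> same worker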
REFERENCES

[1] R. Raman, "Darkweb research: Past, present, and future trends and mapping to sustainable development goals."
[2] M. Hatta, "Deep Web, Dark Web, Dark Net: A Taxonomy of 'Hidden' Internet."
[3] P. Saxena, "The Darknet: An Enormous Black Box of Cyberspace."
[4] A. Gupta, "The Dark Web Phenomenon: A Review and Research Agenda."
[5] F. Barr-Smith and J. Wright, "Phishing With A Darknet: Imitation of Onion Services," in 2020 APWG Symposium on Electronic Crime Research (eCrime), IEEE, Nov. 2020.