Dark Web Crawler Research Paper
Abstract— Due to the widespread use of powerful encryption algorithms and advanced anonymity routing, the field of cybercrime investigation has changed greatly, posing difficult obstacles for law enforcement agencies (LEAs). Consequently, LEAs are increasingly relying on unencrypted web information on anonymous communication networks (ACNs) as potential sources of leads and evidence for their investigations. LEAs have access to a significant tool for gathering and storing potentially important data for investigative purposes: automated web content harvesting from servers. Although web crawling has been studied since the early days of the internet, relatively little research has been done on web crawling on the "dark web" or ACNs such as IPFS, Freenet, Tor, I2P, and others.

This work offers a thorough systematic literature review (SLR) with the goal of investigating the characteristics and prevalence of dark web crawlers. After removing irrelevant entries, a refined set of 34 peer-reviewed publications about crawling and the dark web remained from an original pool of 58 articles. According to the review, most dark web crawlers are written in Python and frequently use Selenium or Scrapy as their main web scraping libraries.

The lessons learned from the SLR were applied to the creation of an advanced Tor-based web crawling model that was incorporated into an already-existing software toolbox designed for ACN-focused research. A series of thorough experiments was then conducted to analyze the model's performance and show that it was effective at extracting web content from both conventional and dark web sources. This work provides more than just a review; it advances our knowledge of ACN-based web crawlers and provides a reliable model for digital forensics applications, including the crawling and scraping of both clear and dark web domains. The study also emphasizes the important ramifications of retrieving and archiving content from the dark web, emphasizing how crucial it is for generating leads for investigations and offering crucial supporting evidence.

To sum up, this study highlights how important it is to keep researching dark web crawling techniques and how they might be used to improve cybercrime investigations. It also highlights promising directions for future study in this quickly developing sector, emphasizing how crucial it is to use cutting-edge technologies to effectively fight cybercrime in a digital environment that is becoming more complicated.

I. INTRODUCTION

Due to the high level of secrecy and restricted traceability provided by sophisticated encryption and anonymity protocols, cybercrime investigations on the Internet are becoming more and more difficult, especially inside elusive dark networks. The extensive protection of data traveling across these networks presents a considerable challenge to law enforcement agencies (LEAs) in their efforts to obtain evidence, requiring a substantial investment of time, labor, experience, and technology.

There are several software programs that allow access to the roughly six dark networks that are currently operational. Modern encryption and network traffic routing algorithms that leave few traces are among the features these products have in common, notwithstanding their differences. Because there are few traffic traces and it is not feasible to decrypt data in these networks, LEAs have to look for evidence in other ways. Tor, the biggest and most well-known anonymous communication network (ACN), is made up of a network of computers, some of which are web servers that are part of the so-called "dark web." Other ACNs, such as I2P, Freenet, IPFS, and Lokinet, also have servers that make up dark webs unique to their networks. For individuals residing in non-democratic nations, journalists wanting complete anonymity, and whistleblowers, Tor has become an indispensable resource. Tor's anonymity, however, is indiscriminate, helping both criminals and whistleblowers and creating a challenging environment for digital policing. Using current encryption methods, Tor's network encrypts data in many layers between servers, using a different key for each tier. With more than 8,000 servers, or relays, around the world, Tor distributes transmission pathways among numerous relays to create encrypted layers that resemble an onion and protect privacy. Considering the shortcomings of network traffic analysis and decryption inside ACNs, online content collection shows itself to be a useful workaround. One effective method of extracting unencrypted data from Tor websites without requiring a lot of manual labor is web crawling, also known as automated online content collection. Web crawling is widely utilized on the open web for commercial and archival purposes, but it has also been important in criminal investigations, with screenshots of illegal websites frequently serving as proof. The fleeting nature of Tor servers emphasizes how crucial it is to consistently record and store online information in order to preserve data integrity and ensure legal admissibility. Even though there are many different web crawler programs available, ACNs have not received as much attention in research as the clear web.

Because there is a dearth of research on dark web crawlers, it is critical to investigate this area in order to comprehend the particulars of dark web crawling in ACNs. This knowledge can help with the creation of useful tools for practitioners
and researchers who crawl webpages on ACNs.

This work not only adds to the body of knowledge but also describes the development and assessment of a Tor-based crawler, utilizing the knowledge gained from the literature review to improve an already-existing dark web research toolbox.

Within the scholarly community, the investigation of dark web crawlers is still a largely unexplored field. Through further exploration of this topic, and by defining the distinct features of dark web crawlers, particularly in light of the peculiarities of anonymous communication networks (ACNs) in contrast to the surface web, a thorough synopsis of the state of dark web crawler technology can be formulated. The purpose of this analysis is to improve knowledge about the design and workings of dark web crawlers. For academics and professionals working on the creation and implementation of efficient instruments for locating and examining content on the dark web, these insights may prove to be quite beneficial. In addition, this work presents and evaluates a Tor-based crawler, demonstrating its usefulness. The architecture of this crawler incorporates knowledge from the preceding literature review.

II. LITERATURE SURVEY

A. DISPOSITION

The work, which consists of two main research contributions, a systematic literature analysis and an experiment-based web crawler implementation, is organized into seven chapters and a bibliography. The foundational overview is given in the first chapter, which also explores the nuances of web crawling, website content, and anonymous communication networks, highlighting their importance in digital investigations. Chapter two expands on the scientific foundations that give rise to the research challenge and the development of the research questions, building on the introduction.

B. THE WORLD WIDE WEB

Before HTTP became the standard protocol for transferring information, other internet protocols, such as Gopher, fought for supremacy in the early 1990s, when the World Wide Web was beginning to take shape. The World Wide Web was widely adopted by 1994–1995 thanks to the graphical features and open nature of HTML, which finally overtook Gopher, an earlier text-based protocol centered on network file exchange. Running on top of the Internet Protocol (IP) and the Transmission Control Protocol (TCP), HTTP evolved into the industry standard for data transmission, including photos, videos, and HTML pages. The fundamental process of fetching an HTML page via HTTP has remained unchanged since its inception, with a client (typically a web browser) sending a GET request to a web server and receiving a corresponding response. A successful retrieval returns HTTP status code 200, while a failed attempt yields an error code such as 404, in line with the HTTP standard, which defines various other response codes. Despite revisions from HTTP versions 0.9 to 1.1, 2.0, and the latest HTTP/3, backward compatibility ensures uniformity across requests and responses, maintaining consistency with HTTP version 1.0. Essentially, web crawling automates the procedure of sending GET requests to websites, following embedded URLs, and saving the results that are obtained. Web crawlers can be standalone software entities or browser-based programs designed to communicate over HTTP. These days, a wide range of HTTP communication libraries for various programming languages, such as Lisp, Go, and Haskell, make it easier to create HTTP-based clients and servers, which speeds up the process of developing web crawling programs.
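To make the request and response cycle described above concrete, the short Python sketch below sends a single GET request, checks the returned status code, and collects the embedded links that a crawler would queue for later visits. It is a minimal illustration rather than one of the crawlers reviewed later; the target URL is a placeholder and the third-party requests library is assumed to be available.

    # Minimal sketch of the GET/parse/queue cycle described above (illustrative only).
    import requests                      # common third-party HTTP client
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect href values from <a> tags while parsing an HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch_links(url):
        """Send one GET request; return the status code and the absolute embedded URLs."""
        response = requests.get(url, timeout=30)
        parser = LinkCollector()
        parser.feed(response.text)
        return response.status_code, [urljoin(url, href) for href in parser.links]

    if __name__ == "__main__":
        status, links = fetch_links("https://example.com/")   # placeholder URL
        print(status, len(links), "embedded URLs found")

A full crawler simply repeats this step, feeding the collected URLs back into a queue and saving each response to storage.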
C. An Adaptive Crawler

In "An Adaptive Crawler for Locating Hidden Web Entry Points," Luciano Barbosa and Juliana Freire show that adaptive crawling strategies can be exceptionally efficient at locating the entry points of hidden web sources. These techniques keep the retrieved pages focused on the topic by giving priority to the links that are most relevant to the subject. This method maximises the application of learned information, allowing the identification of links that display hitherto unidentified patterns. As a result, the approach shows resilience and the capacity to correct for biases introduced throughout the learning process.

Mangesh Manke, Kamlesh Kumar Singh, Vinay Tak, and Amit Kharade's research article presents an advanced integrated crawling system designed specifically for exploring the deep web. The researchers of this extensive investigation introduce a novel adaptive crawler that utilizes offline and online learning mechanisms to train link classifiers. As a result, the crawler effectively gathers hidden web entry points.
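As a rough illustration of the idea behind such link classifiers (and not the actual Barbosa and Freire or Manke et al. implementations), the following Python sketch scores candidate links with a toy keyword-based classifier and keeps them in a priority frontier so that the most promising link is always crawled next. The keyword list and scoring function are invented for the example.

    # Illustrative sketch of a focused frontier driven by a link classifier.
    # The keyword scorer stands in for the learned classifiers discussed above.
    import heapq

    RELEVANT_TERMS = ("search", "query", "database", "form")   # assumed topic cues

    def score_link(url, anchor_text):
        """Toy link classifier: higher score = more likely to lead to an entry point."""
        text = (url + " " + anchor_text).lower()
        return sum(term in text for term in RELEVANT_TERMS)

    class FocusedFrontier:
        """Priority queue that always yields the most promising unvisited link."""
        def __init__(self):
            self._heap = []
            self._seen = set()
        def push(self, url, anchor_text=""):
            if url not in self._seen:
                self._seen.add(url)
                # heapq is a min-heap, so negate the score for best-first order.
                heapq.heappush(self._heap, (-score_link(url, anchor_text), url))
        def pop(self):
            return heapq.heappop(self._heap)[1]
        def __bool__(self):
            return bool(self._heap)

    frontier = FocusedFrontier()
    frontier.push("https://example.com/advanced-search", "search the database")
    frontier.push("https://example.com/about", "about us")
    print(frontier.pop())   # the search-related link is visited first

In the adaptive systems described above, the hand-written scorer would be replaced by a classifier trained offline and updated online as new pages are retrieved.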
D. WEBSITE ACQUISITION TOOLS

Both open-source and closed-source tools are available for forensic website acquisition; they are designed to archive and preserve web material in line with forensic science guidelines. Although some of these tools can be used for basic web crawling, and some can even be used in dark web contexts, their main purpose is not to serve as fully fledged web crawlers.

One notable feature of OSIRT, an intuitive web browser designed specifically for investigators, is its ability to support both ordinary and dark websites (such as Tor). OSIRT is widely used by law enforcement agencies in the United Kingdom. It helps to preserve the integrity of evidence in investigative processes by providing functions such as video recording, screenshot capture, and audit log generation.

Police departments throughout the world use FAW, a proprietary internet forensic collection tool that looks like a web browser, to collect web content from popular websites, social media platforms, and the dark web (Tor). Notably, FAW's collection of forensic tools includes the ability to crawl websites.

Another proprietary tool, called Hunchly, is available as an add-on for web browsers and is intended to suit the demanding needs of law enforcement personnel conducting
investigations. It supports the acquisition of content from both the clear and dark web (Tor).

E. WEB CRAWLERS

A number of privacy risks, such as the possibility of request leaks and other serious privacy breaches, might result from the mishandling or incorrect configuration of the Tor network. When the Tor network is not configured appropriately, users are exposed to grave privacy threats, creating opportunities for possible leaks.

Request leaks pose a serious risk since they can unintentionally reveal private information about a user's online activities outside of the Tor network. Misconfigured settings or bugs in the Tor client or connected applications may be the cause of this leakage. The main goal of Tor is to anonymize users and shield their identities and activities from monitoring or interception; request leaks undermine this goal.

Inadequate Tor network settings can also put users at risk of other privacy issues:

IP address exposure: Users' anonymity may be jeopardized if Tor is not configured correctly and their actual IP address is made public. This exposure may be the result of Tor Browser leaks or incorrectly set up proxies.

Traffic analysis: Insufficient setup can expose users to traffic analysis, the process by which attackers track and examine encrypted data flows over the network. The identification of users or their online activities may result from this analysis.

DNS leaks: When DNS requests are made outside of the Tor network, they reveal the websites that are being browsed. This might happen as a result of improper configuration; inadequate use of Tor-specific DNS resolution techniques or incorrectly configured settings can result in DNS leaks.

Exit node risks: Incorrect setups may affect the security and choice of Tor exit nodes. Users' traffic may be intercepted or monitored by malevolent actors if insecure or compromised exit nodes are used.

Following recommended methods when configuring Tor is crucial to reducing these dangers and guaranteeing strong privacy protection within the network. These best practices include: using the most recent version of the official Tor software; setting up programs to use Tor proxies correctly; enabling the Tor Browser's recommended privacy settings; avoiding third-party plugins and custom changes that can jeopardize anonymity; and analyzing and adjusting configuration parameters on a regular basis in response to security alerts and Tor network updates. Users can reduce the risk of request leaks and other privacy vulnerabilities by adhering to these instructions and keeping a close eye on Tor network parameters, protecting the integrity and efficacy of Tor's anonymization capabilities.
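One practical consequence of the leak risks described above is that a crawler must route both its HTTP traffic and its DNS resolution through Tor. The Python sketch below is a minimal, illustrative configuration that assumes a local Tor client listening on the default SOCKS port 9050 and the requests library with SOCKS support installed; the socks5h scheme makes the proxy resolve hostnames, which keeps DNS lookups inside Tor and avoids local DNS leaks.

    # Illustrative sketch: route a request through a local Tor SOCKS proxy.
    # Assumes a Tor client is running locally with its SOCKS listener on port 9050
    # and that requests[socks] (PySocks) is installed.
    import requests

    TOR_PROXIES = {
        # "socks5h" (not "socks5") asks the proxy to resolve hostnames,
        # so DNS lookups also go through Tor and do not leak locally.
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    def fetch_over_tor(url):
        """Fetch a clear-web or .onion URL through the Tor SOCKS proxy."""
        response = requests.get(url, proxies=TOR_PROXIES, timeout=60)
        return response.status_code, response.text

    if __name__ == "__main__":
        # check.torproject.org reports whether the request arrived via Tor.
        status, body = fetch_over_tor("https://check.torproject.org/")
        print(status, len(body), "bytes retrieved")

Onion addresses are fetched the same way; the only difference is that the hostname can only be resolved by the Tor proxy itself.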
F. Google's Deep Web Crawler

Jayant Madhavan, David Ko, Ju-wei Chiu, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy worked together on a project that would transform how deep-web material is found and used. They encountered and overcame the numerous obstacles that come with uncovering and using the deep web's enormous reservoirs and, as a result, created a clever solution. This technology is a ground-breaking advancement made possible by a sophisticated and flexible algorithm.

Fundamentally, this algorithm drives the effective navigation of the complex space of possible input combinations. It navigates the complex terrain of the deep web with systematic accuracy and forethought. Through methodical analysis and deliberate selection, the system finds and isolates the input combinations that are the key to unlocking valuable URLs. These URLs are then thoroughly examined and made ready for inclusion in the web search index.

The work by Madhavan, Ko, Chiu, Ganapathy, Rasmussen, and Halevy is evidence of human inventiveness and cooperation. Through their combined efforts, they have not only surmounted technical obstacles but also shed light on novel avenues for investigating and capitalizing on the concealed riches of the deep web. Their ground-breaking system is expected to have a significant and wide-ranging impact on the digital world as it develops and matures.
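The surfacing idea sketched above can be illustrated with a toy example; this is not Google's system or its actual algorithm. Candidate values are tried in a search form's input field, each combination is submitted as a GET request, and only the query URLs whose result pages appear informative are kept for indexing. The form URL, field name, candidate values, and quality heuristic are all invented for the illustration.

    # Toy illustration of deep-web surfacing: probe a search form with candidate
    # inputs and keep the query URLs that return informative result pages.
    # This is NOT the system described above; names and thresholds are invented.
    from urllib.parse import urlencode
    import requests

    FORM_ACTION = "https://example.com/search"   # placeholder form action URL
    FIELD_NAME = "q"                             # placeholder input field name
    CANDIDATE_VALUES = ["books", "music", "tools"]

    def looks_informative(html):
        """Crude stand-in for a result-quality check: long pages with many links."""
        return len(html) > 5000 and html.count("<a ") > 10

    def surface_urls():
        """Return the probe URLs whose result pages pass the quality heuristic."""
        kept = []
        for value in CANDIDATE_VALUES:
            url = FORM_ACTION + "?" + urlencode({FIELD_NAME: value})
            response = requests.get(url, timeout=30)
            if response.status_code == 200 and looks_informative(response.text):
                kept.append(url)     # candidate for inclusion in a search index
        return kept

    if __name__ == "__main__":
        for url in surface_urls():
            print("keep:", url)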
G. THE TOR WEBSITE AND NETWORK

Anonymous communication networks (ACNs), or dark networks, use the same transport protocols (TCP/IP) as the clear web, but they add their own anonymizing protocols. The clear web uses TCP/IP to send HTTP requests directly; Tor, for example, uses its onion routing (OR) protocol, and I2P uses garlic routing (GR), both of which are wrapped in TCP/IP. Within ACNs, these protocols make it possible for HTTP to be transmitted and used for serving web pages.

The idea of Onion Services, formerly known as Hidden Services, is essential to the Tor network. These services host websites that are only reachable through known URLs and are a component of the "dark web." In contrast to the ordinary internet, it is not possible to scan a range of IP addresses to find Onion Services on the Tor network. Furthermore, while theoretically feasible, trying to guess Onion Service URLs pseudorandomly is not a viable strategy. Tor websites, also called "onionsites," are identical to regular web pages in appearance and structure. They are made up of text, graphics, HTML, CSS, JavaScript, and other elements that are delivered over
HTTP. Onionsites' material, however, usually reflects the special qualities of existing on an anonymous communication network. Onionsites put anonymity, privacy, and secrecy above usability and speed. They also frequently refrain from using JavaScript because of the possibility that it could reveal a Tor user's true identity. Onionsites stand out from their clear web counterparts thanks to their focus on security and anonymity, which influences their design and content decisions within the Tor network.
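Because Onion Services cannot be enumerated by scanning IP ranges, dark web crawlers typically discover new onionsites by harvesting .onion addresses from pages they have already fetched. The Python sketch below shows one simple, illustrative way to do this with a regular expression for version 3 onion addresses (56 base32 characters followed by ".onion"); the sample text is invented.

    # Illustrative sketch: harvest v3 onion addresses from already-fetched text.
    # v3 onion addresses consist of 56 base32 characters (a-z, 2-7) plus ".onion".
    import re

    ONION_V3 = re.compile(r"\b[a-z2-7]{56}\.onion\b")

    def extract_onion_addresses(text):
        """Return the unique v3 .onion addresses found in a page's text."""
        return sorted(set(ONION_V3.findall(text.lower())))

    if __name__ == "__main__":
        sample = "Visit http://" + "a" * 56 + ".onion/index.html for more."  # invented
        print(extract_onion_addresses(sample))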
III. RESEARCH MOTIVATION

Although there is a wealth of literature on clear web crawlers, there is not a single thorough review that concentrates on dark web crawlers. Unlike the ordinary Internet and the conventional clear web, anonymous communication networks function under different protocols and setups, requiring programming and configuration specific to their own features.

As was noted in the preceding sections, both public and commercial organizations have created a variety of dark web crawlers, but a thorough scientific analysis of these instruments is noticeably absent. Conducting a thorough assessment of the dark web crawlers that are now in use and have been reported in scholarly publications is one of the main goals of this research project. This evaluation aims to improve our knowledge of the changing field of dark web crawling technologies by offering insight into the strengths and weaknesses of different crawlers. Another important goal is implementing the most popular dark web crawler found in the academic literature and customizing it to fit neatly into an already-existing toolset that currently lacks a reliable and all-inclusive crawler solution. The performance evaluation that follows will provide important information about this integrated crawler's effectiveness and suitability for use in investigative and analytical settings.

By tackling these goals, the study intends to close important information gaps about dark web crawling techniques and make a significant contribution to the creation and improvement of instruments for examining and navigating the complex world of anonymous communication networks. This thorough method emphasizes how important it is to have reliable and specialized crawling strategies that are suited to the particular opportunities and problems that the dark web ecosystem presents.

IV. CRAWLING PATHS LEARNING

Investigating the Deep Web necessitates a multimodal strategy that goes beyond the traditional techniques of surface-level web crawling. Deep Web crawlers are frequently tasked with traversing layers of content to uncover subsets of information pertinent to particular users or processes, as opposed to the straightforward tasks of traditional crawlers, which consist of filling out forms and retrieving result pages.

Central to deep web crawling methodologies are crawling paths, which comprise sequences of pages that are crucial for accessing the intended content. These pathways cover not just page navigation but all of the complex interactions that are needed at every stage, such as form submissions, user event simulations, and link traversals. Although certain approaches integrate form interactions directly into crawling paths, others trigger the procedure from result pages acquired after form submissions. Because Deep Web information is so diverse, different pages have different levels of relevance, which has led to the creation of different crawling strategies.

The most basic method is represented by blind crawlers, which gather as many pages as they can from a website. They commence their expedition from a seed page and methodically follow each link it furnishes until every page that is accessible has been downloaded. Thus, all of the URLs that are reachable within the website's domain are included in the crawling paths that they take. Conversely, focused crawlers take a more discriminating approach, concentrating on links that are likely to lead to pages with relevant content related to a given topic. Such crawlers utilise advanced classification methods to evaluate the pertinence of downloaded pages and follow the links that are considered pertinent.

Ad-hoc crawlers, in turn, set topical alignment aside in favour of the unique requirements and preferences of each user. They adjust the crawling experience to each user's specific preferences and needs by carefully selecting links that connect to pages that are judged relevant.

Focused and ad-hoc crawlers require more complex path-generating techniques than blind crawlers, which usually use simple algorithms to queue URLs for fast traversal. Crawlers may also be classified as recorders, supervised learners, or unsupervised learners, according to the employed methodology and the necessary level of oversight. These classifications reflect the distinct strategies the crawlers use for path generation and content discovery.

Essentially, a deep comprehension of crawling patterns and the nuances of content retrieval is necessary to navigate the complexity of the Deep Web. By employing sophisticated crawling techniques that are customized to achieve particular goals, crawlers can discover concealed treasures of information that are beyond the reach of conventional web crawlers.
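The contrast between blind and focused crawling can be made concrete with a short sketch. The Python example below implements only the blind strategy described above: starting from a seed page, it follows every link inside the seed's domain until no unvisited pages remain or a page budget is reached. The seed URL and page budget are placeholders; a focused variant would replace the FIFO queue with the scored frontier sketched earlier.

    # Illustrative blind crawler: breadth-first traversal of one website's domain.
    # The seed URL and the page budget are placeholders for the example.
    from collections import deque
    from urllib.parse import urljoin, urlparse
    import re
    import requests

    HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)

    def blind_crawl(seed, max_pages=50):
        """Download every reachable page within the seed's domain, in FIFO order."""
        domain = urlparse(seed).netloc
        queue, visited, pages = deque([seed]), {seed}, {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                response = requests.get(url, timeout=30)
            except requests.RequestException:
                continue                       # unreachable pages are skipped
            pages[url] = response.text         # store the result for later analysis
            for href in HREF.findall(response.text):
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == domain and absolute not in visited:
                    visited.add(absolute)
                    queue.append(absolute)     # blind strategy: queue every in-domain link
        return pages

    if __name__ == "__main__":
        results = blind_crawl("https://example.com/")   # placeholder seed
        print(len(results), "pages downloaded")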
V. SYSTEMATIC LITERATURE REVIEW

Planning the Systematic Literature Review (SLR) is the first phase, (1) planning. This includes developing strategies for data capture and dissemination, defining the research topic, setting criteria for study selection and quality assessment, and summarising the research background.

More in-depth work is done in the second phase, (2) conducting the literature review. It consists of four main tasks: (1) choosing studies, (2) evaluating research quality, (3) extracting data, and (4) synthesising data. The document's later sections go into great depth about each of these tasks.

The definition of the dissemination mechanisms, report formatting, and report evaluation are all part of the
third phase, (3) reporting. Given the nature of the document, which is intended to be peer-reviewed, this phase is essential. This guarantees that the research will be subjected to a thorough assessment and made available for public use.

A. RESEARCH QUESTIONS

After following the instructions for each activity, the following tangible results were obtained: research questions unique to the SLR (i.e., this section of the article), which differ from the research questions for the full article:

1) What types of crawlers and/or scrapers have been utilised to gather data from the Tor network in scientific publications?

2) How are traffic routes constructed by crawlers and/or scrapers that gather information from the Tor network?

3) Which frameworks and programming languages are most frequently used to create crawlers and/or scrapers on the Tor network?

B. STUDY SELECTION STRATEGY

The search query TITLE-ABS-KEY ((dark AND web AND crawler) OR (dark AND web AND scraper) OR (tor AND crawler) OR (tor AND scraper)) AND LANGUAGE (english) yielded 59 items in total that were retrieved from the database. In this case, "TITLE-ABS-KEY" refers to searches that concentrate on the metadata elements included in the articles' titles, abstracts, and keywords. The "LANGUAGE" filter signifies that only English-language items were retrieved; results in other languages were filtered out.

Following identification, the articles were downloaded and locally saved with all of their metadata (authors, DOI, title, abstract, and keywords) intact. Keeping these items locally instead of depending on web services made data processing simpler. Publication of the source code of the script used to find and choose these articles also contributes to the transparency of the study approach. This procedure not only helps to replicate the study but also guarantees a better comprehension of the search and selection procedure.

C. INCLUSION AND EXCLUSION CRITERIA

It is imperative to define precise criteria for the inclusion and exclusion of studies in order to discover research pertinent to the original questions, as suggested by Kitchenham [45]. Only English-language articles that were relevant to the predetermined search criteria were included in the initial database search phase. Further inclusion and exclusion criteria for the papers in this systematic literature review, which mainly focuses on content crawling and scraping on the Tor network, are described in this section. It was agreed, therefore, that articles that did not specifically address the Tor network would be included if they alluded to or explored possible effects on the network. This strategy was used to reduce the possibility of leaving out research that was only slightly or somewhat relevant.

Criteria for inclusion: articles that concentrate on information collection, monitoring, crawling, and scraping on the Tor network; studies in which data was gathered from the Tor network using a crawler or scraper.

Criteria for exclusion: articles that discuss crawling and scraping without mentioning the Tor network; research that does not involve downloading content from remote Tor servers; publications that have not undergone peer review, such as articles published outside of journals, conference proceedings, or workshop proceedings.

A manual evaluation was possible because the search query produced a manageable number of articles. Initially, the abstracts of all 59 papers were reviewed and either included or excluded based on the aforementioned criteria. After comparing the titles, it was discovered that papers [20] and [18] had the same title, the former being a fourteen-page journal article and the latter a shorter conference proceeding of nine pages. Since the conference piece was considered a truncated version of the fuller journal publication, it was disregarded.

Similarly, the journal article was chosen over the conference version for articles [39] and [40], which were conference and journal papers, respectively, because of its greater length and level of detail. This preference for journal articles also applies to the following pair: the conference article [72] was excluded in favour of the more thorough journal publication because it had the same title and DOI as the journal article [72].

There were only 56 papers left in the systematic literature review for additional analysis after these duplicate entries between conference and journal articles were eliminated. The rigorous selection procedure makes sure that more in-depth, peer-reviewed sources are prioritised, which improves the calibre and dependability of the review's conclusions.
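The published selection script is not reproduced here, but the screening logic described above can be illustrated with a small sketch. It assumes a hypothetical CSV export of the Scopus metadata named scopus_export.csv with "title", "abstract", "type", and "pages" columns (all names are assumptions); it keeps records matching the inclusion cues and, for duplicate titles, prefers the longer journal version over the conference version.

    # Illustrative screening sketch, not the published selection script.
    # Assumes a hypothetical CSV export ("scopus_export.csv") with the columns
    # "title", "abstract", "type" and "pages".
    import csv

    REQUIRED_TERMS = ("tor", "dark web")          # inclusion cue: Tor / dark web focus
    CRAWL_TERMS = ("crawl", "scrap")              # must mention crawling or scraping

    def relevant(record):
        """Abstract-level screening against the inclusion/exclusion criteria."""
        text = (record["title"] + " " + record["abstract"]).lower()
        return any(t in text for t in REQUIRED_TERMS) and any(t in text for t in CRAWL_TERMS)

    def deduplicate(records):
        """For records sharing a title, keep the longer (journal) version."""
        best = {}
        for rec in records:
            key = rec["title"].strip().lower()
            if key not in best or int(rec["pages"]) > int(best[key]["pages"]):
                best[key] = rec
        return list(best.values())

    if __name__ == "__main__":
        with open("scopus_export.csv", newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        selected = deduplicate([row for row in rows if relevant(row)])
        print(len(selected), "articles kept for full-text review")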
D. DATA EXTRACTION STRATEGY

Forty-one documents were found to be appropriate for additional analysis following the quality evaluation. A metadata link pointing to the complete text was supplied with every article that was downloaded from the Scopus database. Each of these papers was downloaded separately in order to carefully extract the data.

Table 2 provides a complete list of the selected relevant articles. For each article, this table displays the matching ACN-based web crawler or scraper along with the research instrument that was used. Furthermore, a link to the crawler/scraper's open-source code is supplied where it is available, which improves transparency and permits the study to be replicated.

However, seven publications were deemed irrelevant during the data extraction stage and were thus removed from the review. Among the exclusions were articles about crawling that were not expressly about the dark web.
VI. CHALLENGES
Web crawling, although seemingly uncomplicated in concept, presents an array of intricacies and difficulties that go beyond its fundamental principle. The overall task appears straightforward: it begins with pre-specified seed URLs, then iteratively crawls through the network of links, extracting embedded hyperlinks, downloading every page under the specified addresses, and so on. In practice, however, there are many challenges in the way of this operational execution. In addition to those intrinsic to the vastness and dynamism of the internet as a whole, the dark web environment possesses particular qualities and subtleties that further compound these difficulties.
A. SCALABILITY
One of the most significant problems with web crawling
efforts is scalability. Attaining optimal productivity in crawl-
ing operations is significantly challenging due to the web’s
immense scale and nonstop evolution. The web is growing
at an exponential rate, and new ways of crawling are needed
to keep up with this growth. One potential solution to this
problem is distributed crawling, which involves running the
crawler over several devices. Distributed crawling expands
coverage of the web landscape by increasing efficiency and
scalability through the partitioning of the URL space and
allocation of specific subsets of URLs to individual devices.
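A common way to realize the partitioning described above is to hash each URL's host and assign it to one of the crawler devices, so that every machine owns a disjoint subset of the URL space. The Python sketch below illustrates this with an assumed pool of three workers; the worker count and URLs are placeholders.

    # Illustrative URL-space partitioning for distributed crawling.
    # Hashing the host keeps all pages of one site on the same worker,
    # which also makes per-site politeness limits easier to enforce.
    import hashlib
    from urllib.parse import urlparse

    NUM_WORKERS = 3   # assumed size of the crawler pool

    def assign_worker(url, num_workers=NUM_WORKERS):
        """Map a URL to a worker id by hashing its host name."""
        host = urlparse(url).netloc.encode("utf-8")
        digest = hashlib.sha256(host).hexdigest()
        return int(digest, 16) % num_workers

    if __name__ == "__main__":
        for url in ("https://example.com/a", "https://example.com/b", "https://example.org/"):
            print(assign_worker(url), url)   # same host -> same worker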
REFERENCES

[1] R. Raman, "Darkweb research: Past, present, and future trends and mapping to sustainable development goals."
[2] M. Hatta, "Deep Web, Dark Web, Dark Net: A Taxonomy of 'Hidden' Internet."
[3] P. Saxena, "The Darknet: An Enormous Black Box of Cyberspace."
[4] A. Gupta, "The Dark Web Phenomenon: A Review and Research Agenda."
[5] F. Barr-Smith and J. Wright, "Phishing With A Darknet: Imitation of Onion Services," in 2020 APWG Symposium on Electronic Crime Research (eCrime), IEEE, Nov. 2020.