
Towards Image-Based Dark Vendor Profiling


Session: Multi-modal Data Analysis for Security IWSPA ’20, March 18, 2020, New Orleans, LA, USA

Towards Image-Based Dark Vendor Profiling


An Analysis of Image Metadata and Image Hashing in Dark Web Marketplaces

Susan Jeziorowski (sjeziorow42@students.tntech.edu), Muhammad Ismail (mismail@tntech.edu), Ambareen Siraj (asiraj@tntech.edu)
Department of Computer Science, Tennessee Technological University, Cookeville, Tennessee

ABSTRACT
Anonymity networks, such as Tor, facilitate the hosting of hidden online marketplaces where dark vendors are able to anonymously trade paraphernalia such as drugs, weapons, and hacking services. Effective dark marketplace analysis and dark vendor profiling techniques support dark web investigations and help to identify and locate these perpetrators. Existing automated techniques are text-based, leaving non-textual artifacts, such as images, out of consideration. Though image data can further improve investigative analysis, there are two primary challenges associated with dark web image analysis: (a) ethical concerns over the presence of child exploitation imagery in illegal markets, and (b) the computational overhead needed to download, analyze, and store image content. In this research, we investigate and address the aforementioned challenges to enable dark marketplace image analysis. Namely, we examine image metadata and explore several image hashing techniques to represent image content, allowing us to collect image-based intelligence and identify reused images among dark marketplaces while preventing exposure to illegal content and decreasing computational overhead. Our study reveals that approximately 75% of dark marketplace listings include image data, indicating the importance of considering image content for investigative analysis. Additionally, 2% of considered images were found to contain metadata and approximately 50% of image hashes were repeated among marketplace listings, suggesting the presence of easily obtainable incriminating evidence and frequency of image reuse among dark vendors. Finally, through an image hash analysis, we demonstrate the effectiveness of using image hashing to identify similar images between dark marketplaces.

CCS CONCEPTS
• Applied computing → Evidence collection, storage and analysis.

KEYWORDS
dark web, Tor, dark marketplace, metadata, image hashing

ACM Reference Format:
Susan Jeziorowski, Muhammad Ismail, and Ambareen Siraj. 2020. Towards Image-Based Dark Vendor Profiling: An Analysis of Image Metadata and Image Hashing in Dark Web Marketplaces. In Sixth International Workshop on Security and Privacy Analytics (IWSPA ’20), March 18, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3375708.3380311

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
IWSPA ’20, March 18, 2020, New Orleans, LA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7115-5/20/3...$15.00
https://doi.org/10.1145/3375708.3380311

1 INTRODUCTION
Anonymity networks, such as Tor, The Onion Router, have grown increasingly popular among web users who want to conceal their online identity and activities. Though Tor and other popular anonymity networks promote our human right to privacy, they provide an avenue for criminals to conduct illegal activities online without fear of consequences. Specifically, Tor allows for the hosting of hidden online marketplaces where dark vendors are able to anonymously trade paraphernalia such as drugs, weapons, and hacking services. Due to Tor’s hidden service infrastructure, owners, vendors, and users of these marketplaces are difficult to identify and locate.

In typical web browsing sessions, various types of information about a user can be collected and stored including, but not limited to, location, searches conducted, browsed sites, social networks, banking data, email addresses, online behavior, and more. Thus, in typical surface web settings, law enforcement and intelligence agencies can leverage open source intelligence to investigate cyber criminals. However, utilization of web anonymity techniques by criminals makes such open source intelligence neither readily available nor easily obtainable. Exposing anonymous activity is integral in locating and prosecuting cyber criminals, making dark web analytic techniques essential for investigators.

An important capability of dark web analytics is Dark Vendor Profiling (DVP), i.e. the collection and analysis of a dark vendor’s characteristics for the purpose of establishing incriminating evidence against them and de-anonymizing their identity. Examples of such characteristics include vendor names, products they sell, countries they ship goods from, marketplaces they participate in, and alias accounts they control among others. The vast majority of the related work has been conducted using only text-based data scraped from dark forums and marketplaces. For example, many of the works rely on fingerprinting users’ writing styles for the purpose of identifying their aliases. Thus, existing studies are missing some important hidden service artifacts which could lead to more incriminating evidence, especially for dark marketplaces that are image-based, rather than text-based. However, considering images in profiling is challenging due to several factors, namely, the availability of complete datasets, computational overhead, and relevant ethical considerations. In this paper, we aim to address


these challenges to pave the road for efficient image-based DVP techniques.

In general, previous studies have avoided downloading image content due to ethical concerns, such as unintentionally accessing child pornography and other paraphernalia. In our study, we explore methods to represent image content without the need to view, download, or store images for DVP. Specifically, we examine the metadata of images and evaluate the effectiveness of storing hashes of dark marketplace images, rather than the image content. This avoids unintentional exposure to child obscenity and, at the same time, saves computational resources. The nature of hashing allows for large amounts of data to be represented in short character streams. Thus, it is a natural candidate for representing large amounts of image data without the need to store the actual image content. Classic hashing algorithms, however, are designed such that any small change in the data results in a major change in the hash, so visually similar images would not produce matching hashes. Consequently, this work examines several types of image hashing algorithms, which allow similar images to generate the same hash, even if resized, cropped, or filtered. In particular, our study analyzes images from 47 Tor-based dark marketplaces and compares four image hashing techniques to determine the most effective means to identify similar images within and between dark marketplaces, leading to DVP.

In addition, our study aims to support alias attribution, i.e. the correlation of several accounts belonging to the same vendor, using image-based data. According to Black Widow [12], a cyber intelligence gathering framework for dark web applications, there is substantial overlap between actors across dark forums, even if the forums are not based in the same language. Therefore, it is reasonable to suspect a similar overlap exists between dark marketplaces as well. Presumably, images have the potential to aid the identification of dark vendor aliases, since it is likely a vendor would use the same images to sell their product if they were participating on several dark marketplaces. Furthermore, images could assist in the development of incriminating evidence if they contain metadata concerning the image’s author, date and time of creation, location, camera make and model, and more.

Since very few reliable datasets exist for the purpose of dark web analysis research, developing complete, reusable data is undeniably one of the largest roadblocks. For the purpose of this study, we started with a publicly available Darknet Market (DNM) Archive consisting of data scraped from 89 dark marketplaces from 2013-2015 [6]. Despite the author’s warning of potential incompleteness of each crawl, the 1.6TB dataset has been used by several researchers in a variety of work. In an attempt to address the incompleteness issue, another goal of our study is to identify the most complete and useful set of dark marketplaces from the DNM Archive to support future dark marketplace analysis research.

In summary, this research directly supports intelligence and law enforcement communities’ investigative efforts in DVP by offering the following contributions: we demonstrate how more effective DVP can be achieved by including image data in the analysis of dark marketplaces; we determine the most effective image hashing technique for the identification of images repeatedly used by vendors, leading to dark vendor alias attribution; and we identify the most complete and useful set of dark marketplaces from the well-known DNM Archive.

The rest of this paper is organized as follows. Section 2 describes the anonymity techniques that allow cyber criminals to conduct illegal activity online. Section 3 explores related dark web analysis studies conducted and discusses their limitations. The methodology and experimental results of this work are discussed in Sections 4 and 5 respectively, followed by a discussion of limitations and future work in Section 6. Finally, Section 7 concludes this research.

2 BACKGROUND
In this section, Tor and its hidden service infrastructure are explained to illustrate the challenges in identifying owners, vendors, and users of dark marketplaces when anonymity techniques help to conceal user behavior, location, and other potentially incriminating data.

2.1 Tor: The Anonymous Network
Anonymity networks, which are known as overlay networks, use software solutions deployed on top of existing infrastructure, i.e. the Internet, to map virtual links between clients and services for the creation of new virtualized network infrastructures [10]. By far, the most prevalent anonymity network is The Onion Router, Tor, developed by The Tor Project, Inc. and initially released in 2002 [7]. Typical Tor connections are based on circuits of three relay nodes, namely an entry node, a middle node, and an exit node. When preparing a stream of data to be sent down a circuit, a user will encrypt their data three times, using each relay’s public key once. As the data is passed from the entry node, to the middle node, and to the exit node, a layer of encryption is removed at each hop. Finally, once the data has reached the circuit’s exit node, the data is fully decrypted and passed to the destination node. This scheme allows anonymity for the user not only by performing several rounds of encryption but also by ensuring each node is only aware of its neighboring nodes in a circuit, i.e. no node is aware of the overall end-to-end communication, ensuring clients and services are never directly connected.

Additionally, when the Tor browser is used, little to no remnants of internet activity can be forensically recovered from the device. Specifically, forensic analysis may verify whether or not the Tor browser was installed on a client computer, but not if and when it was used, nor what it was used for [13]. It is important to note that the connection between a user and an anonymity network is not hidden in Tor. However, the user’s location and the content of the communications within the network remain concealed. Most often, the user’s traffic is delivered on shared bandwidth, making it even more difficult to distinguish between individual connections.

2.2 Hidden Services
Tor’s most distinctive feature is its ability to provide hidden services, each of which is hosted with a .onion address [7]. Tor hidden services enable users to host anonymous, theoretically untraceable websites by implementing additional security measures. This feature enables dark marketplaces to be hosted and dark vendors to conduct criminal activity. Unlike typical Tor network connections, connections to hidden services involve additional interactions with Introduction Points and Rendezvous Points and result in six total relay nodes: one entry, one middle, and one exit node for both the client and the service [10, 11].


With double-sided anonymity, both users and service providers are able to mask their identities, thereby disabling either party’s ability to discover the other party’s true location. For the anonymous community, this is a very attractive web hosting solution. In fact, it is estimated that 70,000-100,000 hidden services are running on the Tor network at any time. This statistic (and many others like it) is reported by the Tor Project [5] and accessible on their metrics portal [2].

As previously mentioned, the location of a hidden service is theoretically untraceable. However, many studies have challenged the design of the hidden service system in an attempt to de-anonymize their user base and owners. These studies will be further discussed in Section 3. Likewise, there are many cases where, leveraging user error, law enforcement has been able to successfully locate a criminal hidden server and subsequently prosecute the owner of the service.

One of the most notable such cases was that of the Silk Road anonymous marketplace take-down executed by the FBI and Europol in 2013. The Silk Road was a multi-million U.S. dollar dark marketplace specializing in narcotics and controlled substances. Ultimately, the owner of this marketplace was identified by an FBI agent who was able to expose the owner’s email address and full name. Despite the successful take-down, newer versions of the Silk Road became available through other hidden service operators, as shown in Figure 1. In fact, dozens, if not hundreds, of similar dark marketplaces, such as the one in Figure 2, are available today, enabling the sale of paraphernalia and demonstrating the need for effective methodologies for targeting both dark marketplace administrators and vendors.

Figure 1: Screenshot of Silk Road marketplace listings in October 2019 from silkroad7rn2puhj.onion.

Figure 2: Screenshot of Quality King prescription pill listings in November 2019 from quality2ui4uooym.onion.

3 RELATED WORK
This section discusses various studies related to dark marketplace analysis and introduces the core concepts of each work along with how they relate to ours.

3.1 Marketplace Analysis
Most current research analyzes hidden services in general, not specifically dark marketplaces. In this study, we focus on authorship analysis, which is two-fold. One part is user attribution, where the goal of an investigator is to de-anonymize a given user based on their browsing behavior, traffic, or semantic styles. The other is alias attribution, where distinct online profiles are correlated between communities. This domain of work determines whether a user in one dark platform is that same user acting in another forum, marketplace, or other platform.

Although identities are protected in anonymous environments like Tor, users may leave traces of their textual identities in writing styles and through the nature of their participation in dark forums and marketplaces. Furthermore, they may be characterized by the number of forums, marketplaces, chatrooms and other users they are associated with, or by the time of day when they are active in anonymous network environments. To achieve user attribution, the work in [8] considers scenarios where a set of suspects is known, and the task is to determine which of the suspects are responsible for the web activities related to a cybercrime. Because it relies on knowledge of a set of potential suspects, this line of work does not attempt to attribute a user based on dark marketplace activity data.

When individuals operate under a number of different accounts, they are considered to have aliases. Attributing these aliases is an important aspect in performing authorship analysis because it may assist in the identification of a user and potential prosecution of a cybercriminal. In an attempt to achieve alias attribution, Spitters et al. describe a methodology in which they analyze user profiles based on topic-independent features, such as length of text and words, use of function words, interpunction and shallow syntactic patterns, along with time-based features and character n-grams


[15]. Ultimately, this analysis is used to compute similarities between pseudo users and attribute aliases. The researchers achieved sufficient precision for small sets of forums and users, but underachieved in terms of recall, reaching 25% and 45% for pseudo user set sizes of 177 and 25, respectively. Also, this model is not likely to be successful in marketplace settings where writing styles are less distinctive.

Another study [16] discusses the limitation of stylometric analysis in dark marketplaces. The success of text-based analysis, where users post rich and diverse text, is not possible in dark marketplaces since there is sparse text data available and the text usually reflects the product type instead of the vendor’s writing style. Thus, the researchers proposed to fingerprint vendors by their photographic style instead, building a classifier to identify vendors with multiple accounts across several dark marketplaces based on high-level features like object, scene, background, camera angle, and others. By considering image data, Wang et al. developed a highly accurate model to correlate vendor accounts. However, neither image metadata nor text-based data was considered in their classification, which might have improved their model’s accuracy. Furthermore, their work relied on the content of the images. This requires substantial computing resources, which may not always be available in practice, especially in smaller, local investigative organizations.

A recent study used crypto-currency data to analyze dark marketplaces for buyer and seller de-anonymization [14]. In this work, Sima considers a single dark marketplace, Valhalla, and de-anonymizes users by analyzing scraped data against publicly available bitcoin data. This work resulted in an application that associated dark identities with related bitcoin addresses and transactions. However, since this research was based on a single dark marketplace, it does not address the specific dark vendor profiling objective of alias attribution.

As discussed, the existing work in marketplace analysis differs from our work as it is based on either text-based data, image content, or crypto-currency data, whereas our work focuses on image metadata and image hashing for alias attribution across various dark marketplaces.

3.2 Image Analysis
Researchers in [9] used the same DNM Archive as used in this study to evaluate the prevalence of metadata by collecting all JPEG image files from the archive and extracting their metadata. In total, authors Lisker and Rose observed 223,471 unique photos from the entire DNM Archive, 229 of which contained GPS information, defining unique photos by the image file name. Since the authors did not consider any file formats other than JPEG, data that was embedded in base64 encoding and data that was saved in PNG format was excluded from analysis. Also, in some cases, images can share the same filename despite being present under different listings and marketplaces. Thus, it is likely that many images were disregarded that were not actual duplicates, but perceived as such. To alleviate these problems, our study defined duplicate images by the listings they belonged to, rather than their titles, and considered all image formats, resulting in 297,922 unique images from a fraction of the DNM Archive, 5,944 of which contained metadata. The results of the metadata analysis in our work will be presented in Section 5.

Another contribution of our study is a determination of the best image hashing technique for DVP applications. Image hashes have been considered and compared beforehand to evaluate robustness against image modifications such as changes in brightness, contrast, scaling, and more [4], resulting in a determination of how often image modifications resulted in different hashes compared to that of the original source image. Such studies report Perceptual Hashing, PHASH, as the most robust hashing algorithm. In contrast, the work described in this paper aims to determine the accuracy in image matching based on image hashing without purposefully modifying images. Additionally, this paper considers dark marketplace image data specifically. Our study concludes that PHASH is the most effective hashing algorithm for DVP as well, which will be further detailed in Section 5.

4 METHODOLOGY
In this section, the research approach is discussed in steps. All code supporting this work was written in Python version 2.7.15.

4.1 Dark Marketplace Dataset
The dataset pulled from the DNM Archive consists of 34.4 GB of zipped directories containing HTML, JavaScript, and styling files along with images. After extracting the zipped directories, 47 of the 74 download directories (totaling 667.6 GB) were determined to be useful for the DVP database (DVP DB). The remaining 27 directories (115 GB) were not processed into the DVP DB due to a number of factors including lack of image data or inconsistent HTML formatting which made organizing web scraped data into the DVP DB time-consuming and futile. This is a limitation to this work since the DNM Archive used for this study was not considered in its entirety.

Each marketplace directory was then individually inspected for HTML formatting and directory hierarchy. Unfortunately, almost all marketplaces were scraped differently: some had photos saved into directories while others used base64 encoding in HTML, some organized listings into separate directories while others contained all listings under one directory, and so on. Each marketplace directory had to be processed with unique Python scripts. Furthermore, the number of unique listings collected varied immensely between the marketplaces. For example, the Agora directory contained 116,858 unique listings, while the Bloomsfield directory contained only 8. Nonetheless, each marketplace directory was parsed to collect the same data per listing: product ID, marketplace, product name, vendor name, scrape date, and image path.

For base64 encoded images that were not originally stored in the directory, the Python scripts created a new PNG file in the current directory based on the base64 encoded data. Each image was further processed to check for metadata within the photos and hashed with each of the four different image hashing schemes considered in this study. Namely, the Python implementations of Average Hash (AHASH), Difference Hash (DHASH), Perceptual Hash (PHASH), and Wavelet Hash (WHASH) were considered for image hash analysis [1].

Any images or image paths that caused errors in processing, such as incorrect file extension and file not found, were not included


in the DVP DB and therefore not considered for metadata and hash
analysis.
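
The base64 step described above can be illustrated with a minimal Python 3 sketch. The study's own scripts were written in Python 2.7 and are not reproduced here, so everything in this block is an assumption made for illustration: the regular expression for inline `data:` URIs, the function name `extract_base64_images`, the output naming scheme, and the choice to take the file extension from the declared MIME subtype.

```python
import base64
import os
import re

# Assumed pattern for inline images such as <img src="data:image/png;base64,...">.
DATA_URI = re.compile(
    r'src=["\']data:image/(\w+);base64,([A-Za-z0-9+/=\s]+)["\']'
)


def extract_base64_images(html, out_dir, prefix="listing"):
    """Decode inline base64 images found in `html` and write them to out_dir.

    Returns the list of file paths written. Payloads that fail to decode are
    skipped, in the spirit of excluding images that caused processing errors.
    """
    paths = []
    for index, match in enumerate(DATA_URI.finditer(html)):
        extension, payload = match.group(1), match.group(2)
        try:
            # Remove embedded whitespace, then decode strictly.
            raw = base64.b64decode("".join(payload.split()), validate=True)
        except ValueError:
            continue  # malformed base64 (binascii.Error is a ValueError)
        path = os.path.join(out_dir, "%s_%d.%s" % (prefix, index, extension))
        with open(path, "wb") as handle:
            handle.write(raw)
        paths.append(path)
    return paths
```

A decoded file written this way could then be passed to whatever metadata and hashing routines a pipeline uses; the sketch deliberately stops at writing the bytes to disk.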

4.2 Image Hash Analysis Approach


To compare the image hashing techniques, a hash analysis was
conducted. The goal was to determine which hashing technique
was most accurate in calculating identical hashes for similar or
equivalent dark marketplace images. Presumably, if a hash provides
enough data to effectively represent images for the purpose of
image matching, then the hash would be sufficient for investigating
vendor aliases, rather than requiring an image in its entirety.
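
To make the idea of identical hashes for similar images concrete, the following sketch implements a simplified average hash in pure Python and contrasts it with a cryptographic hash. It is not the library implementation used in the study [1]: the function names are hypothetical, and the input is assumed to be an already grayscaled and downscaled pixel grid, whereas a real pipeline would first resize the image (commonly to 8x8).

```python
import hashlib


def average_hash(pixels):
    """Average hash of a small grayscale grid (list of rows of 0-255 ints).

    Each bit records whether a pixel is brighter than the grid's mean, so a
    uniform brightness shift leaves the hash unchanged.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = "".join("1" if value > mean else "0" for value in flat)
    width = (len(bits) + 3) // 4  # number of hex digits needed
    return format(int(bits, 2), "0%dx" % width)


def sha256_of(pixels):
    """Cryptographic hash of the same pixel data, for contrast."""
    flat = bytes(value for row in pixels for value in row)
    return hashlib.sha256(flat).hexdigest()


original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# Same picture, uniformly brightened by 20 grey levels.
brightened = [[value + 20 for value in row] for row in original]

print(average_hash(original) == average_hash(brightened))  # image hashes match
print(sha256_of(original) == sha256_of(brightened))        # SHA-256 digests differ
```

The brightened grid differs from the original in every pixel, yet its bright/dark pattern relative to the mean is unchanged, which is why the image hash is preserved while the cryptographic digest is not.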
To evaluate the accuracy of this approach, images were first
grouped by hash and hash type, such that if there were 10 images
that produced the same AHASH value, then they were considered
to belong to a single group. For each group, a Structural Similarity
Index Metric (SSIM) [3] was calculated between each possible pair
of images belonging to the said group. The SSIM is a metric that
determines the percentage of similarity between photos. The more
alike two photos are, the closer their SSIM value will be to 1.0. The
more different two photos are, the closer their SSIM value will be
to 0. Therefore, by incorporating SSIM calculations in the hash
analysis, we were able to quantify the level of accuracy in using
image hashes for image matching in dark marketplaces.
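
For reference, SSIM between two image windows $x$ and $y$ is commonly defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

with $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ for dynamic range $L$. The sketch below evaluates this formula once over the whole image, treated as a single window. That is a simplification assumed here for brevity: standard implementations average the index over local sliding windows, and the function name `global_ssim` is hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of grayscale pixels."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Conventional stabilizing constants K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs give a value of 1.0 up to floating-point rounding. Note that the formula can also return negative values for strongly anti-correlated inputs, such as an image and its inverse, so the 0 to 1 reading above describes the typical case rather than a hard bound.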
To explain further, consider a group X of 10 images that share the AHASH value abc. For the hash analysis, each image in group X was paired with every other image in the group, such that SSIM values were calculated for the unique image pairs (X1, X2), (X1, X3), (X1, X4), etc. Thus, the 10 images from group X resulted in 45 image pairs and 45 SSIM values. The 45 SSIM values would then be averaged and stored in a database such that a single entry in the database would contain the hash, the hash type, the average SSIM, and the number of pairs considered in calculating the average SSIM for a particular group.
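
The pair count follows from choosing 2 of 10 images: 10 · 9 / 2 = 45. The grouping and pairing steps can be sketched as follows. All function names are hypothetical, and `similarity` stands in for whatever pairwise measure is used (SSIM in the study). The last helper corresponds to the pair-weighted overall average described in the paragraph that follows.

```python
from itertools import combinations


def group_by_hash(image_hashes):
    """Map each hash value to the list of image ids that produced it."""
    groups = {}
    for image_id, hash_value in image_hashes.items():
        groups.setdefault(hash_value, []).append(image_id)
    return groups


def analyze_group(image_ids, similarity):
    """Average a pairwise similarity over every unique pair in one group.

    Returns (average, pair_count), or None for groups with fewer than
    two images, which have no pairs to compare.
    """
    scores = [similarity(a, b) for a, b in combinations(image_ids, 2)]
    if not scores:
        return None
    return sum(scores) / len(scores), len(scores)


def weighted_average(group_results):
    """Overall average of (average, pair_count) results, weighted by pairs."""
    total_pairs = sum(count for _, count in group_results)
    return sum(avg * count for avg, count in group_results) / total_pairs
```

For example, `analyze_group` applied to a group of 10 image ids evaluates `similarity` on 45 pairs, and `weighted_average([(1.0, 45), (0.5, 5)])` gives 0.95, so the larger group dominates the overall figure.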
Finally, after calculating average SSIM values for each group of unique hashes and hash types, an overall weighted average SSIM was calculated for each of the four hash types. The weighted average SSIM gives each group’s SSIM value a weight determined by the number of image pairs used to calculate it. This way, groups with a large number of images more heavily influenced the overall weighted average SSIM compared to groups with a small number of images. Figure 3 summarizes the workflow of image-based analysis conducted in this study.

Figure 3: Workflow Summary of Image-Based Analysis for DVP.

4.3 DVP Database
The DVP database (DVP DB) is a MySQL database organized into three tables, the first of which holds main listing data, such as listing ID, product ID, marketplace, product name, vendor, and scrape date. The second contains all image data, i.e. the image path, any metadata found, and the four hashes of the image. The final table is used to store all hash analysis data, i.e. the hash, hash type, number of images with said hash, the number of image pairs used in calculating the average SSIM, the number of pairs not used in calculating the average SSIM due to error, and the average SSIM for each hash. DVP DB contains no duplicate listings and was cleaned of any erroneous entries, such as those having empty values for product ID, marketplace, product name, vendor name, or scrape date.

5 EXPERIMENTAL RESULTS
Overall, the DVP DB consisted of data from 47 marketplaces with 400,741 product listings, 297,922 images, and 10,712 unique vendor names from 391 scrape dates. This shows that the majority of dark marketplace listings incorporate image data (approximately 75%). The following section discusses the results for the three contributions of this research based on the experimentation with the DVP DB.

5.1 Top Dark Marketplaces
To determine the most meaningful marketplaces present in the DNM Archive, first each of the 47 marketplaces in DVP DB was ranked against the others using the following measures: number of listing entries, number of vendors, number of images, number of images with metadata, number of images with GPS coordinate data,


proportion of images with metadata, and proportion of images with GPS coordinate data. Then, an average rank was calculated for each of the marketplaces using the seven aforementioned metrics, which denoted the overall significance of each marketplace in comparison with the others. Among all the dark marketplaces under study, Agora is found to be the most informative, with 116,858 entries and 64,535 images, 3.6% of which contained metadata. The remaining marketplaces are listed in Table 1 in order of their significance as determined by this study. This table can effectively serve as a reference for the most effective dark marketplaces present in the DNM Archive for future DVP research.

In addition, we also analyzed vendor names to determine the frequency of vendor names being shared across dark marketplaces. Interestingly, we found that names were frequently repeated across platforms, as shown in Table 2. While it is possible that individual vendors coincidentally shared the same name in separate marketplaces, this analysis supports the more likely presumption of dark marketplace overlap by multi-market vendors. Furthermore, this analysis supports the idea that image data used in conjunction with textual data can lead to more accurate DVP by verifying whether a repeated vendor name is simply a coincidence or a probable alias.

5.2 Image Metadata
Besides image content, image files also hold information relevant to the image’s production such as data on camera settings, camera brand and model, time of creation, GPS location, image creator and more, which can be embedded into image files. Evidently, such image metadata has the potential to present investigators with a plethora of identifiable evidence, which may lead to the de-anonymization of dark vendors.

In our case, of the 297,922 image files stored in DVP DB, 5,944 were found to contain some metadata (2.0% of all images). Table 3 summarizes the top marketplaces containing the highest proportion of images with metadata. Interestingly, 37 of the 47 marketplaces contained 0 images with metadata, suggesting those sites purposefully stripped images of their metadata upon upload as a precaution.

DVP DB images were further analyzed for the presence of GPS data embedded in images which could greatly assist investigators in identifying the physical locations of dark vendors. Of all DVP DB images, 828 were found to contain GPS data specifically (0.28% of all images). Though only a few images contained metadata and GPS location data, collecting such information can aid in dark vendor profiling and locating. For example, in the case that a dark vendor under alias A uploads a photo to marketplace A, where

5.3 Image Hashing
The last goal of this study is to find an effective way to represent image data so that images listed may be used to correlate vendors without having to save or view image content. The DVP DB resulted in 76,261 unique AHASHes, 85,651 DHASHes, 80,321 PHASHes, and 75,114 WHASHes. Out of these, over 40,000 images per hash type were repeated at least once in the database. By running a hash analysis on DVP DB images, PHASH was determined as the most effective solution for image hashing in dark web applications. The complete results of the hash analysis are listed below in Table 4 in order of weighted SSIM. Again, the weighted average SSIM takes into account the number of image pairs considered when calculating average SSIM values per unique hash, and is therefore a more accurate calculation of hash type reliability.

6 LIMITATIONS AND FUTURE WORK
The main limitation within this work is in the DVP DB. While parsing the DNM Archive, any anomalous files which caused processing errors for unknown reasons or did not match the expected HTML formatting were passed over and not included in image metadata analysis and hash analysis. In addition, the amount of data provided by each of the DNM Archive marketplace datasets varies due to inconsistent scrape dates and listing quantities. As future work, a dark web crawler can be developed which scrapes marketplace listings more frequently and systematically such that data acquired is better balanced and more representative of existing dark marketplaces in the wild and their vendors.

We have several immediate research plans to follow up. The top 30 DNM Archive marketplaces identified in this study will be used to parse text-based data in addition to image-based data for the purpose of designing machine-learning-based classification techniques for alias attribution. Also, PHASHes of images will be used to represent image content rather than saving the image content itself, thereby avoiding both (a) legal issues caused by downloading and possessing exploitative imagery and (b) storage and processing overhead.

7 CONCLUSION
Analyzing dark marketplaces is an imperative part of cyber criminal investigation and prosecution. However, due to the anonymous nature of the Tor network and hidden services, dark marketplace analysis is non-trivial. This research considers the task of alias attribution between dark vendors in dark marketplaces, i.e. Dark Vendor Profiling. Accurately determining alias vendors conducting business in multiple marketplaces will aid dark web investigations and lead to the de-anonymization of anonymous sellers of paraphernalia.

Whereas previous research relied on text-based content for alias
metadata is automatically scraped, and the same vendor under alias attribution across hidden services, this research examines the avail-
B uploads the same photo to marketplace B, where metadata is not ability and significance of image data for dark vendor profiling in
automatically scraped, correlating alias A and alias B through DVP Tor-based dark marketplaces specifically. Namely, this work de-
could generate evidence (such as physical location) against alias A termined the most informative dark marketplaces available from
that would not have been generated had the correlation not been the public Darknet Market Archive dataset, evaluated the presence
made. Therefore, despite the presence of metadata being limited, of metadata in dark marketplace images, and analyzed four im-
its significance and impact can be extended when considered in age hashing techniques, leading to identifying Perceptual Hashing
conjunction with alias attribution. to be the most accurate technique for matching similar images


Marketplace            # Listings   # Vendors    # Images   # W/ Meta    # W/ GPS   % W/ Meta    % W/ GPS
Agora                     116,858       3,154      64,535       2,292         214       3.55%       0.33%
Blackbank Market           12,852         905      10,565       2,086         470      19.74%       4.45%
Evolution                  89,208       3,922      69,492         749          62       0.88%       0.07%
Alphabay                   88,722       1,446      79,060           0           0          0%          0%
Pandora                    15,223         516      15,066          11           0       0.07%          0%
Tor Escrow                    958         185         866         241          43      27.83%       4.97%
Abraxas                    16,641         432      11,979           0           0          0%          0%
Tor Market                  1,502         200         817         230          14      28.15%       1.71%
Cloudnine                  10,952       1,088      10,070           0           0          0%          0%
Dream Market                7,251         398       6,385           0           0          0%          0%
Cryptomarket                4,422         411       3,941           0           0          0%          0%
Middle Earth                6,650         359       6,167           0           0          0%          0%
Andromeda                   3,054         237       2,947           0           0          0%          0%
Bluesky                     2,400         213       2,089           0           0          0%          0%
Oxygen                      2,212         257       2,012           0           0          0%          0%
Freebay                       507         175         417          97           6      23.26%       1.44%
Hydra                       2,282         166       2,240           0           0          0%          0%
Cannabis Road 2             1,537         155       1,442           0           0          0%          0%
Area51                        489          74         479          89           6      18.58%       1.25%
East India Company          1,429         143       1,232           0           0          0%          0%
The Real Deal                 981          82         873           0           0          0%          0%
Black Services                639         167         621           0           0          0%          0%
The Marketplace               823         124         584           0           0          0%          0%
Amazon Dark                   199          41         190          57           5         30%       2.63%
Haven                         741          74         704           0           0          0%          0%
Darkbay                       538         124         533           0           0          0%          0%
Cannabis Road 3               318          95         258           3           0       1.16%          0%
Panacea                       461          21         459           0           0          0%          0%
Silkstreet                     35          14          33          11           7      33.33%      21.21%
Freemarket                    169           6         167          52           1      31.14%       0.60%
Poseidon                      427          17         427           0           0          0%          0%
Torbazaar                     383          27         332           0           0          0%          0%
Tochka                        197          29         192           0           0          0%          0%
1776                          171          37         170           0           0          0%          0%
Darknet Heroes                207          28         153           0           0          0%          0%
Deepzon                        56           6          55          19           0      34.55%          0%
Dogeroad                      112          28         100           0           0          0%          0%
Underground Market            143          20         112           0           0          0%          0%
The Majestic Garden            88          16          63           0           0          0%          0%
Horizon                        44          11          44           2           0       4.55%          0%
Cantina                        21           9          19           5           0      26.32%          0%
Topix2                         34          24          24           0           0          0%          0%
Sheep                       8,048         370           0           0           0          0%          0%
Bloomsfield                     8           3           8           0           0          0%          0%
White Rabbit                  313          62           0           0           0          0%          0%
Kiss                          415          10           0           0           0          0%          0%
Greyroad                       21           9           0           0           0          0%          0%

Table 1: DVP DB Marketplaces listed in order of significance in DVP DB by calculating an average rank over seven characteristics: number of listing entries, number of vendors, number of images, number of images with metadata, number of images with GPS coordinate data, proportion of images with metadata, and proportion of images with GPS coordinate data.
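The average-rank ordering described in the caption of Table 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `average_ranks` and the tie-handling rule (a rank equals one plus the number of strictly larger values) are assumptions, and only the first four rows of Table 1 are used, so the resulting averages differ from those obtained over all 47 marketplaces.

```python
# Columns, in order: listings, vendors, images, images with metadata,
# images with GPS data, % with metadata, % with GPS data (values from Table 1).
ROWS = {
    "Agora": (116858, 3154, 64535, 2292, 214, 3.55, 0.33),
    "Blackbank Market": (12852, 905, 10565, 2086, 470, 19.74, 4.45),
    "Evolution": (89208, 3922, 69492, 749, 62, 0.88, 0.07),
    "Alphabay": (88722, 1446, 79060, 0, 0, 0.0, 0.0),
}


def average_ranks(rows):
    """Return the mean rank of each marketplace over all characteristics.

    Rank 1 is the largest value in a column. Ties share a rank
    (assumed rule: 1 + number of strictly larger values).
    """
    names = list(rows)
    n_cols = len(next(iter(rows.values())))
    totals = {name: 0 for name in names}
    for col in range(n_cols):
        column = [rows[name][col] for name in names]
        for name in names:
            value = rows[name][col]
            totals[name] += 1 + sum(1 for other in column if other > value)
    return {name: totals[name] / n_cols for name in names}


averages = average_ranks(ROWS)
# A lower average rank means a more significant marketplace.
ranking = sorted(ROWS, key=lambda name: averages[name])
for name in ranking:
    print(f"{name}: {averages[name]:.2f}")
```

Averaging ranks rather than raw values keeps large-count columns, such as the number of listings, from dominating the percentage columns.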


Vendor Name    # Marketplaces    # Listings
mikehamer                  16           227
bcdirect                   16           652
blackhand                  14         1,227
idealpills                 13         1,351
theblossom                 13           180

Table 2: Top five vendor names in DVP DB based on the number of distinct marketplace appearances.

Marketplace    % Images w/ Meta    % Images w/ GPS
deepzon                  34.55%                 0%
silkstreet               33.33%             21.21%
freemarket               31.14%              0.60%
amazondark               30.00%              2.63%
tormarket                28.15%              1.71%

Table 3: Top five marketplaces in DVP DB based on the proportion of images containing metadata.

Image Hash Type    Weighted Avg SSIM    Avg SSIM
PHASH                          0.991       0.986
DHASH                          0.987       0.989
AHASH                          0.881       0.976
WHASH                          0.660       0.975

Table 4: Hash analysis results for average, difference, perceptual, and wavelet hashing in order of weighted average SSIM values.
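Table 4 reports both a plain and a weighted average SSIM for each hash type. The sketch below illustrates the difference under one plausible reading of the description: every unique hash value contributes the mean SSIM of its image pairs, and the weighted variant weights that mean by the number of pairs behind it. The function name and the sample numbers are hypothetical and are not taken from the study.

```python
def plain_and_weighted_average(per_hash):
    """Compute the plain and pair-weighted average SSIM.

    per_hash: list of (mean_ssim, n_pairs) tuples, one per unique hash value.
    Returns (plain_average, weighted_average).
    """
    plain = sum(mean for mean, _ in per_hash) / len(per_hash)
    total_pairs = sum(n_pairs for _, n_pairs in per_hash)
    weighted = sum(mean * n_pairs for mean, n_pairs in per_hash) / total_pairs
    return plain, weighted


# Hypothetical data: one hash shared by many near-identical images,
# and one hash with a few poorly matching pairs.
sample = [(1.0, 90), (0.5, 10)]
plain, weighted = plain_and_weighted_average(sample)
print(f"plain={plain:.3f} weighted={weighted:.3f}")
```

In this toy case the plain average treats both hash values equally, while the weighted average reflects that most compared pairs matched well.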

between dark marketplace listings. This work helps future dark vendor research by identifying not only the list of marketplaces best suited for experimentation but also the image hashing technique best suited for dark web scraping.

This work supports our efforts toward multi-modal Dark Vendor Profiling using machine learning based classification techniques where text, image, and behavioral data will be considered for improved results.
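As an illustration of how stored image hashes alone could support alias attribution, the following sketch links vendor aliases across marketplaces whenever two listing images have nearly identical 64-bit hashes. It is a minimal sketch under stated assumptions: the hashes are assumed to be precomputed integers (for example, from a perceptual hash), the threshold of 4 differing bits is arbitrary, and all marketplace and vendor names are hypothetical.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def link_aliases(records, max_distance=4):
    """Return candidate alias pairs across different marketplaces.

    records: list of (marketplace, vendor, image_hash) tuples.
    max_distance: assumed threshold; hashes differing in at most this
    many bits are treated as the same image.
    """
    links = set()
    for i, (market_1, vendor_1, hash_1) in enumerate(records):
        for market_2, vendor_2, hash_2 in records[i + 1:]:
            if market_1 == market_2:
                continue  # only cross-marketplace matches suggest an alias
            if hamming_distance(hash_1, hash_2) <= max_distance:
                links.add(((market_1, vendor_1), (market_2, vendor_2)))
    return links


BASE_HASH = 0xF0E1D2C3B4A59687  # hypothetical 64-bit image hash
records = [
    ("market_a", "alias_a", BASE_HASH),
    ("market_b", "alias_b", BASE_HASH ^ 0b101),  # differs in 2 bits
    ("market_b", "other_vendor", BASE_HASH ^ 0xFFFFFFFFFFFFFFFF),  # all 64 bits differ
]
links = link_aliases(records)
print(links)
```

Because only hashes are compared, no image content needs to be retained, which is consistent with the storage and legal motivations given for hash-based representation.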
