Towards Image-Based Dark Vendor Profiling
Session: Multi-modal Data Analysis for Security IWSPA ’20, March 18, 2020, New Orleans, LA, USA
these challenges to pave the road for efficient image-based DVP techniques.

In general, previous studies have avoided downloading image content due to ethical concerns, such as unintentionally accessing child pornography and other illicit material. In our study, we explore methods to represent image content without the need to view, download, or store images for DVP. Specifically, we examine the metadata of images and evaluate the effectiveness of storing hashes of dark marketplace images, rather than the image content. This avoids unintentional exposure to child obscenity and, at the same time, saves computational resources. The nature of hashing allows large amounts of data to be represented in short character strings. Thus, it is a natural candidate for representing large amounts of image data without the need to store the actual image content. Classic hashing algorithms, however, are designed such that any small change in data results in a major change in the hash. Consequently, this work also examines several types of image hashing algorithms, which allow similar images to generate the same hash, even if resized, cropped, or filtered. In particular, our study analyzes images from 47 Tor-based dark marketplaces and compares four image hashing techniques to determine the most effective means to identify similar images within and between dark marketplaces, leading to DVP.

In addition, our study aims to support alias attribution, i.e. the correlation of several accounts belonging to the same vendor, using image-based data. According to Black Widow [12], a cyber intelligence gathering framework for dark web applications, there is substantial overlap between actors across dark forums, even if the forums are not based in the same language. Therefore, it is reasonable to suspect that a similar overlap exists between dark marketplaces as well. Presumably, images have the potential to aid the identification of dark vendor aliases, since it is likely that a vendor would use the same images to sell their product if they were participating in several dark marketplaces. Furthermore, images could assist in the development of incriminating evidence if they contain metadata concerning the image’s author, date and time of creation, location, camera make and model, and more.

Since very few reliable datasets exist for the purpose of dark web analysis research, developing complete, reusable data is undeniably one of the largest roadblocks. For the purpose of this study, we started with a publicly available Darknet Market (DNM) Archive consisting of data scraped from 89 dark marketplaces from 2013-2015 [6]. Despite the author’s warning of potential incompleteness of each crawl, the 1.6 TB dataset has been used by several researchers in a variety of works. In an attempt to address the incompleteness issue, another goal of our study is to identify the most complete and useful set of dark marketplaces from the DNM Archive to support future dark marketplace analysis research.

In summary, this research directly supports intelligence and law enforcement communities’ investigative efforts in DVP by offering the following contributions: we demonstrate how more effective DVP can be achieved by including image data in the analysis of dark marketplaces; we determine the most effective image hashing technique for the identification of images repeatedly used by vendors, leading to dark vendor alias attribution; and we identify the most complete and useful set of dark marketplaces from the well-known DNM Archive.

The rest of this paper is organized as follows. Section 2 describes the anonymity techniques that allow cyber criminals to conduct illegal activity online. Section 3 explores related dark web analysis studies and discusses their limitations. The methodology and experimental results of this work are discussed in Sections 4 and 5 respectively, followed by a discussion of limitations and future work in Section 6. Finally, Section 7 concludes this research.

2 BACKGROUND

In this section, Tor and its hidden service infrastructure are explained to illustrate the challenges in identifying owners, vendors, and users of dark marketplaces when anonymity techniques help to conceal user behavior, location, and other potentially incriminating data.

2.1 Tor: The Anonymous Network

Anonymity networks are overlay networks: they use software solutions deployed on top of existing infrastructure, i.e. the Internet, to map virtual links between clients and services for the creation of new virtualized network infrastructures [10]. By far, the most prevalent anonymity network is The Onion Router, Tor, developed by The Tor Project, Inc. and initially released in 2002 [7]. Typical Tor connections are based on circuits of three relay nodes, namely an entry node, a middle node, and an exit node. When preparing a stream of data to be sent down a circuit, a user will encrypt their data three times, using each relay’s public key once. As the data is passed from the entry node, to the middle node, and to the exit node, a layer of encryption is removed at each hop. Finally, once the data has reached the circuit’s exit node, the data is fully decrypted and passed to the destination node. This scheme allows anonymity for the user not only by performing several rounds of encryption but also by ensuring each node is only aware of its neighboring nodes in a circuit, i.e. no node is aware of the overall end-to-end communication, ensuring clients and services are never directly connected.

Additionally, when the Tor browser is used, little to no remnants of internet activity can be forensically recovered from the device. Specifically, forensic analysis may verify whether or not the Tor browser was installed on a client computer, but not if and when it was used, nor what it was used for [13]. It is important to note that the connection between a user and an anonymity network is not hidden in Tor. However, the user’s location and the content of the communications within the network remain concealed. Most often, the user’s traffic is delivered on shared bandwidth, making it even more difficult to distinguish between individual connections.

2.2 Hidden Services

Tor’s most distinctive feature is its ability to provide hidden services, each of which is hosted with a .onion address [7]. Tor hidden services enable users to host anonymous, theoretically untraceable websites by implementing additional security measures. This feature enables dark marketplaces to be hosted and dark vendors to conduct criminal activity. Unlike typical Tor network connections, connections to hidden services involve additional interactions with Introduction Points and Rendezvous Points and result in six total relay nodes: one entry, one middle, and one exit node for both the client and the service [10, 11].
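As a rough illustration of the layered encryption described in Section 2.1, the following Python 3 sketch wraps a message in three layers and lets each relay remove one. It is a toy under loud assumptions: the relay keys, the message, and all function names are hypothetical, and a SHA-256-derived XOR keystream stands in for real cryptography. Tor's actual circuit cryptography is considerably more involved and is not reproduced here.

```python
import hashlib


def keystream(key, length):
    """Derive a repeatable byte stream from a key (toy construction, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_layer(data, key):
    """Add or remove one layer; XOR with the same keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def wrap(message, relay_keys):
    """Client side: apply one layer per relay, starting with the exit relay's layer."""
    for key in reversed(relay_keys):
        message = xor_layer(message, key)
    return message


def relay_path(onion, relay_keys):
    """Each relay removes its own layer in circuit order.

    Returns the bytes visible after each hop.
    """
    seen = []
    for key in relay_keys:
        onion = xor_layer(onion, key)
        seen.append(onion)
    return seen


# Hypothetical keys for the entry, middle, and exit relays.
RELAY_KEYS = [b"entry-key", b"middle-key", b"exit-key"]
MESSAGE = b"GET /index.html"

ONION = wrap(MESSAGE, RELAY_KEYS)
HOPS = relay_path(ONION, RELAY_KEYS)
```

In this toy, only the output of the third hop equals the original message; the entry and middle relays see bytes that are still wrapped. Because XOR layers commute, the sketch does not enforce hop order the way a real layered cipher would.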
[15]. Ultimately, this analysis is used to compute similarities between pseudo users and attribute aliases. The researchers achieved sufficient precision for small sets of forums and users, but underachieved in terms of recall, resulting in 25% and 45% in pseudo user set sizes of 177 and 25 respectively. Also, this model is not likely to be successful in marketplace settings where writing styles are less distinctive.

Another study [16] discusses the limitation of stylometric analysis in dark marketplaces. The success of text-based analysis in settings where users post rich and diverse text is not reproducible in dark marketplaces, since the available text data is sparse and usually reflects the product type instead of the vendor’s writing style. Thus, the researchers proposed to fingerprint vendors by their photographic style instead, building a classifier to identify vendors with multiple accounts across several dark marketplaces based on high-level features like object, scene, background, camera angle, and others. By considering image data, Wang et al. developed a highly accurate model to correlate vendor accounts. However, neither image metadata nor text-based data was considered in their classification, which might have improved their model’s accuracy. Furthermore, their work relied on the content of the images. This requires substantial computing resources, which may not always be available in practice, especially in smaller, local investigative organizations.

A recent study used crypto-currency data to analyze dark marketplaces for buyer and seller de-anonymization [14]. In this work, Sima considers a single dark marketplace, Valhalla, and de-anonymizes users by analyzing scraped data against publicly available bitcoin data. This work resulted in an application that associated dark identities with related bitcoin addresses and transactions. However, since this research was based on a single dark marketplace, it does not address the specific dark vendor profiling objective of alias attribution.

As discussed, the existing works in marketplace analysis differ from our work as they are based on either text-based data, image content, or crypto-currency data, whereas our work focuses on image metadata and image hashing for alias attribution across various dark marketplaces.

3.2 Image Analysis

Researchers in [9] used the same DNM Archive as used in this study to evaluate the prevalence of metadata by collecting all JPEG image files from the archive and extracting their metadata. In total, authors Lisker and Rose observed 223,471 unique photos from the entire DNM Archive, 229 of which contained GPS information, defining unique photos by the image file name. Since the authors did not consider any file format other than JPEG, data that was embedded in base64 encoding and data that was saved in PNG format were excluded from analysis. Also, in some cases, images can share the same filename despite being present under different listings and marketplaces. Thus, it is likely that many images were disregarded that were not actual duplicates, but perceived as such. To alleviate these problems, our study defined duplicate images by the listings they belonged to, rather than their titles, and considered all image formats, resulting in 297,922 unique images from a fraction of the DNM Archive, 5,944 of which contained metadata. The results of the metadata analysis in our work will be presented in Section 5.

Another contribution of our study is a determination of the best image hashing technique for DVP applications. Image hashes have been compared previously to evaluate robustness against image modifications such as changes in brightness, contrast, scaling, and more [4], resulting in a determination of how often image modifications produced hashes different from that of the original source image. Such studies report Perceptual Hashing, PHASH, as the most robust hashing algorithm. In contrast, the work described in this paper aims to determine the accuracy of image matching based on image hashing without purposefully modifying images. Additionally, this paper considers dark marketplace image data specifically. Our study concludes that PHASH is the most effective hashing algorithm for DVP as well, which will be further detailed in Section 5.

4 METHODOLOGY

In this section, the research approach is discussed in steps. All code supporting this work was written in Python version 2.7.15.

4.1 Dark Marketplace Dataset

The dataset pulled from the DNM Archive consists of 34.4 GB of zipped directories containing HTML, JavaScript, and styling files along with images. After extracting the zipped directories, 47 of the 74 download directories (totaling 667.6 GB) were determined to be useful for the DVP database (DVP DB). The remaining 27 directories (115 GB) were not processed into the DVP DB due to a number of factors, including lack of image data or inconsistent HTML formatting, which made organizing web-scraped data into the DVP DB time-consuming and futile. This is a limitation of this work, since the DNM Archive used for this study was not considered in its entirety.

Each marketplace directory was then individually inspected for HTML formatting and directory hierarchy. Unfortunately, almost all marketplaces were scraped differently: some had photos saved into directories while others used base64 encoding in HTML, some organized listings into separate directories while others contained all listings under one directory, and so on. Each marketplace directory had to be processed with unique Python scripts. Furthermore, the number of unique listings collected varied immensely between the marketplaces. For example, the Agora directory contained 116,858 unique listings, while the Bloomsfield directory contained only 8. Nonetheless, each marketplace directory was parsed to collect the same data per listing: product ID, marketplace, product name, vendor name, scrape date, and image path.

For base64 encoded images that were not originally stored in the directory, the Python scripts created a new PNG file in the current directory based on the base64 encoded data. Each image was further processed to check for metadata within the photos and hashed with each of the four different image hashing schemes considered in this study. Namely, the Python implementations of Average Hash (AHASH), Difference Hash (DHASH), Perceptual Hash (PHASH), and Wavelet Hash (WHASH) were considered for image hash analysis [1].

Any images or image paths that caused errors in processing, such as incorrect file extension and file not found, were not included
in the DVP DB and therefore not considered for metadata and hash
analysis.
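To make the hashing step concrete, the sketch below gives simplified pure-Python versions of two of the four schemes, Average Hash and Difference Hash, plus a Hamming distance for comparing them. It is an illustration under stated assumptions, not the implementation used in this study [1]: it takes an already downscaled grayscale pixel grid as input (the resizing and grayscale conversion a real image library performs are omitted), the sample grids and all names are hypothetical, and it is written for Python 3. A SHA-256 digest is included only to contrast a classic hash with the image hashes.

```python
import hashlib


def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def difference_hash(pixels):
    """One bit per horizontal neighbor pair: 1 if the left pixel is brighter."""
    return [1 if row[i] > row[i + 1] else 0
            for row in pixels
            for i in range(len(row) - 1)]


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def sha256_hex(pixels):
    """Classic cryptographic hash of the raw pixel bytes, for contrast."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


# Hypothetical 8-row by 9-column grayscale grids (values 0..255) standing in
# for downscaled images; 9 columns yield 8 neighbor pairs per row.
ORIGINAL = [[(r * 9 + c) * 3 for c in range(9)] for r in range(8)]
BRIGHTER = [[p + 20 for p in row] for row in ORIGINAL]    # uniform brightness shift
INVERTED = [[255 - p for p in row] for row in ORIGINAL]   # clearly different image
```

For these sample grids, the uniform brightness shift leaves both image hashes unchanged while the SHA-256 digest changes, and the inverted grid differs from the original in every bit. This is the property that lets similar images be grouped by hash without storing their content.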
proportion of images with metadata, and proportion of images with GPS coordinate data. Then, an average rank was calculated for each of the marketplaces using the seven aforementioned metrics, which denoted the overall significance of each marketplace in comparison with the others. Among all the dark marketplaces under study, Agora is found to be the most informative, with 116,858 entries and 64,535 images, 3.6% of which contained metadata. The remaining marketplaces are listed in Table 1 in order of their significance as determined by this study. This table can serve as a reference for the most informative dark marketplaces present in the DNM Archive for future DVP research.

In addition, we also analyzed vendor names to determine how frequently vendor names are shared across dark marketplaces. Interestingly, we found that names were frequently repeated across platforms, as shown in Table 2. While it is possible that individual vendors coincidentally shared the same name in separate marketplaces, this analysis supports the more likely presumption of dark marketplace overlap by multi-market vendors. Furthermore, this analysis supports the idea that image data used in conjunction with textual data can lead to more accurate DVP by verifying whether a repeated vendor name is simply a coincidence or a probable alias.

5.2 Image Metadata

Besides image content, image files can also hold embedded information relevant to the image’s production, such as camera settings, camera brand and model, time of creation, GPS location, image creator, and more. Evidently, such image metadata has the potential to present investigators with a plethora of identifying evidence, which may lead to the de-anonymization of dark vendors.

In our case, of the 297,922 image files stored in DVP DB, 5,944 were found to contain some metadata (2.0% of all images). Table 3 summarizes the top marketplaces containing the highest proportion of images with metadata. Interestingly, 37 of the 47 marketplaces contained 0 images with metadata, suggesting those sites purposefully stripped images of their metadata upon upload as a precaution.

DVP DB images were further analyzed for the presence of embedded GPS data, which could greatly assist investigators in identifying the physical locations of dark vendors. Of all DVP DB images, 828 were found to contain GPS data specifically (0.28% of all images). Though only a few images contained metadata and GPS location data, collecting such information can aid in dark vendor profiling and locating. For example, suppose a dark vendor under alias A uploads a photo to marketplace A, where metadata is automatically stripped, and the same vendor under alias B uploads the same photo to marketplace B, where metadata is not automatically stripped. Correlating alias A and alias B through DVP could then generate evidence (such as physical location) against alias A that would not have been generated had the correlation not been made. Therefore, despite the presence of metadata being limited, its significance and impact can be extended when considered in conjunction with alias attribution.

5.3 Image Hashing

The last goal of this study is to find an effective way to represent image data so that listed images may be used to correlate vendors without having to save or view image content. The DVP DB resulted in 76,261 unique AHASHes, 85,651 DHASHes, 80,321 PHASHes, and 75,114 WHASHes. Out of these, over 40,000 images per hash type were repeated at least once in the database. By running a hash analysis on DVP DB images, PHASH was determined to be the most effective solution for image hashing in dark web applications. The complete results of the hash analysis are listed in Table 4 in order of weighted SSIM. Again, the weighted average SSIM takes into account the number of image pairs considered when calculating average SSIM values per unique hash, and is therefore a more accurate measure of hash type reliability.

6 LIMITATIONS AND FUTURE WORK

The main limitation of this work lies in the DVP DB. While parsing the DNM Archive, any anomalous files which caused processing errors for unknown reasons or did not match the expected HTML formatting were passed over and not included in image metadata analysis and hash analysis. In addition, the amount of data provided by each of the DNM Archive marketplace datasets varies due to inconsistent scrape dates and listing quantities. As future work, a dark web crawler can be developed which scrapes marketplace listings more frequently and systematically, such that the data acquired is better balanced and more representative of existing dark marketplaces in the wild and their vendors.

We have several immediate research plans to follow up. The top 30 DNM Archive marketplaces identified in this study will be used to parse text-based data in addition to image-based data for the purpose of designing machine-learning-based classification techniques for alias attribution. Also, PHASHes of images will be used to represent image content rather than saving the image content itself, thereby avoiding both (a) legal issues caused by downloading and possessing exploitative imagery and (b) storage and processing overhead.

7 CONCLUSION

Analyzing dark marketplaces is an imperative part of cyber criminal investigation and prosecution. However, due to the anonymous nature of the Tor network and hidden services, dark marketplace analysis is non-trivial. This research considers the task of alias attribution between dark vendors in dark marketplaces, i.e. Dark Vendor Profiling. Accurately identifying vendor aliases that conduct business in multiple marketplaces will aid dark web investigations and lead to the de-anonymization of anonymous sellers of paraphernalia.

Whereas previous research relied on text-based content for alias attribution across hidden services, this research examines the availability and significance of image data for dark vendor profiling in Tor-based dark marketplaces specifically. Namely, this work determined the most informative dark marketplaces available from the public Darknet Market Archive dataset, evaluated the presence of metadata in dark marketplace images, and analyzed four image hashing techniques, leading to identifying Perceptual Hashing to be the most accurate technique for matching similar images