Using Positive Tainting and Syntax-Aware Evaluation To Counter SQL Injection Attacks
ABSTRACT

SQL injection attacks pose a serious threat to the security of Web applications because they can
give attackers unrestricted access to databases that contain sensitive information. In this paper,
we propose a new, highly automated approach for protecting existing Web applications against SQL
injection. Our approach has both conceptual and practical advantages over most existing
techniques. From the conceptual standpoint, the approach is based on the novel idea of positive
tainting and the concept of syntax-aware evaluation. From the practical standpoint, our technique
is at the same time precise and efficient and has minimal deployment requirements. The paper also
describes WASP, a tool that implements our technique, and a set of studies performed to evaluate
our approach. In the studies, we used our tool to protect several Web applications and then
subjected them to a large and varied set of attacks and legitimate accesses. The evaluation was a
complete success: WASP successfully and efficiently stopped all of the attacks without generating
any false positives.

Categories and Subject Descriptors: D.2.0 [Software Engineering]: General - Protection mechanisms;

General Terms: Security

Keywords: SQL injection, dynamic tainting, runtime monitoring

Permission to make digital or hard copies of all or part of this work for personal or classroom
use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full citation on the first page. To
copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
SIGSOFT’06/FSE-14, November 5–11, 2006, Portland, Oregon, USA.
Copyright 2006 ACM 1-59593-468-5/06/0011 ...$5.00.

1. INTRODUCTION

SQL injection attacks (SQLIAs) are one of the major security threats for Web applications [5].
Successful SQLIAs can give attackers access to and even control of the databases that underlie
Web applications, which may contain sensitive or confidential information. Despite the potential
severity of SQLIAs, many Web applications remain vulnerable to such attacks.

In general, SQL injection vulnerabilities are caused by inadequate input validation within an
application. Attackers take advantage of these vulnerabilities by submitting input strings that
contain specially-encoded database commands to the application. When the application builds a
query using these strings and submits the query to its underlying database, the attacker’s
embedded commands are executed by the database, and the attack succeeds.

Although this general mechanism is well understood, straightforward solutions based on defensive
coding practices have been less than successful for several reasons. First, it is difficult to
implement and enforce a rigorous defensive coding discipline. Second, many solutions based on
defensive coding address only a subset of the possible attacks. Finally, defensive coding is
problematic in the case of legacy software because of the cost and complexity of retrofitting
existing code. Researchers have proposed a wide range of alternative techniques to address
SQLIAs, but many of these solutions have limitations that affect their effectiveness and
practicality.

In this paper we propose a new, highly automated approach for dynamic detection and prevention of
SQLIAs. Intuitively, our approach works by identifying “trusted” strings in an application and
allowing only these trusted strings to be used to create certain parts of an SQL query, such as
keywords or operators. The general mechanism that we use to implement this approach is based on
dynamic tainting, which marks and tracks certain data in a program at runtime.

The kind of dynamic tainting we use gives our approach several important advantages over
techniques based on different mechanisms. Many techniques rely on complex static analyses in
order to find potential vulnerabilities in code (e.g., [9, 15, 26]). These kinds of conservative
static analyses can generate high rates of false positives or may have scalability issues when
applied to large, complex applications. Our approach does not rely on complex static analyses and
is very efficient and precise. Other techniques involve extensive human effort (e.g., [4, 18,
24]). They require developers to manually rewrite parts of their applications, build queries
using special libraries, or mark all points in the code at which malicious input could be
introduced. In contrast, our approach is highly automated and in most cases requires minimal or
no developer intervention. Lastly, several proposed techniques require the deployment of
extensive infrastructure or involve complex configurations (e.g., [2, 23, 25]). Our approach does
not require additional infrastructure and can be deployed automatically.

Compared to other existing techniques based on dynamic tainting (e.g., [8, 20, 21]), our approach
makes several conceptual and practical improvements that take advantage of the specific
characteristics of SQLIAs. The first conceptual advantage of our approach is the use of positive
tainting. Positive tainting identifies and tracks trusted data, whereas traditional (“negative”)
tainting focuses on untrusted data. In the context of SQLIAs, there are several reasons why
positive tainting is more effective than negative tainting. First, in Web applications, trusted
data sources can be more easily and accurately identified than untrusted data sources; therefore,
the use of positive tainting leads to increased automation.
Second, the two approaches differ significantly in how they are affected by incompleteness. With
negative tainting, failure to identify the complete set of untrusted data sources would result in
false negatives, that is, successful undetected attacks. With positive tainting, conversely,
missing trusted data sources would result in false positives, which are undesirable, but whose
presence can be detected immediately and easily corrected. In fact, we expect that most false
positives would be detected during pre-release testing. The second conceptual advantage of our
approach is the use of flexible syntax-aware evaluation, which gives developers a mechanism to
regulate the usage of string data based not only on its source, but also on its syntactical role
in a query string. In this way, developers can use a wide range of external input sources to
build queries, while protecting the application from possible attacks introduced via these
sources.

The practical advantages of our approach are that it imposes a low overhead on the application
and has minimal deployment requirements. Efficiency is achieved by using a specialized library,
called MetaStrings, that accurately and efficiently assigns and tracks trust markings at runtime.
The only deployment requirements for our approach are that the Web application must be
instrumented and deployed with our MetaStrings library, which is done automatically. The approach
does not require any customized runtime system or additional infrastructure.

In this paper, we also present the results of an extensive empirical evaluation of the
effectiveness and efficiency of our technique. To perform this evaluation, we implemented our
approach in a tool called WASP (Web Application SQL-injection Preventer) and evaluated WASP on a
set of seven Web applications of various types and sizes. For each application, we protected it
with WASP, targeted it with a large set of attacks and legitimate accesses, and assessed the
ability of our technique to detect and prevent attacks without stopping legitimate accesses. The
results of the evaluation are promising; our technique was able to stop all of the attacks
without generating false positives for any of the legitimate accesses. Moreover, our technique
proved to be efficient, imposing only a low overhead on the Web applications.

The main contributions of this work are:

• A new, automated technique for preventing SQLIAs based on the novel concept of positive
  tainting and on flexible syntax-aware evaluation.
• A mechanism to perform efficient dynamic tainting of Java strings that precisely propagates
  trust markings while strings are manipulated at runtime.
• A tool that implements our SQLIA prevention technique for Java-based Web applications and has
  minimal deployment requirements.
• An empirical evaluation of the technique that shows its effectiveness and efficiency.

   1. String login = getParameter("login");
   2. String pin = getParameter("pin");
   3. Statement stmt = connection.createStatement();
   4. String query = "SELECT acct FROM users WHERE login='";
   5. query += login + "' AND pin=" + pin;
   6. ResultSet result = stmt.executeQuery(query);
   7. if (result != null)
   8.   displayAccount(result); // Show account
   9. else
  10.   sendAuthFailed(); // Authentication failed

Figure 1: Excerpt of a Java servlet implementation.

malicious commands into a vulnerable application [10]. In this section we introduce an example
application that contains an SQL injection vulnerability and show how an attacker can leverage
the vulnerability to perform an SQLIA. Note that the example represents an extremely simple kind
of attack, and we present it for illustrative purposes only. Interested readers may refer to
References [1] and [10] for further examples of the different types of SQLIAs.

The code excerpt in Figure 1 represents the implementation of login functionality that we can
find in a typical Web application. This type of login function would commonly be part of a Java
servlet, a type of Java application that runs on a Web application server, and whose execution is
triggered by the submission of a URL from a user of the Web application. The servlet in the
example uses the input parameters login and pin to dynamically build an SQL query or command.1
The login and pin are checked against the credentials stored in the database. If they match, the
corresponding user’s account information is returned. Otherwise, a null set is returned by the
database and the authentication fails. The servlet then uses the response from the database to
generate HTML pages that are sent back to the user’s browser by the Web server.

Given the servlet code, if a user submits login and pin as “doe” and “123,” the application
dynamically builds the query:

  SELECT acct FROM users WHERE login='doe' AND pin=123

If login and pin match the corresponding entry in the database, doe’s account information is
returned and then displayed by function displayAccount(). If there is no match in the database,
function sendAuthFailed() displays an appropriate error message. An application that uses this
servlet is vulnerable to SQLIAs. For example, if an attacker enters “admin' --” as the user name
and any value as the pin (e.g., “0”), the resulting query is:

  SELECT acct FROM users WHERE login='admin' -- ' AND pin=0

In SQL, “--” is the comment operator, and everything after it is ignored. Therefore, when
performing this query, the database simply searches for an entry where login is equal to admin
and returns that database record. After the “successful” login, the function displayAccount()
would therefore reveal the admin’s account information to the attacker.
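The query construction in lines 4 and 5 of Figure 1 can be replayed outside a servlet to see the
effect of the two inputs discussed in the example. The following is a minimal sketch, not part of
the original servlet: the class name InjectionDemo and the method buildQuery are hypothetical
names introduced here for illustration, and the trailing space after the comment operator in the
attacker's input is an assumption made so that the result matches the attack query shown in the
text. The concatenation is left vulnerable on purpose.

```java
// Hypothetical demo class; buildQuery mirrors lines 4-5 of Figure 1.
class InjectionDemo {
    // Builds the login query by plain string concatenation (intentionally vulnerable).
    static String buildQuery(String login, String pin) {
        String query = "SELECT acct FROM users WHERE login='";
        query += login + "' AND pin=" + pin;
        return query;
    }

    public static void main(String[] args) {
        // Legitimate access: both inputs end up as plain values in the query.
        System.out.println(buildQuery("doe", "123"));
        // Attack: the injected quote closes the literal and "--" comments out the pin check.
        // The trailing space in the input is an assumption of this sketch.
        System.out.println(buildQuery("admin' -- ", "0"));
    }
}
```

Running main prints the legitimate query followed by the attack query, in which everything after
the comment operator no longer takes part in the credential check.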
[Figure 2: High-level overview of the approach and tool. Diagram labels: WASP, String Checker,
MetaStrings library, Additional Trusted Sources and Markings, SQLIA, Data, URL, HTML, Users.]
4.1 The MetaStrings Library

MetaStrings is our library of classes that mimic and extend the behavior of Java’s standard
string classes (i.e., Character, String, StringBuilder, and StringBuffer).3 For each string class
C, MetaStrings provides a “meta” version of the class, MetaC, that has the same functionality as
C, but allows for associating metadata with each character in a string and tracking the metadata
as the string is manipulated at runtime.

The MetaStrings library takes advantage of the object-oriented features of the Java language to
provide complete mediation of string operations that could affect string values and their
associated trust markings. Encapsulation and information hiding guarantee that the internal
representation of a string class is accessed only through the class’s interface. Polymorphism and
dynamic binding let us add functionality to a string class by (1) creating a subclass that
overrides all methods of the original class and (2) replacing instantiations of the original
class with instantiations of the subclass.

As an example, Figure 3 shows an intuitive view of the MetaStrings class that corresponds to
Java’s String class. As the figure shows, MetaString extends class String, has the same internal
representation, and provides the same methods. MetaString also contains additional data
structures for storing metadata and associating the metadata with characters in the string. Each
method of class MetaString overrides the corresponding method in String, providing the same
functionality as the original method, but also updating the metadata based on the method’s
semantics. For example, a call to method substring(2,4) on an object str of class MetaString
would return a new MetaString that contains the characters of str at indices 2 and 3 (its third
and fourth characters) and the corresponding metadata. In addition to the overridden methods,
MetaStrings classes also provide methods for setting and querying the metadata associated with a
string’s characters.

The use of MetaStrings has the following benefits: (1) it allows for associating trust markings
at the granularity level of single characters; (2) it accurately maintains and propagates trust
markings; (3) it is defined completely at the application level and therefore does not require a
customized runtime system; (4) its usage requires only minimal and automatically performed
changes to the application’s bytecode; and (5) it imposes a low execution overhead on the Web
application (see Section 5.3).

The main limitations of the current implementation of the MetaStrings library are related to the
handling of primitive types, native methods, and reflection. MetaStrings cannot currently assign
trust markings to primitive types, so it cannot mark char values. Because we do not instrument
native methods, if a string object is passed as an argument to a native method, the trust marking
associated with the string might not be correct after the call. In the case of hard-coded strings
created through reflection (by invoking a string constructor by name), our instrumenter for
MetaStrings would not recognize the constructors and would not change these instantiations to
instantiations of the corresponding meta classes. However, the MetaStrings library can handle
most other uses of reflection, such as invocation of string methods by name.

In practice, these limitations have little relevance because they concern programming practices
that are not normally used to build SQL commands (e.g., representing strings using primitive char
values). Moreover, during instrumentation of a Web application, we identify and report these
potentially problematic situations to the developers.

3 For simplicity, hereafter we use the term string to refer to all string-related classes and
objects in Java.

4.2 Initialization of Trusted Strings

To implement positive tainting, WASP must be able to identify and mark trusted strings. There are
three categories of strings that
WASP must consider: hard-coded strings, strings implicitly created by Java, and strings
originating from external sources. In the following sections, we explain how strings from each
category are identified and marked.

[Figure 3: Intuitive view of a MetaStrings library class. The diagram shows class String
(character array [ f ][ o ][ o ]...[ r ], method 1 ... method n) and class MetaString, which
inherits the character array and methods 1 ... n and adds Metadata, Update Policies, and the
methods setMetadata, getMetadata, and markAll.]

Hard-Coded Strings. The identification of hard-coded strings in an application’s bytecode is a
fairly straightforward process. In Java, hard-coded strings are represented using String objects
that are created automatically by the Java Virtual Machine (JVM) when string literals are loaded
onto the stack. (The JVM is a stack-based interpreter.) Therefore, to identify hard-coded
strings, WASP simply scans the bytecode and identifies all load instructions whose operand is a
string constant. WASP then instruments the code by adding, after each of these load instructions,
code that creates an instance of a MetaString class using the hard-coded string as an
initialization parameter. Finally, because hard-coded strings are completely trusted, WASP adds
to the code a call to the method of the newly created MetaString object that marks all characters
as trusted. At runtime, polymorphism and dynamic binding allow this instance of the MetaString
object to be used in any place where the original String object would have been used.

Figure 4 shows an example of this bytecode transformation. The Java code at the top of the figure
corresponds to line 4 of our servlet example (see Figure 1), which creates one of the hard-coded
strings in the servlet. Underneath, we show the original bytecode (left), and the modified
bytecode (right). The modified bytecode contains additional instructions that (1) load a new
MetaString object on the stack, (2) call the MetaString constructor using the previous string as
a parameter, and (3) call the method markAll, which assigns the given trust marking to all
characters in the string.

  Source Code:
    4. String query = "SELECT acct FROM users WHERE login='";

  Original Bytecode:
    24.  ldc "SELECT acct FROM users WHERE login='"

  Modified Bytecode:
    24a. new MetaString
    24b. dup
    24c. ldc "SELECT acct FROM users WHERE login='"
    24d. invokespecial MetaString.<init>:(LString)V
    24e. iconst_1
    24f. invokevirtual MetaString.markAll:(I)V

Implicitly-Created Strings. In Java programs, the creation of some string objects is implicitly
added to the bytecode by the compiler. For example, Java compilers typically translate the string
concatenation operator (“+”) into a sequence of calls to the append method of a newly-created
StringBuilder object. WASP must replace these string objects with their corresponding MetaStrings
objects so that they can maintain and propagate the trust markings of the strings on which they
operate. To do this, WASP scans the bytecode for instructions that create new instances of the
string classes used to perform string manipulation and modifies each such instruction so that it
creates an instance of the corresponding MetaStrings class instead. In this case, WASP does not
associate any trust markings with the newly-created MetaStrings objects. These objects are not
trusted per se, and they become marked only if the actual values assigned to them during
execution are marked.

Figure 5 shows the instrumentation added by WASP for implicitly-created strings. The Java source
code corresponds to line 5 in our example servlet. The StringBuilder object at offset 28 in the
original bytecode is added by the Java compiler when translating the string concatenation
operator (“+”). WASP replaces the instantiation at offset 28 with the instantiation of a
MetaStringBuilder class and then changes the subsequent invocation of the constructor at offset
37 so that it matches the newly instantiated class. Because MetaStringBuilder extends
StringBuilder, the subsequent calls to the append method invoke the correct method in the
MetaStringBuilder class.

Strings from External Sources. To use query fragments coming from external (trusted) sources,
developers must list these sources in a configuration file that WASP processes before
instrumenting the application. The specified sources can be of different types, such as files
(specified by name), network connections (specified by host and port), and databases (specified
by database name, table, field, or combination thereof). For each source, developers can either
specify a custom trust marking or use the default trust marking (the same used for hard-coded
strings). WASP uses the information in the configuration file to instrument the external trusted
sources according to their type.

To illustrate this process, we describe the instrumentation that WASP performs for trusted
strings coming from a file. In the configuration file, the developer specifies the name of the
file (e.g., foo.txt) as a trusted source of strings. Based on this information, WASP scans the
bytecode for all instantiations of new file objects (i.e., File, FileInputStream, FileReader) and
adds instrumentation that checks the name of the file being accessed. At runtime, if the name of
the file matches the name(s) specified by the developer (foo.txt in this case), the file object
is added to an internal list of currently trusted file objects. WASP also instruments all calls
to methods of file-stream objects that return strings, such as BufferedReader’s readLine method.
At runtime, the added code checks to see whether the object on which the method is called is in
the list of currently trusted file objects. If so, it marks the generated strings with the trust
marking specified by the developer for the corresponding source.

We use a similar strategy to mark network connections. In this case, instead of matching file
names at runtime, we match hostnames and ports. The interaction with databases is more
complicated and requires WASP not only to match the initiating connection, but also to trace
tables and fields through instantiations of the Statement and ResultSet objects created when
querying the database.

Instrumentation Optimization. Our current instrumentation approach is conservative and may
generate unneeded instrumentation. We could limit the amount of instrumentation inserted in the
code by leveraging static information about the program. For example, data-flow analysis could
identify strings that are not involved
with the construction of query strings and thus do not need to be instrumented. Another example
involves cases where static analysis could determine that the filename associated with a file
object is never one of the developer-specified trusted filenames; such an object would not need
to be instrumented. Analogous optimizations could be implemented for other external sources. We
did not incorporate any of these optimizations in the current tool because we were mostly
interested in having an initial prototype to assess our technique. However, we are planning to
implement them in future work to further reduce runtime overhead.

4.3 Handling False Positives

As discussed in Section 3, sources of trusted data that are not specified by the developers
beforehand would cause WASP to generate false positives. To assist the developers in identifying
data sources that they initially overlooked, WASP provides a special mode of operation, called
“learning mode”, that would typically be used during in-house testing. When in learning mode,
WASP adds an additional unique taint marking to each string in the application. Each marking
consists of an ID that maps to the fully qualified class name, method signature, and bytecode
offset of the instruction that instantiated the corresponding string.

If WASP detects an SQLIA while in learning mode, it uses the markings associated with the
untrusted SQL keywords and operators in the query to report the instantiation point of the
corresponding string(s). If the SQLIA is actually a false positive, knowing the position in the
code of the offending string(s) would help developers correct omissions in the set of trusted
inputs.

4.4 Syntax-Aware Evaluation

The STRING CHECKER module performs syntax-aware evaluation of query strings and is invoked right
before the strings are sent to the database. To add calls to the STRING CHECKER module, WASP
first identifies all of the database interaction points: points in the application where query
strings are issued to an underlying database. In Java, all calls to the database are performed
via specific methods and classes in the JDBC library (http://java.sun.com/products/jdbc/).
Therefore, these points can be identified through a simple matching of method signatures. After
identifying the database interaction points, WASP inserts a call to the syntax-aware evaluation
function, MetaChecker, immediately before each interaction point. MetaChecker takes the
MetaStrings object that contains the query about to be executed as a parameter.

When invoked, MetaChecker processes the SQL string about to be sent to the database as discussed
in Section 3.3. First, it tokenizes the string using an SQL parser. Ideally, WASP would use a
database parser that recognizes the exact same dialect of SQL that is used by the database. This
would guarantee that WASP interprets the query in the same way as the database and would prevent
attacks based on alternate encodings [1] (attacks that obfuscate keywords and operators to elude
signature-based checks). Our current implementation includes parsers for SQL-92 (ANSI) and
PostgreSQL. After tokenizing the query string, MetaChecker enforces the default trust policy by
iterating through the tokens that correspond to keywords and operators and examining their trust
markings. If any of these tokens contains characters that are not marked as trusted, the query is
blocked and reported.

If developers specified additional trust policies, MetaChecker invokes the corresponding checking
function(s) to ensure that the query complies with them. In our current implementation, trust
policies are developer-defined functions that take the list of SQL tokens as input, perform some
type of check on them based on their trust markings, and return a true or false value depending
on the outcome of the check. Trust policies can implement functionality that ranges from simple
pattern matching to sophisticated checks that use externally-supplied contextual information. If
all custom trust policies return a positive outcome, WASP allows the query to be executed on the
database. Otherwise, it classifies the query as an SQLIA, blocks it, and reports it.
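The interplay between per-character trust markings and the default trust policy can be sketched
in a few lines of Java. This is an illustrative sketch under stated assumptions, not the
MetaStrings or MetaChecker code: the classes TrustedText and ToyChecker, the method loginQuery,
and the keyword list are hypothetical, the tokenizer is a toy scanner rather than an SQL parser,
and treating quote delimiters as operators that must be trusted is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a MetaStrings-style class: a string plus one trust flag per character.
final class TrustedText {
    final String value;
    final boolean[] trusted;

    private TrustedText(String value, boolean[] trusted) {
        this.value = value;
        this.trusted = trusted;
    }

    // Gives every character the same marking: true for hard-coded text, false for user input.
    static TrustedText of(String s, boolean isTrusted) {
        boolean[] marks = new boolean[s.length()];
        Arrays.fill(marks, isTrusted);
        return new TrustedText(s, marks);
    }

    // Concatenation keeps the marking of every character from both operands.
    TrustedText concat(TrustedText other) {
        boolean[] marks = Arrays.copyOf(trusted, trusted.length + other.trusted.length);
        System.arraycopy(other.trusted, 0, marks, trusted.length, other.trusted.length);
        return new TrustedText(value + other.value, marks);
    }
}

// Hypothetical toy checker; a real checker would tokenize with a full SQL parser.
final class ToyChecker {
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "SELECT", "FROM", "WHERE", "AND", "OR", "UNION", "INSERT", "UPDATE", "DELETE", "DROP"));

    // Returns true only if every keyword, operator, and quote delimiter is fully trusted.
    static boolean isAllowed(TrustedText q) {
        String s = q.value;
        boolean inLiteral = false;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\'') {                          // quote delimiter must be trusted
                if (!q.trusted[i]) return false;
                inLiteral = !inLiteral;
                i++;
            } else if (inLiteral || Character.isWhitespace(c)) {
                i++;                                  // literal contents and whitespace are unrestricted
            } else if (s.startsWith("--", i)) {       // comment operator; the sketch stops scanning here
                return allTrusted(q, i, i + 2);
            } else if (isWordChar(c)) {
                int end = i;
                while (end < s.length() && isWordChar(s.charAt(end))) end++;
                String word = s.substring(i, end).toUpperCase(Locale.ROOT);
                if (KEYWORDS.contains(word) && !allTrusted(q, i, end)) return false;
                i = end;                              // identifiers and numbers are not checked
            } else {
                if (!q.trusted[i]) return false;      // any other symbol counts as an operator
                i++;
            }
        }
        return true;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    private static boolean allTrusted(TrustedText q, int from, int to) {
        for (int i = from; i < to; i++) {
            if (!q.trusted[i]) return false;
        }
        return true;
    }

    // Rebuilds the query of Figure 1: hard-coded fragments are trusted, user inputs are not.
    static TrustedText loginQuery(String login, String pin) {
        return TrustedText.of("SELECT acct FROM users WHERE login='", true)
                .concat(TrustedText.of(login, false))
                .concat(TrustedText.of("' AND pin=", true))
                .concat(TrustedText.of(pin, false));
    }
}
```

With these definitions, ToyChecker.isAllowed(ToyChecker.loginQuery("doe", "123")) accepts the
query, while the inputs of the attack example are rejected because the injected quote carries no
trust marking. A pin such as 0 OR 1=1 is rejected for the same reason: the keyword OR originates
from user input.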
  SELECT acct FROM users WHERE login = ' doe ' AND pin = 123

Figure 6: Example query 1 after parsing by runtime monitor.

  SELECT acct FROM users WHERE login = ' admin ' -- ' AND pin=0

Figure 7: Example query 2 after parsing by runtime monitor.

Table 1: Subject programs for the empirical study.

  Subject              LOC      DBIs   Servlets   Params
  Checkers             5,421    5      18 (61)    44 (44)
  Office Talk          4,543    40     7 (64)     13 (14)
  Employee Directory   5,658    23     7 (10)     25 (34)
  Bookstore            16,959   71     8 (28)     36 (42)
  Events               7,242    31     7 (13)     36 (46)
  Classifieds          10,949   34     6 (14)     18 (26)
  Portal               16,453   67     3 (28)     39 (46)