Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Software Reverse Engineering Education

Download as pdf or txt
Download as pdf or txt
You are on page 1of 122

San Jose State University

SJSU ScholarWorks
Master's Theses Master's Theses and Graduate Research

2009

Software reverse engineering education


Teodoro Cipresso
San Jose State University

Follow this and additional works at: http://scholarworks.sjsu.edu/etd_theses Recommended Citation


Cipresso, Teodoro, "Software reverse engineering education" (2009). Master's Theses. Paper 3734.

This Thesis is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Theses by an authorized administrator of SJSU ScholarWorks. For more information, please contact Library-scholarworksgroup@sjsu.edu.

SOFTWARE REVERSE ENGINEERING EDUCATION

A Thesis Presented to The Faculty of the Department of Computer Science San Jose State University

In Partial Fulfillment of the Requirements for the Degree Master of Science

by Teodoro Cipresso August 2009

UMI Number: 1478574

All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMT
Dissertation Publishing

UMI 1478574 Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, Ml 48106-1346

2009 Teodoro Cipresso ALL RIGHTS RESERVED

SAN JOSE STATE UNIVERSITY The Undersigned Thesis Committee Approves the Thesis Titled SOFTWARE REVERSE ENGINEERING EDUCATION by Teodoro Cipresso APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE

'(kkJ^
Jr. Mark Stamp, Department of Computer Science Dr. David Taylor, Department of Computer Science

5~
Date

m
Date

S/zo/a?

Dr. Robert Chun,

Department of Computer Science

APPROVED FOR THE UNIVERSITY

Associate Dean

Office of Graduate Studies and Research

Date

ABSTRACT SOFTWARE REVERSE ENGINEERING EDUCATION by Teodoro Cipresso Software Reverse Engineering (SRE) is the practice of analyzing a software system, either in whole or in part, to extract design and implementation information. A typical SRE scenario would involve a software module that has worked for years and carries several rules of a business in its lines of code. Unfortunately the source code of the application has been lost; what remains is "native" or "binary" code. Reverse engineering skills are also used to detect and neutralize viruses and malware as well as to protect intellectual property. It became frighteningly apparent during the Y2K crisis that reverse engineering skills were not commonly held amongst programmers. Since that time, much research has been undertaken to formalize the types of activities that fall into the category of reverse engineering so that these skills can be taught to computer programmers and testers. To help address the lack of software reverse engineering education, several peer-reviewed articles on software reverse engineering, re-engineering, reuse, maintenance, evolution, and security were gathered with the objective of developing relevant, practical exercises for instructional purposes. The research revealed that SRE is fairly well described and most of the related activities fall into one of two categories: software development related and security related. Hands-on reverse engineering exercises were developed in the spirit of these two categories with the goal of providing a baseline education in reversing both Wintel machine code and Java bytecode.

ACKNOWLEDGEMENTS I would like to thank Dr. Mark Stamp for his enduring patience as I struggled to flush out the details of this work. I would also like to thank my committee members, Dr. David Taylor and Dr. Robert Chun, for their support in this effort. Last but not least, I would like to thank my wife Karyn, who has encouraged me throughout my graduate career to persevere through the rough patches, and my cat Freddy, who always kept me company as I typed many suns to sleep.

Table of Contents
1 2 3 4 Introduction Reverse Engineering in Software Development Reverse Engineering in Software Security Reversing and Patching Wintel Machine Code 4.1 Decompilation and Disassembly of Machine Code 4.2 Wintel Machine Code Reversing and Patching Exercise 4.3 Recommended Reversing Tool for the Wintel Exercise 4.4 Animated Solution to the Wintel Reversing Exercise 5 Reversing and Patching Java Bytecode 5.1 Decompiling and Disassembling Java Bytecode 5.2 Java Bytecode Reversing and Patching Exercise 5.3 Recommended Reversing Tool for the Java Exercise 5.4 Animated Solution to the Java Reversing Exercise 6 Basic Anti-Reversing Techniques 7 Applying Anti-Reversing Techniques to Wintel Machine Code 7.1 Eliminating Symbolic Information in Wintel Machine Code 7.2 Basic Obfuscation of Wintel Machine Code 7.3 Protecting Source Code Through Obfuscation 7.4 Advanced Obfuscation of Machine Code 7.5 Wintel Machine Code Anti-Reversing Exercise 7.6 Solution to the Wintel Anti-Reversing Exercise 7.6.1 Encryption of String Literals 7.6.2 Obfuscating the Numeric Representation of the Record Limit 7.6.3 Control Flow Obfuscation for the Record Limit Check 7.6.4 Analysis of the Control Flow Obfuscation Using Run Traces 8 Applying Anti-Reversing Techniques to Java Bytecode 8.1 Eliminating Symbolic Information in Java Bytecode 8.2 Preventing Decompilation of Java Bytecode 8.3 A Java Bytecode Code Anti-Reversing Exercise 8.4 Animated Solution to the Java Bytecode Anti-Reversing Exercise 9 Reengineering and Reuse of Legacy Software Applications 9.1 Legacy Software Reengineering and Reuse Exercise 9.2 Legacy Software Reengineering and Reuse Exercise Solution 10 Identifying, Monitoring, and Reporting Malware 10.1 Malware Identification and Monitoring Exercise 10.2 Malware Identification and Monitoring Exercise Solution Conclusion References 1 3 6 9 11 14 15 17 20 21 25 26 27 29 31 31 35 40 42 44 44 45 47 48 53 56 58 63 68 69 70 84 86 98 106 106 107 109

vi

List of Tables Table 4.1 Result of decompiling HelloWorld.exe using Boomerang. Table 4.2 Quick reference for panes in CPU window of OllyDbg. Table 5.1 Source listing for ListArguments.java. Table 5.2 Java bytecode contained in ListArguments.class. Table 5.3 Jad decompilation of ListArguments.class. Table 7.1 Debugging information inserted into machine code. Table 7.2 Listing of VerifyPassword.cpp and disassembly ofVerifyPassword.exe. Table 7.3 Simple substitution cipher used to protect string constants. Table 7.4 VerifyPasswordObfuscated.cpp and corresponding disassembly. Table 7.5 COBF obfuscation results for VerifyPassword.cpp. Table 7.6 Encrypted strings are decrypted each time they are displayed. Table 7.7 Using a function of the record limit to obfuscate the condition. Table 7.8 Implementation of the control flow obfuscation in Fig. 7.3. Table 7.9 Statistical data gathered for randomized control-flow obfuscation. Table 8.1 Unobfuscated source listing of CheckLimitation.java. Table 8.2 Jad decompilation of ProGuard obfuscated bytecode. Table 8.3 Jad decompilation of SandMark (and ProGuard) obfuscated bytecode. Table 8.4 Listing of DateTime.Java. Table 8.5 Jad decompilation of DateTime.class obfuscated by Zelix Klassmaster. Table 9.1 Sample business logic component to reuse and reengineer. Table 9.2 Interface data structure SMPLCALC-INTERFACE in SMPLCALC.cpy. Table 9.3 XML Schema generated the from COBOL data structure. Table 9.4 Partial listing of SmplCalcJaxbMarshaller.java interaction with JAXB. Table 9.5 Updates to JSimpleCalculator.java in support of JAXB marshalling. Table 9.6 Example native method declaration for the JNI XML bridge. Table 9.7 Example implementation of the Java to COBOL JNI XML bridge. Table 9.8 Implementation of a COBOL XML layer to the legacy application Table 9.9 Example run of the solution code with debug statements turned on. vn 13 16 22 23 24 33 36 38 39 41 45 47 51 56 60 61 62 67 68 76 86 87 90 91 92 93 95 96

List of Figures
Figure 2.1 Figure 2.2 Figure 3.1 Figure 4.1 Figure 4.2 Figure 5.1 Figure 5.2 Figure 7.1 Figure 7.2 Figure 7.3 Figure 7.4 Figure 8.1 Figure 8.2 Figure 9.1 Figure 9.2 Figure 9.3 Figure 9.4 Figure 9.5 Development process for maintaining legacy software. Development related software reverse engineering scenarios. Security related software reverse engineering scenarios. The five panes of the OllyDbg graphical workbench. Sample slide from the machine code reversing animated tutorial. Execution of Java bytecode versus machine code. FrontEnd Plus workbench session for ListArguments, class. Result of obfuscating all string literals in the program. Obfuscated control flow logic for testing the password record limit. Edit-distances between three run traces of the trial limitation check. Usage of opaque predicates to prevent decompilation. Sample slide from the Java antireversing animated tutorial. Layers of a well-structured legacy software application. Mapping legacy functional discriminators to an object-oriented design. Example JCA implementation for accessing a legacy application. Architecture for legacy application reengineering and reuse from Java. Console-based Java interface to the legacy COBOL program. 4 5 8 16 19 21 27 46 51 55 65 70 73 75 79 83 89 102 104 105

Record limit comperands are represented as exponents with a base of 2. 49

Figure 10.1 Process Monitor session for the Password Vault application. Figure 10.2 Example ThreatExpert report summary for submitted malware. Figure 10.3 Console-based UI for the Alarm Clock example software Trojan.

Vlll

1 Introduction
From very early on in life we engage in constant investigation of existing things to understand how and even why they work. The practice of Software Reverse Engineering (SRE) calls upon this investigative nature when one needs to learn how and why, often in the absence of adequate documentation, an existing piece of software helpful or maliciousworks. The sections that follow cover the most popular uses of SRE and, to some degree, the importance of imparting knowledge of them to those who write, test, and maintain software. More formally, SRE can be described as the practice of analyzing a software system to create abstractions that identify the individual components and their dependencies, and, if possible, the overall system architecture [1], [2]. Once the components and design of an existing system have been recovered, it becomes possible to repair and even enhance them. Events in recent history have caused SRE to become a very active area of research. In the early nineties, the Y2K problem spurred the need for the development of tools that could read large amounts of source or binary code for the 2-digit year vulnerability [2]. Shortly after the preparation for the Y2K problem, in the mid to late nineties, the adoption of the Internet by businesses and organizations brought about the need to understand in-house legacy systems so that the information held within them could be made available on the Web [3]. The desire for businesses to expand to the Internet for what was promised to be limitless potential for new revenue caused the creation of many Business to Consumer (B2C) web sites. 1

Today's technology is unfortunately tomorrow's legacy system. For example, the Web 2.0 revolution sees the current crop of web sites as legacy Web applications comprised of multiple HTML pages; Web 2.0 envisions sites where a user interacts with a single dynamic pagerendering a user experience that is more like traditional desktop applications [2]. Porting the current crop of legacy web sites to Web 2.0 will require understanding the architecture and design of these legacy sitesagain requiring reverse engineering skills and tools. At first glance, it may seem that the need for SRE can be lessened by simply maintaining good documentation for all software that is written. While the presence of that ideal would definitely decrease the need; it just has not become a reality. For example, even a company that has brought software to market may no longer understand it because the original designers and developers may have left, or components of the software may have been acquired from a vendor who is no longer in business [1]. Going forward, the vision is to include SRE incrementally, as part of the normal development, or "forward engineering" of software systems. At regular points during the development cycle, code would be reversed to rediscover its design so that the documentation can be updated. This would help avoid the typical situation where detailed information about a software system such as its architecture, design constraints, and trade-offs are found only in the memory of its developer [1].

2 Reverse Engineering in Software Development


While a great deal of software that has been written is no longer in use, a considerable amount has survived for decades and continues to run the global economy. The reality of the situation is that 70% of the source code in the entire world is written in COBOL [3]. One would be hard-pressed these days to obtain an expert education in legacy programming languages like COBOL, PL/I, and FORTRAN. Compounding the situation is the fact that a great deal of legacy code is poorly designed and documented [3]. [6] states that "COBOL programs are in use globally in governmental and military agencies, in commercial enterprises, and on operating systems such as IBM's z/OS, Microsoft's Windows, and the POSIX families (Unix/Linux etc.). In 1997, the Gartner Group reported that 80% of the world's business ran on COBOL with over 200 billion lines of code in existence and with an estimated 5 billion lines of new code annually." Since it's cost-prohibitive to rip and replace billions of lines of legacy code, the only reasonable alternative has been to maintain and evolve the code, often with the help of concepts found in software reverse engineering. Fig. 2.1 illustrates a process a software engineer might follow when maintaining legacy software systems. Whenever computer scientists or software engineers are engaged with evolving an existing system, fifty to ninety percent of the work effort is spent on program understanding [3]. Having engineers spend such a large amount of their time attempting to understand a system before making enhancements is not economically sustainable as a software system continues to grow in size and complexity. To help lessen the cost of program
3

Software Module Enhancement Request

Design Recovery No

Encapsulate Binary and Test Patch Binary and Test

Software Engineer

/ \

Source \ exists?/'' Yes

[ Depl Legacy

^^mSm*! 3t*-^B"

Edit, Compile, and Test

<H| System
L

Deploy .

111

Figure 2.1. Development process for maintaining legacy software, understanding, [3] advises that "practice with reverse engineering techniques improves ability to understand a given system quickly and efficiently." Even though several tools already exist to aid software engineers with the program understanding process, the tools focus on transferring information about a software system's design into the mind of the developer [1]. The expectation is that the developer has enough skill to efficiently integrate the information into their own mental model of the system's architecture. It's not likely that even the most sophisticated tools can replace experience with building mental models of existing software; [4] states "commercial reverse engineering tools produce various kinds of output, but software engineers usually don't how to interpret and use these pictures and reports." The lack of reverse engineering skills in most programmers is a serious risk to the long-term viability of any organization that employs information technology. The problem of software maintenance cannot be dispelled with some clever technique, [7] argues "re-engineering

code to create a system that will not need to be reverse engineered again in the futureis presently unattainable." According to [5], there are four software development related reverse engineering scenarios; the scenarios cover a broad spectrum of activities that include software maintenance, reuse, re-engineering, evolution, interoperability, and testing. Fig. 2.2 summarizes the software development related reverse engineering scenarios.

Achieving Interoperability lAilh Proprietary Software

verification that Implementation Matches Design

Development Related Software Reverse Engineering

Evaluating Software Quality and Robustness

Legacy Software Maintenance, Re-engineering and Evolution

Figure 2.2. Development related software reverse engineering scenarios. The following are tasks one might perform in each of the reversing scenarios [5]: > Achieving Interoperability with Proprietary Software: Develop applications or device drivers that interoperate (use) proprietary libraries in operating systems or applications. > Verification that Implementation Matches Design: Verify that code produced during the forward development process matches the envisioned design by reversing the code back into an abstract design.

> Evaluating Software Quality and Robustness: Ensure the quality of software before purchasing it by performing heuristic analysis of the binaries to check for certain instruction sequences that appear in poor quality code. > Legacy Software Maintenance, Re-engineering, and Evolution: Recover the design of legacy software modules when source is not available to make possible the maintenance, evolution, and reuse of the modules.

3 Reverse Engineering in Software Security


From the perspective of a software company, it is highly desirable that the company's products are difficult to pirate and reverse engineer. Making software difficult to reverse engineer seems to be in conflict with the idea of being able to recover the software's design later on for maintenance and evolution. Therefore, software manufacturers usually don't apply anti-reverse engineering transformations to software binaries until it is packaged for shipment to customers. Software manufacturers will typically only invest time in making software difficult to reverse engineer if there are particularly interesting algorithms that make the product stand out from the competition. Making software difficult to pirate or reverse engineer is often a moving target and requires special skills and understanding on the part of the developer. Software developers who are given the opportunity to practice anti-reversing techniques might be in a better position to help their employer, or themselves, protect their intellectual property. As [3] states, "to defeat a crook you have to think like one." By reverse engineering viruses or other malicious software, programmers can learn their inner 6

workings and witness first-hand how vulnerabilities find their way into computer programs. Reversing software that has been infected with a virus, is a technique used by the developers of anti-virus products to identify and neutralize new viruses or understand the behavior of malware. Programming languages like Java, which do not require computer programmers to manage low-level system details, have become ubiquitous. As a result, computer programmers have increasingly lost touch with what happens in a system during execution of programs. [3] suggests that programmers can gain a better and deeper understanding of software and hardware through learning reverse engineering concepts. Hackers and crackers have been quite vocal and active in proving that they possess a deeper understanding of low-level system details than their professional counterparts [3]. According to [5], there are four software security related reverse engineering scenarios. Similar to development related reverse engineeringthe scenarios cover a broad spectrum of activities: ensuring that software is safe to deploy and use, protecting clever algorithms or business processes, preventing pirating of software and digital media such as music, movies, and booksand making sure that cryptographic algorithms are not vulnerable to attacks. Fig. 3.1 summarizes the software security related reverse engineering scenarios. The following are tasks one might perform in each of the reversing scenarios [5]: > Detecting and Neutralizing Viruses and Malware: Detect, analyze, or neutralize (clean) malware, viruses, spyware, and adware.
7

Testing Cryptographic Algorithms for Weaknesses: Test the level of data security provided by a given cryptographic algorithm by analyzing it for weaknesses. Testing DRM or License Protection (anti-reversing): Protect software and media digital-rights through application and testing of anti-reversing techniques. Auditing the Security of Program Binaries: Audit a program for security vulnerabilities without access to the source code by scanning instruction sequences for potential exploits.

Detecting and Neutralizing Viruses and Malware

Testing Cryptogenic Algorithms for Weaknesses

i
Security Related Software Reverse Engineering

Testing DRM or License Protection ^nti-re versing)

Auditing the Security of Program Binaries (lAithout source code)

Figure 3.1. Security related software reverse engineering scenarios.

4 Reversing and Patching Wintel Machine Code


The executable representation of software, otherwise known as machine code, is typically the result of translating a program written in a high-level language, using a compiler, to an object file, a file which contains platform-specific machine instructions. The object file is made executable using linker, a tool which resolves the external dependencies that the object file has, such as operating system libraries. In contrast to high-level languages, there are low-level languages which are still considered to be highlevel by a computer's CPU because the language syntax is still a textual or mnemonic abstraction of the processor's instruction set. For example, assembly language, a language that uses helpful mnemonics to represent machine instructions, still must be translated to an object file and made executable by a linker. However the translation from assembly code to machine code is done by an assembler instead of a compiler reflecting the closeness of the assembly language's syntax to actual machine code. The reason why compilers translate programs coded in high-level and low-level languages to machine code is three-fold: CPUs only understand machine instructions, having a CPU dynamically translate higher-level language statements to machine instructions would consume significant, additional CPU time, and (3) a CPU that could dynamically translate multiple high-level languages to machine code would be extremely complex, expensive, and cumbersome to maintainimagine having to update the firmware in your microprocessor every time a bug is fixed or a feature is added to the C++ language! 9

To relieve a high-level language compiler from the difficult task of generating machine instructions, some compilers do not generate machine code directly, instead they generate code in a low-level language such as assembly [8]. This allows for a separation of concerns where the compiler doesn't have to know how to encode and format machine instructions for every target platform or processorit can instead just concentrate on generating valid assembly code for an assembler on the target platform. Some compilers, such as the C and C++ compilers in the GNU Compiler Collection (GCC), have the option to output the intermediate assembly code that the compiler would otherwise feed to the assemblerallowing advanced programmers to tweak the code [9]. Therefore the C and C++ compilers in GCC are examples of compilers that translate high-level language programs to assembly code instead of machine code; they rely on an assembler to translate their output into instructions the target processor can understand. [9] outlines the compilation process undertaken by GCC compiler to render an executable file is as follows: > Preprocessing: Expand macros in the high-level language source file. > Compilation: Translate the high-level source code to assembly language. > Assembly: Translate assembly language to object code (machine code). > Linking (Create the final executable): > Statically or dynamically link together the object code with the object code of the programs and libraries it depends on.

10

> Establish initial relative addresses for the variables, constants, and entry points in the object code.

4.1 Decompilation and Disassembly of Machine Code


Having an understanding of how high-level language programs become executables can be extremely helpful when attempting to reverse engineer machine code. Most software tools that assist in reversing executables work by translating the machine code back into assembly language. This is possible because there exists a one-to-one mapping from each assembly language instruction to a machine instruction [10]. A tool that translates machine code back into assembly language is called a disassembler. From a reverse engineer's perspective the next obvious step would be to translate assembly language back to a high-level language, where it would be much less difficult to read, understand, and alter the program. Unfortunately, this is an extremely difficult task for any tool because once high-level language source code is compiled down to machine code, a great deal of information is lost. For example, one cannot tell by looking at the machine code which high-level language (if any) the machine code originated from. Perhaps knowing a particular quirk about a compiler might help a reverse engineer identify some machine code that it had a hand in creating, but this is not a reliable strategy. The greatest difficulty in reverse engineering machine code comes from the lack of adequate decompilerstools that can generate equivalent high-level language source code from machine code. The paper [5] argues that it should be possible to create good 11

decompilers for binary executables, but recognizes that other experts disagreeraising the point that some information is "irretrievably lost during the compilation process." Boomerang is a well-known open-source decompiler project that seeks to one day be able to decompile machine code to high-level language source code with respectable results [11]. For those reverse engineers interested in recovering the source code of a program, decompilation may not offer much hope because as [11] states "a general decompiler does not attempt to reverse every action of the compiler, rather it transforms the input program repeatedly until the result is high level source code. It therefore won't recreate the original source file; probably nothing like it." To get a sense of the effectiveness of Boomerang as a reversing tool, a simple program, HelloWorld.c was compiled and linked using the GNU C++ compiler for Microsoft Windows and then decompiled using Boomerang. The C code generated by the Boomerang decompiler when given HelloWorld.exe as input was quite disappointing: the generated code looked like a hybrid of C and assembly language, had countless syntax errors, and ultimately bore no resemblance to the original program. Table 4.1 contains the source of HelloWorld.c and some of the code generated by Boomerang. Incidentally, the Boomerang decompiler was unable to produce any output when HelloWorld.exe, was built using Microsoft's Visual C++ 2008 edition compiler. The full length of the C code generated by Boomerang for the HelloWorld.exe program contained 180 lines of confusing, nonsensical control structures and function calls to undefined methods. It is surprising to see such a poor decompilation result, but as 12

Table 4.1. Result of decompiling HelloWorld.exe using Boomerang. HelloWorld.c: 01 02 03 04 05 06 #include <stdio.h> int main(int argc, char *argv[]) { printf("Hello Boomerang World\n"); return 0;

Boomerang decompilation of HelloWorld.exe (abbreviated): 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 union { size32[] x83; unsigned int x84; } globallO; size32 global3 = - 1 ; / / 4 bytes // address: 0x401280 void _start() { set__app_type () ; procl(); } // address: 0x401150 void procl() { size32 eax; // r24 size32 ebp; // r29 size32 ebx; // r27 int ecx; // r25 int edx; // r2 6 int esp; // r28 SetUnhandledExceptionFilter(); proc5(pc, pc, 0x401000, ebx, ebp, eax, ecx, edx, ebx, esp - 4, SUBFLAGS32(esp - 44, 4, esp - 4 8 ) , esp - 48 == 0, (unsigned int) (esp - 44) < 4 ) ;

22:

[11] states: "Machine code decompilation, unlike Java/.NET decompilation, is still a very immature technology." To ensure that decompilation was given a fair trial, another decompiler was tried on the HelloWorld.exe executable. The Reversing Engineering Compiler or REC is both a compiler and a decompiler that claims to be able to produce a "C-like" representation of machine code [12]. Unfortunately, the results of the decompilation using REC were similar to that of Boomerang. Based on the current state 13

of decompilation technology for machine code, using a decompiler to recover the highlevel language source of an executable doesn't seem feasible; however, because of the one-to-one correspondence between machine code and assembly language statements [10], we can obtain a low-level language representation. Fortunately there are graphical tools available that not only include a disassembler, a tool which generates assembly language from machine code, but also allow for debugging and altering the machine code during execution.

4.2 Wintel Machine Code Reversing and Patching Exercise


Imagine that you have just implemented a C/C++ version of a Windows 32-bit console application called "Password Vault" that helps computer users create and manage their passwords in a secure and convenient way. Before releasing a limited trial version of the application on your company's Web site, you would like to understand how difficult it would be for a reverse engineer to circumvent a limitation in the trial version that exists to encourage purchases of the full version; the trial version of the application limits the number of password records a user may create to five. The C++ version of the Password Vault application (included with this text) was developed to provide a non-trivial application for reversing exercises without the myriad of legal concerns involved with reverse engineering software owned by others. The Password Vault application employs 256-bit AES encryption, using the free cryptographic library crypto++ [17], to securely store passwords for multiple users each in separate, encrypted XML files. By default, the Makefile that is used to build the 14

Password Vault application defines a constant named "TRIALVERSION" which causes the resulting executable to limit the number of password records a user may create to only five, using conditional compilation. This limitation is very similar to limitations found in many shareware and trialware applications that are available on the Internet.

4.3 Recommended Reversing Tool for the Wintel Exercise


OllyDbg is a shareware interactive machine code debugger and disassembler for Microsoft Windows [13]. The tool has an emphasis on machine code analysis which makes it particularly helpful in cases where the source code for the target program is unavailable [13]. Fig. 4.1 illustrates the OllyDbg graphical workbench. OllyDbg operates as follows: the tool will disassemble a binary executable, generate assembly language instructions from machine code instructions, and perform some heuristic analysis to identify individual functions (methods) and loops. OllyDbg can open an executable directly, or attach to one that is already running. The OllyDbg workbench can display several different windows which are made visible by selecting them on the View menu bar item. The CPU window, shown in Fig. 4.1, is the default window that is displayed when the OllyDbg workbench is started. Table 4.2 lists the panes of the CPU window along with their respective capabilities; the contents of the table are adapted from the online documentation provided by [13] and experience with the tool.

15

Disassembler

Registers
-J_
Plugins Options Window Help

Oily Dbg PassworriVault.exe - [CPU - main thread, module Password] F M ! 3 t 3 | C | File View\Oebug

.__.

...

6jjlix
liMMfe!:M B84012S1 00481283 004S1286 0040128D 00401293 00401298 004EU299 004012R9 004012fil 00401203

* ss
. . . . . . . . . .

PUSH EBP now EBP,ESP 89E5 SUB ESP 8 83EC 08 C70424 010000 MOU DWORD PTR SS:[ESP],1 FF1S 9892S900 CALL DWORD PTR DS: [<&msucrt._ E8 B8FEFFFF CALL Passuord.00401150 90 NOP 8DB426 000000 LEfl ESI,DWORD PTR DSlCESI] PUSH EBP SS MOU EBP,ESP 89ES 83EC 08 SUB ESP, 8
MOU niiinRn P T B <W. T F ^ P I ?

mLMMMM

EBP=3022FFF0

EBX 80000000 ECX 0022FFB0 EDX 7C90E4F4 r v EBX 7FFDDB00 ESP 6022FFC4 EBP S022FFF0 ESI FFFFFFFF EDI 7C910208 n EIP 00401280 P C 0 ES 8023 3 P 1 CS 001B 3 i! SSjafiS

Dump

Information

Stack

Figure 4.1. The five panes of the OllyDbg graphical workbench.

Table 4.2. Quick reference for panes in CPU window of OllyDbg.

Pane Disassembler

Capabilities > Edit, debug, test, and patch a binary executable using actions available on a popup menu. > Patch an executable by copying edits to the disassembly back to the binary. > Display the contents of memory or a file in one of 7 predefined formats: byte, text, integer, float, address, disassembly, or PE Header. > Set memory breakpoints (triggered when a particular memory location is read from or written to). > Locate references to data in the disassembly (executable code). > Decode and resolve the arguments of the currently selected assembly instruction in the Disassembler pane. > Modify the value of register arguments. 16

Dump

Information

Registers

> View memory locations referenced by each argument in either the Disassembler of Dump panes. > Decodes and displays the values of the CPU and FPU (FloatingPoint Unit) registers for the currently executing thread. > Floating point register decoding can be configured for MMX (Intel) or 3DNow! (AMD) multimedia extensions. > Modify the value of CPU registers. > Display the stack of the currently executing thread. > Trace stack frames. In general, stack frames are used to: Restore the state of registers and memory on return from a call statement. Allocate storage for the local variables, parameters, and return value of the called subroutine. Provide a return address.

Stack

4.4 Animated Solution to the Wintel Reversing Exercise


Using OllyDbg, one can successfully reverse engineer a non-trivial Windows application like Password Vault, and make permanent changes to the behavior of the executable. The purpose of placing a trial limitation in the Password Vault application is to provide a concrete objective for reverse engineering the application: disable or relax the trial limitation. Of course the goal here is not teach how to avoid paying for software, but rather to see oneself in the role of a tester, a tester who is evaluating how difficult it would be for reverse engineer to circumvent the trial limitation. This is a fairly relevant exercise to go through for any individual or software company that plans to provide trial versions of their software for download on the Internet. In later sections, we discuss antireversing techniques, which can significantly increase the difficulty a reverse engineer will encounter when reversing an application. 17

For instructional purposes, an animated tutorial that demonstrates the complete end-to-end reverse engineering of the C/C++ Password Vault application was created using Qarbon Viewlet Builder and can be viewed using Macromedia Flash Player. The tutorial begins with the Password Vault application and OllyDbg already installed on a Windows XP machine. Fig. 4.2 contains an example slide from the animated tutorial. The animated tutorial, source, and installer for the machine code version of Password Vault can be downloaded from the following locations: > Wintel Reversing & Patching Animated Solution: http://reversingproject.info/repository.php?fileID=4_l_l > Password Vault C/C+ + Source code: http://reversingproject.info/repository.php?filelD=4_l _2 > Password Vault C/C++ Windows installer: http://reversingproject.info/repository.php?filelD=4_l_3 Begin viewing the animated tutorial by extracting passwordjvault_cpp _reversing_exercise.zip to a local directory and either running password_vault_cpp_reversing_exercise.exe which should launch the standalone version

of Macromedia Flash Player, or by opening the file passwordjvault_cpp_reversing_exercise_yiewlet._swf.html in a Web browser.

18

OlIvDbg - PasswordVault.e e - [Dump - Password: rdala 0 0 5 5 D 0 0 0 [ p j File View Debug Plugins Options Window Help

---

00590FFF)

HHQ

klfii*]
i

; 8 a S b D @ 0 | 7 B 2P. 7D 0 0 2 B 2 D 2 D 2 D - 2 D 2D 2 D 2 D - 2 D 2D 2D 2D j j f s c . c n n i n l i n sn ^ n ^ n -?n sn ? n ? n ? n 7 n 5 n ? n ? n ? n ? n -pn i J9

|M|X| VJJ|J *||_*|| !|1H ^1 +& JJ:E]H[T|"W|H|C|/ JC]BJR]...| S | m a ? ] "YJ>~Tr"_i____r_r 1 e

il

a la c

The Dump window allows for searching through the data and setting a breakpoint that is triggered when data beginning at a specified address is read or modified.
/4 7w idu aj M m w m. AI W W &fa)-& 4 3 f l 20 74 6 5 6F 6416F 72 6F 40 72 65
73 26 53 74 6F 20 6C 74 63 63 27 2C 74 6F 20 72 6F 2R 73 69 63 6E 6F 65 6E 6F 6E 68 72 6F 6E 6D ?? 6F SO 6E 20 54 t<> 69 20 20 69 2B 00 74 65 6F 74 2C 20 29 72 61 64 6E 7D 77 74 6F 29 72 29 61 74 67 65 65 72 63 20 69 72 20 75 77 68 -7A 6E 6 7 7 0 20 28 20 32 35 36 6F:6E 29 2 8 2@ 3 5 43 68 6 1 20!50 61 29161 20 72 64 00 65i 72 7 3 20 27 22 6 1 6E 6 4 6 1 6C 6C 64 28 64 29 50 61 80 54 68 29 77 61 2 2 2E 0 0 6F 7 2 6 4 2 0 61 2 8 72 64 09 65 78 69 64 20 76 73 70 6S 60i65 20 2 0 6 6 6F 2 0 6E 6 5 20 73 70 6E 74 20 64 20 65 6F 7 2 7 2 7 7 6 9 6C 74!68 6F 20^73 61 9 9 ! 4 1 6E 29!6F 70 61 73 20 6 5 ; 2 0 6E fiC -Sffl 'il 72 20 2D 75 2E 6E 73 50 00 20 27 20 6F 61 73 65 73 44 28 50 5B 73 61 63 22 75 77 65 76 6E 65 6C 75 66 20 74 73 61 7D 6F;6fl 29;7C 42:69 73!69 3512E 67;65 73 77 61- 73 09 54 27 30 2C 2 8 2 7 ; 3E 77: 65 74!61 73177 29!64 20! 73 69 73 52 65 61 73 45 72 74:69 75!6C 69!66 7E!2fl 6E! 6 4 20 76 63 69 61 75 74 65 63 74 20!6E 74!29 65! 74 69L6E 69:6F 79 65 6D 6 5 "51-1 7 n 65 63 90 00 74 2 0 6E 6 7 32 20 20 74 6F 7 2 73 77 68 65 20 72 27 2F 27 20 64 20 2E 0 9 6F 7 2 65 73 65 74 7 0 6C 6 3 6F 73 77 7 2 6F 6E 6 7 74 20 69 65 7D 2 2 2D 2D 61 75 66 69 6C 7 4 72 65 2E 2 9 6F 77 73 61 7 9 2E 76 61 6E 2 9 63 69 20 77 ->-3 1C 7 4 2E 6 9 0 0 7C 2 0 45 6E 63 29 43 72 20;7C 00 68,65 20 64;00 43 6F:72 64 20;63 68 65 74 75 27,2C 29 6 1 : 72 6 5 69;6E 29 4 4 6 5 6C 64 20 52 6 3 i 7 2 69 20s 74 6F 61i 79 29 72 64 73 6F; 72 6 4 72'5D 29 20;70 61 66.6F 72 6 4 ; 2 0 75 29;77 61 61s 7 3 7 3 6 C ! 7 4 2E 65,64 20 20;70 61 64 28 69 59 7 2 6F 29^65 78 7 6 6 9 6E 06;5B 49 6C!69 64 6 E ! 7 5 6D 66 69 65 6 1 ; 7 320 Cta AC CC

3
1

0 0 OBSSITRJB 1 00=bCHO 63 Q85ED120 7 2 8 0 i S L l : O 6F 88550140 45 : (305501 "0 7 0 :00550160 74 : 8855D178 29 80S5D130 75 O055D138 6 1 00b5DlP8 65 ' C1055C1B0 6 1 .6855D3C0 20 86?5D1D0 27 8 0 5 5 0 i E O 6F eSE^DlFO 63 , 00550JO0 65 , 0G5SD218 6 F GO5SD220 6 9 O05SD2 30 ?B 88550240 73 0055029? 64 88SSD260 6 5 SOtEDl^O 41 85502f:n 77 00E^:J:::<O 68 0855D2R9 7 2 1 O 0 5 5 D 2 B B 6E .0O5SD2Ce 6 9 i OO55D2D0 5 4 0055D2E0 72 ,09S50ZFO 77 0 8 5 5 0 : 0 0 69 0855?^10 61 88SSD320 29 0055O3EO 6 6 0 8 5 5 0 3 4 8 6F on^DBse 6 5 GQSSCSuB 7 2 O u S S O ? ^ 90 tA

C IS i xy ;... ; ion /A M I 76 65 c t : t eodoro@reve 6E 6 6 r s i n g p r o j e c t . i n f 20 41 0 I... ! fi 7 2 7 9 ES 2 5 6 - B i t E n c r y 7 9 7 8 Dt Ion u s i n g C r y p 28 00 to++ 5 . 5 . 2 ! . ( . 56 61 ) .Change t h e U a 72 65 u l t Password.Cre 26 52 a t e a Password R 61 72 e c o r d . . . T h e char 7 2 6E a c t e r s ' : r e t u r n 2 7 3C 2 0 6E ' , a n d ' > ' a r e n 72 65 ot a l l o w e d in r e 65 74 c o r d d a t a . . D e l e t 6 5 6 3 e a P a s s w o r d Rec 70 74 ord.The de-script 2 0 2 2 i o n was s e t t o " 5 0 6 1 { # > " . . D i s p l a y Pa 0 0 45 s s w o r d R e c o r d s . E 20 52 d i t a Password R 00 90 e c o r d . [ E r r o r ] . . 7 3 7 3 fin en i s t i n g p a s s 2 0 74 w o r d v a u I t f o r t 7 3 6 5 h e s p e c i f Led u s e 7 3 2 9 rnaroe " < # > " was 7 5 6D n o t f o u n d a s s u c i 8 8 8 9 i n g new v a u i t . . . 6 3 7 5 T h e s p e c i f i e d CM 73 73 r r e n t v a u l t pass 73 29 word e n t e r e d i s Progr 67 72 i n c o r r e c t . 6 9 7 4 a n u 1 1 L now en i t without saving 67 29 safety..[Inf 6E 66 f o r 2 0 6D 0 ] . O n i n v a t i d n 6 2 65 enu o p t i o n nunbe 6 4 2E r was s p e c i f l e d . 7 3 6 5 . T h e nacie w a s s e It* -Vi 1 " _ ;.-;_ .';-

Backup Search foi S o to address * Hex Tex* Shrei

fca

y,,'/ "ST" Dif*-G

*v

Use the context menu to begin a search.


J

<>V.

tons
Float Disassemble Special

Appearance

j/^achedpracesspaused^rtol.Dl^t^.Ppint'

-,". ':;-- '- ' . " .':'." ' * - "

.; J

'

-J Paused

Figure 4.2. Sample slide from the machine code reversing animated tutorial.

19

5 Reversing and Patching Java Bytecode


Applications written in Java are generally well-suited to being reverse engineered. To understand why, it's important to understand the difference between machine code and Java bytecode (Fig. 5.1 illustrates the execution of Java bytecode versus machine code): > Machine code: "Machine code or machine language is a system of instructions and data executed directly by a computer's central processing unit" [14]. Machine code contains the platform-specific machine instructions to execute on the target processor. > Java bytecode: "Bytecode is the intermediate representation of Java programs just as assembler is the intermediate representation of C or C++ programs" [15]. Java bytecode contains platform-independent instructions that are translated to platform-specific instructions by a Java Virtual Machine. In Section 4, an attempt to recover the source of a simple "Hello World" C++ application was unsuccessful when executables built using two different compilers were given as input to the Boomerang decompiler. Much more positive results can be achieved for Java bytecode because of its platform-independent design and high-level representation. On Windows, machine code is typically stored in files with the extensions *.exe, *.dll; the file extensions for machine code vary per operating system. This is not the case with Java bytecode as it is always stored in files that have a * class extension. Related Java classes, such as those for an application or class library, are often bundled together in an archive file with a *.jar extension. The Java Language Specification allows at most one
20

top-level public class to be defined per *.java source file and requires that the bytecode be stored in a file with whose name matches the pattern TopLevelClassName.class.

Java bytecode JVM instruction

Machine instruction

Java Virtual Machine (JVM)

SSHilBH

Machine code

CPU
Figure 5.1. Execution of Java bytecode versus machine code.

5.1 Decompiling and Disassembling Java Bytecode


To demonstrate how much more feasible it is to recover Java source code from Java bytecode than it is to recover C/C++ code from machine code, we decompile the bytecode for the program List Arguments.Java using Jad, a Java decompiler which can be found here [16]; we then compare the generated Java source with the original. Before performing the decompilation we peek at the bytecode usingy'avap to get an idea of how much information survives the translation from high-level Java source to the intermediate format of Java bytecode. Table 5.1 contains the source code for List Arguments.Java, a simple Java program that echoes each argument passed on the command-line to standard output.

21

Table 5.1. Source listing for ListArguments.java. 01 02 03 04 05 06 07 08 09 package info.reversingproject.listarguments; public class ListArguments { public static void main(String[] arguments){ for (int i = 0; i < arguments.length; i++) { System.out.println("Argument[" + i + " ] : " + arguments[i])

Bytecode is stored in a binary format that is not human-readable and therefore must be "disassembled" in order to be read. Recall that the result of disassembling machine code is assembly language that can be converted back into machine code using an assembler; unfortunately, the same does not hold for disassembling Java bytecode. Sun Microsystem's Java Development Toolkit (JDK) comes with javap a command-line tool for disassembling Java bytecode; to say that javap "disassembles" bytecode is a bit of a misnomer since the output of javap is unstructured text which cannot be converted back into bytecode. The output of javap is nonetheless useful as a debugging and performance tuning aid since one can see which JVM instructions are generated from high-level Java language statements. Table 5.2 lists the Java bytecode for the main method of ListArguments class; notice that the fully qualified name of each method invoked by the bytecode is preserved. It may seem curious that while ListArguments.java contains no references to the class java.lang.StringBuilder, there are many references to it in the bytecode; this is because the use of the "+" operator to concatenate strings is a convenience offered by the Java language that has no direct representation in bytecode. To perform the concatenation, the 22

bytecode creates a new instance of the StringBuilder class and invokes its append method for each occurrence of the "+" operator in the original Java source code (there are three). A loss of information has indeed occurred, but we'll see that it's still possible to generate Java source code equivalent to the original in function, but not in syntax.
Table 5.2. Java bytecode contained in ListArguments.class. 0 1 2 3 4 5 8 i:.:
11 1:

113: IE 3: 2( ): 2: 1: 2i- :: 2" 7: 3: 25 3: >: 3: 1: 3^:: 3f 5: 311: 4: L: 4^1: 4' 7: 5( ):

iconst 0 istore 1 iload 1 aload 0 arraylength if icmpge getstatic new #3; dup invokespecial ldc #5; invokevirtual iload 1 invokevirtual ldc #8; invokevirtual aload 0 iload_l aaload invokevirtual invokevirtual invokevirtual iinc 1, 1 goto 2 return

50 #2;

// java/lang/System.out // java/lang/StringBuilder // java/lang/StringBuilder.init // "Argument[" // java/lang/StringBuilder.append // java/lang/StringBuilder // "] :" // java/lang/StringBuilder.append

#4; #6; #7; #6;

#6; // java/lang/StringBuilder.append #9; // java/lang/StringBuilder.toString #10; // java/io/PrintStream.println

Table 5.3 lists the result of decompiling ListArguments.class using Jad; while the code is different from the original List Arguments.Java program, it is functionally equivalent and syntactically correct, which is a much better result than that seen earlier with decompiling machine code.

23

Table 5.3. Jad decompilation of ListArguments.class.


01

02 03 04 05 06 07 08 09 10 11 12:

package info.reversingproject.listarguments; import Java.io.PrintStream; public class ListArguments { public static void main(String args[]) { for (int i = 0; i < args.length; i++) System.out.printIn((new StringBuilder()).append("Argument[") .append(i) .append("] :") .append(args[i]) .toString ()); }

An advanced programmer who is fluent in the Java Virtual Machine specification could use a hex editor or a program to modify Java bytecode directly, but this is similar to editing machine code directly, which is error-prone and difficult. In Section 4, which covered reversing and patching of machine code, it was determined through discussion and an animated tutorial that one should work with disassembly to make changes to a binary executable. However, the result of disassembling Java bytecode is a pseudoassembly language, a language that cannot be compiled or assembled but serves to provide a more abstract, readable representation of the bytecode. Being that directly editing bytecode is difficult, and that disassembling bytecode results in pseudo-assembly which cannot be compiled, it would seem that losing Java source code is more dire of a situation than losing C/C++ source code, but of course this is not the case because, as we've seen using Jad, Java bytecode can be successfully decompiled to equivalent Java source code.

24

5.2 Java Bytecode Reversing and Patching Exercise


This section introduces an exercise that is the Java Bytecode equivalent of the exercise given in Section 4.2 for Wintel machine code. Imagine that you have just implemented a Java version of a console application called "Password Vault" that helps computer users create and manage their passwords in a secure and convenient way. Before releasing a limited trial version of the application on your company's Web site, you would like to understand how difficult it would be for a reverse engineer to circumvent a limitation in the trial version that exists to encourage purchases of the full version; the trial version of the application limits the number of password records a user may create to five. The Java version of the Password Vault application (included with this text) was developed to provide a non-trivial application for reversing exercises without the myriad of legal concerns involved with reverse engineering software owned by others. The Java version of the Password Vault application employs 128-bit AES encryption, using Sun's Java Cryptography Extensions (JCE), to securely store passwords for multiple users each in separate, encrypted XML files.

25

5.3 Recommended Reversing Tool for the Java Exercise


If using Jad from the command-line doesn't sound appealing there is a freeware graphical tool built upon Jad called FrontEnd Plus that provides a simple workbench for decompiling classes and browsing the results [16]; it also has a convenient batch mode where multiple Java class files can be decompiled at once. After editing the Java generated by Jad, it's necessary to recompile the source back to bytecode in order to integrate the changes. The ability to recompile the generated Java is not functional in the FrontEnd Plus workbench for some reason, though it's simple enough to do the compilation manually. Next we mention an animated tutorial for reversing a Java implementation of the Password Vault application, which was introduced in Section 4. Fig. 5.2 shows a FrontEnd Plus workbench session containing the decompilation of List Arguments, class. To demonstrate using the FrontEnd Plus to reverse engineer and patch a Java bytecode, a Java version of the Password Vault application was developed; recall that the animated tutorial in Section 4 introduced the machine code (C++) version. The Java version of the Password Vault application uses 128-bit instead of 256-bit AES encryption because Sun Microsystem's standard Java Runtime Environment (JRE) does not provide 256-bit encryption due to export controls. A trial limitation of five password records per users is also implemented in the Java version. Unfortunately, Java does not support conditional compilation, so the source code cannot be compiled to omit the trial limitation without manually removing it or using a custom build process. 26

1*1 FiontEnd Plus v1.04 - ListArguments Java - [ListArgu men ts. Java] File Edit Search Window Tools and Options Feedback Homepages Help i i

7]5|x)

! : A i 4 i c P | [ b i ^ a B f i i a M | #4 ft V
!: jjL |3* I B ^

! Courier New
..
A

I10
i

- l ^ s m l i f ^ l

0001 0002 B - O Imports ! 0003 - -*l java.io.PrintStream; 0004 B-CS Methods 0005 % ListArguments() 0006 $ static void main(String argsfl) 0 0 0 7 0008 0009 0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 0020 0021

/ / Front End Plus GUI for JAB / / Decompiled : ListArgzunents.


package

class

info.reversingproject.listarguments;

Import j a v a . i o . P t r i n t S t . r e a m ; public class


{

ListArguments

p u b l i c L i s t A r g u m e n t s ()
{

iJ
args[])

>
public s t a t i c void main(String
{

f o r ( i n t i = 0 ; i < a r g s . l e n g t h ; i++) S y s t e m . o u t . p r i n t I n ( (new S t r i n g B u i l d e r ( ) ) . append!"Argument[")- append(i) .append("J append ( a r g s [ i] ) . t o S t r i n g ( ) ) ;

")

>

>
Martin Cowley

<l 1
j

zJ 1

Figure 5.2. FrontEnd Plus workbench session for ListArguments.class.

5.4 Animated Solution to the Java Reversing Exercise


Using FrontEnd Plus (and Jad), one can successfully reverse engineer a nontrivial Java application like Password Vault, and make permanent changes to the behavior of the bytecode. Again, the purpose of having placed a trial limitation in the Password Vault application is to provide an opportunity for one to observe how easy or difficult it is for a reverse engineer to disable the limitation. Just like for machine code, anti-reversing strategies can be applied to Java bytecode. We cover some basic, effective strategies for protecting bytecode from being reverse engineered in a later section. For instructional purposes, an animated solution that demonstrates the complete

27

end-to-end reverse engineering of the Java Password Vault application was created using Qarbon Viewlet Builder and can be viewed using Macromedia Flash Player. The tutorial begins with the Java Password Vault application, FrontEnd Plus, and Sun's Java JDK vl.6 installed on a Windows XP machine. Fig. 5.3 contains an example slide from the animated tutorial. The animated tutorial, source, and installer for the Java version of Password Vault can be downloaded from the following locations: > Java Bytecode Reversing & Patching Animated Solution: http://reversingproject.info/repository. php?fileID=5_4_l > Password Vault Java Source code: http://reversingproject.info/repository.php?fileID=5_4_2 > Password Vault (Java Version) Windows installer: http://reversingproject.info/repository.php?fileID=5_4_3 Begin viewing the tutorial by extractmgpassword_vaultjava_reversing_exercise.zip local directory and either running password_vaultjava_reversing_exercise.exe which to a

should launch the standalone version of Macromedia Flash Player, or by opening the file password_vaultjava_reversing_exercise_viewlet._swf.html in a Web browser.

28

6 Basic Anti-Reversing Techniques


Having seen that it is fairly straight-forward for a reverse engineer to disable the trial limitation on the machine code and Java bytecode implementations of the Password Vault application, we now investigate applying anti-reversing techniques to both implementations in order to make it significantly more difficult for the trial limitation to be disabled. While anti-reversing techniques cannot completely prevent software from being reverse engineered, they act as a deterrent by increasing the challenge for the reverse engineer. [5] states "It is never possible to entirely prevent reversing" and "What is possible is to hinder and obstruct reversers by wearing them out and making the process so slow and painful that they give up." The remainder of this section introduces basic anti-reversing techniques, two of which are demonstrated in Sections 7 and 8. While it is not possible to completely prevent software from being reverse engineered, a reasonable goal is to make it as difficult as possible. Implementing antireversing strategies for source code, machine code, and bytecode can have adverse effects on a program's size, efficiency, and maintainability; therefore, it's important to evaluate whether a particular program warrants the cost of protecting it. The basic anti-reversing techniques introduced in this section are meant to be applied post-production, after the coding for an application is complete and tested. These techniques obscure data and logic and therefore are difficult to implement while also working on the actual functionality of the applicationdoing so could hinder or slow debugging and, even worse, create a dependency between the meaningful program logic and the anti-reversing strategies used. 29

[5] describes three basic anti-reversing techniques: > Eliminating Symbolic Information: The first and most obvious step in preventing reverse engineering of a program is to render unrecognizable, all symbolic information in machine code or bytecode because such information can be quite useful to a reverse engineer. Symbolic information includes class names, method names, variable names, and string constants that are still readable after a program has been compiled down to machine code or bytecode. > Obfuscating the Program: Obfuscation includes eliminating symbolic information, but goes much further. Obfuscation strategies include: modifying the layout of a program, introducing confusing non-essential logic or control flow, and storing data in difficult to interpret organizations or formats. Applying all of these techniques can render a program difficult to reverse, however care must be taken to ensure the original functionality of the application remains intact. > Embedding Antidebugger Code: Static analysis of machine code is usually carried out using a disassembler and heuristic algorithms that attempt to understand the structure of the program. Active or live analysis of machine code is done using an interactive debugger-disassembler that can attach to a running program and allow a reverse engineer to step through each instruction and observe the behavior of the program at key points during it's execution. Live analysis is how most reverse engineers get the job done, so it's common for developers to want to implement guards against binary debuggers.
30

7 Applying Anti-Reversing Techniques to Wintel Machine Code


Extreme care must be taken when applying anti-re versing techniques because some ultimately change the machine code or Java bytecode that will be executed on the target processor. In the end, if a program doesn't work, measuring how efficient or difficult to reverse engineer it is becomes meaningless [18]. Some of the anti-reversing transformations performed on source code to make it more difficult to understand in both source and executable formats, can make the source code more challenging for a compiler to process because the program no longer looks like something a human would write. [18] states "any compiler is going to have at least some pathological programs which it will not compile correctly." Compiler failures on so called "pathological" programs occur because compiler test cases are most often coded by peoplenot mechanically generated by a tool that knows how to try every fringe case and surface every bug. Keeping this in mind, one should not be surprised if some compilers have difficulty with obfuscated source code. Following the basic anti-reversing techniques introduced in Section 6, we now investigate the technique Eliminating Symbolic Information as it applies to Wintel machine code.

7.1 Eliminating Symbolic Information in Wintel Machine Code


Eliminating Symbolic Information calls for the removal of any meaningful symbolic information in the machine code that is not important to the execution of the program, but serves to ease debugging or reuse of it by another program. For example, if a program relies on certain function or methods names (as a DLL does) the names of 31

those methods or functions will appear in the .idata (import data) section of the Windows PE header. In production versions of a program, the machine code doesn't directly contain any symbolic information from the original source codesuch as method names, variable names, or line numbers; the executable file only contains the machine instructions that were produced by the compiler [9]. This lack of information about the connection between the machine instructions and the original source is unacceptable for purposes of debuggingthis is why most modern compilers, like GCC, include an option to generate debugging information into the executable file that allow one to trace a failure occurring at a particular machine instruction back to a line in the original source code [9]. To show the various kinds of symbolic information that are inserted into machine code to enable debugging of an application, the GNU C++ compiler was directed to compile the program Calculator.cpp with debugging information but to generate assembly language instead of machine code. The source code for Calculator.cpp and the generated assembly language equivalent are given in Table 7.1. The GNU compiler stores debug information in the symbol tables (.stabs) section of the Windows PE header so that it will be loaded into memory as part of the program image. It should be clear from the generated assembly in Table 7.1 that the debugging information inserted by GCC is by no means a replacement for the original source code of the program. A source-level debugger, like the GNU Project Debugger (GDB), must be able to locate the original source code file to make use of the debugging information embedded in the executable. Nevertheless, debugging information can give plenty of hints to a reverse

32

engineer, such as the count and type of parameters one must pass to a given method. An obvious recommendation to make here, assuming there is an interest in protecting machine code from being reverse engineered, is to ensure that source code is not compiled for debugging when generating machine code for use by customers.
Table 7.1. Debugging information inserted into machine code. Calculator.cpp: int main(int argc, char *argv[]) 01 02 { 03 string input; int opl, op2; char fnc; long res 04 cout << "Enter integer 1: "; 05 getline (cin, input); opl = atoi(input.c_str()) 06 cout "Enter integer 2: "; 07 getline (cin, input); op2 = atoi(input.c_str()) 08 cout "Enter function [+|-|*]: "; 09 getline (cin, input); fnc = input.at (0); 10 switch (fnc) 11 { 12 case ' + ' : 13 res = doAdd(opl, op2); break; 14 case '-': 15 res = doSub(opl, op2); break; 16 case '*': 17 res = doMul(opl, op2); break; 18 } 19 cout "Result: " << res endl; 20 return 0; 21 } 22 23 long doAdd(int opl, int op2) { return opl + op2; } 24 long doSub(int opl, int op2) { return opl - op2; } long doMul(int opl, int op2) { return opl * op2; } Calculators (abbreviated assembly): "Calculator.cpp" 01 .file "C:/SRECD/MiscCPPSource/Calculator/",100,0,0,LtextO 02 .stabs 03 .stabs "Calculator.cpp",100,0,0, LtextO 04 .stabs "main:F(0,3)",36,0,12,_main 05 .stabs "argc:p (0,3)",160,0,12,8 06 .stabs "argv:p(40,35)",160,0,12,12 06 main: 07 .stabs "Calculator.cpp",132,0,0,Ltext 08 call Z5doAddii 09 call Z5doSubii 10 call Z5doMulii

33

11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

.stabs "_Z5doAddii:F(0,18)",36,0,33, .stabs "opl:p(0,3)",160,0,33,8 .stabs Mop2:p(0,3)",160,0,33,12 Z5doAddii: movl 12(%ebp), %eax addl 8(%ebp), %eax .stabs "_Z5doSubii:F(0,18)",36,0,34, .stabs "opl:p(0,3)",160,0,34,8 .stabs "op2:p(0,3)",160,0,34,12 Z5doSubii: .stabn 68,0,34,LM33- Z5doSubii movl 8(%ebp), %eax subl %edx, %eax .stabs "_Z5doMulii:F(0,18)",3 6,0,35, .stabs "opl:p(0,3)",160,0,35,8 .stabs "op2:p(0,3)",160,0,35,12 Z5doMulii: .stabn 68,0,35,LM35- Z5doMulii movl 8(%ebp), %eax imull 12(%ebp), %eax

Z5doAddii

Z5doSubii

Z5doMulii

The hunt for symbolic information doesn't end with information embedded by debuggers, it continues on to include the most prolific author of such helpful information the programmer. Recall that in the animated tutorial on reversing Wintel machine code (see Section 4) the key piece of information that led to the solution was the trial limitation message found in the .rdata {read-only) section of the executable. One can imagine that something as simple as having the Password Vault application load the trial limitation message from a file each time time it's needed and immediately clearing it from memory would have prevented the placement of a memory breakpoint on the trial message, which was an anchor for the entire tutorial. An alternative to moving the trial limitation message out of the executable would be to encrypt it so that a search of the dump would not turn up any hits; of course encrypted symbolic information would need to be decrypted before it is used. Encryption of symbolic information, as was discussed
34

in relation to the Wintel animated tutorial, is an activity related to the obfuscation of a program, which we discuss next.

7.2 Basic Obfuscation of Wintel Machine Code


Obfuscating the Program calls for performing transformations to the source code and/or machine code that would render either extremely difficult to understand but functionally equivalent to the original. There are many kinds of transformations one can apply with varying levels of effectiveness, and as [5] states "an obfuscation transformation will typically have an associated cost (such as): larger code, slower execution time, or increased runtime memory consumption (by the machine code)." Because of the high-level nature of intermediate languages like Java and .NET bytecode, there are free obfuscation tools that can perform fairly robust transformations on bytecode so that any attempt to decompile the program will still result in source code that compiles, but is near impossible to understand because of the obfuscation techniques that are applied. [19] states "Obfuscation (of Java bytecode) is possible for the same reasons that decompiling is possible: Java bytecode is standardized and well documented." Unfortunately, the situation is very different for machine code because it is not standardized; instruction sets, formats, and program image layouts vary depending on the target platform architecture. The side-effect of this is that tools to assist with obfuscating machine code are much more challenging to implement and expensive to acquire; no free tools were found at the time of this writing. One such commercial tool, EXECryptor (www.strongbit.com) is an industrial-strength machine code obfuscator that when applied
35

to the machine code for the Password Vault application rendered it extremely difficult to understand. The transformations performed by EXECryptor caused such extreme differences in the machine code, including having compressed parts of it, that it was not possible to line up the differences between the original and obfuscated versions of the machine code to show evidence of the obfuscations. Therefore, to demonstrate machine code obfuscations in a way that is easy to follow, we'll perform obfuscations at the source code level and observe the differences in the assembly language generated by the GNU C++ compiler. The key idea here is that the obfuscated program has the same functionality as the original, but is more difficult to understand during live or static analysis. There are no standards for code obfuscation, but it's relatively important to ensure that the obfuscations applied to a program are not easily undone because deobfuscation tools can be used to eliminate easily identified obfuscations [5]. Table 7.2 contains the source code and disassembly of VerifyPassword.cpp, a simple C++ program that contains an insecure password check that is no weaker than the implementation of the Password Vault trial limitation check. To find the relevant parts of .text and .rdata sections that are related to the password check, the now familiar technique of setting a breakpoint on a constant in the .rdata section was used.
Table 7.2. Listing of VerifyPassword.cpp and disassembly ofVerifyPassword.exe. VerifyPassword.cpp: 01: i n t m a i n ( i n t argc, char *argv[]) 02: { 03: const char *password = "juplter"; 04: string specified; 05: cout << "Enter password: ";

36

06 07 08 09 10 11 12 13 14

getline (cin, specified); if (specified.compare(password) == 0) { cout << "[OK] Access granted." << endl; } else { cout << "[Error] Access denied." << endl;

VerifyPassword.exe disassembly (abbreviated):


TEXT SECTION

# "j up Iter"

0040144A MOV DWORD PTR SS:[EBP-1C],VerifyPa.00443000


# "Enter password: "

00401463 MOV DWORD PTR SS:[ESP+4],VerifyPa.00443008


# if (specified. compare (password) == 0)

004014A3 TEST EAX,EAX 004014A5 JNZ SHORT VerifyPa.004014CD


# "[OK] Access # "[Error] Access granted." denied."

004014A7 MOV DWORD PTR SS:[ESP+4],VerifyPa.00443019 004014CD MOV DWORD PTR SS:[ESP+4],VerifyPa.0044302E RDATA SECTION 00443000 6A75702174657200456E746572207061 00443010 7373776F72643A20005B4F4B5D204163 00443020 63657373206772616E7465642E005B45 00443030 72726F725D204163636573732064656E 00443040 6965642E000000000000000000000000 jup!ter.Enter pa ssword: .[OK] Ac cess granted..[E rror] Access den ied

Using the simple program VerifyPassword.cpp, we now investigate applying obfuscations to make machine code more difficult to reverse engineer. The first obfuscation that will be applied is a data transformation technique which [5] calls "Modifying Variable Encoding". Essentially this technique prescribes that all meaningful and sensitive constants in a program be stored or represented in an alternate encoding, such as ciphertext. For numerics, one can imagine storing or working with a function of a number instead of the number itself; for example, instead of testing for a <

37

10, we can obscure the test by checking if 1.2 < 1.210 instead. To make string constants unreadable in a dump of the .rdata section we can employ a simple substitution cipher whose decryption function would become part of the machine code. A simple substitution cipher is an encryption algorithm where each character in the original string is replaced by another using a one-to-one mapping [20]. Substitution ciphers are easily broken because the algorithm is the secret [21], so while we will use one for ease of demonstration, stronger encryption algorithms should be used in real-world scenarios. Table 7.3 contains the definition of a simple substitution cipher that shifts each character 13 positions to the right in the local 8-bit ASCII or EBCDIC character set. Ciphertext is generated or read in printable hexadecimal to allow all members of the character set, including control characters, to be used in the mappings. Note: unlike ROT13 [22], this cipher is not it's own inversemeaning that shifting each character an additional 13 positions to the right will not perform decryption.
Table 7.3. Simple substitution cipher used to protect string constants. SubstitutionCipher.h: 01 class SubstitutionCipher 02 03 public: SubstitutionCipher (); 04 string encryptToHex(string plainText); 05 string decryptFromHex(string cipherText) 06 07 private: unsigned char encryptTable[256]; 08 09 unsigned char decryptTable[256]; 10 char hexByte[2]; 11

Full source code: http://reversingproject. info/repository. php?fileID=7_2_l

38

Using the substitution cipher given in Table 7.3, we replace each string constant in VerifyPassword, cpp with its equivalent ciphertext. Even strings with format modifiers such as "%s" and "%d" can be encrypted as these inserts are not interpreted by methods such as printf and sprint/until execution time. Table 7.4 contains the source code and disassembly for VerifyPasswordObfuscated.exe, where each string constant in the program is stored as ciphertext; when the program needs to display a message, the ciphertext is passed to the bundled decryption routine. The transformation we've manually applied removes the helpful information the string constants provided when they were stored in the clear. Given that modern languages have well-documented grammars, it should be possible to develop a tool that automatically extracts and replaces all string constants with ciphertext that is wrapped by a call to the decryption routine.
Table 7.4. VerifyPasswordObfuscated.cpp and corresponding disassembly. VerifyPasswordObfuscated.cpp: 01: #include "substitutioncipher.h" 02: using namespace std; 03: static const char *password = "77827D2E81727F"; 04: static const char *enter_password = "527B81727F2D7D6E8080847C 7F71472D"; 05: static const char *password_ok = "685C586A2D4E70707280802D747 F6E7B8172713B"; 06: static const char *password_bad = "68527F7F7C7F6A2D4E70707280 802D71727B7672713B"; 07 int main(int argc, char *argv[]) 08 SubstitutionCipher cipher; 09 string specified; 10 11 cout << cipher . decryptFromHex (enter__password) ; 12 getline(cin, specified); 13 if (specified.compare(cipher.decryptFromHex(password)) == 0) 14 { 15 cout << cipher.decryptFromHex(password_ok) << endl; 16 } else 17 {

39

cout << cipher.decryptFromHex(password_bad) << endl; 19 20 VerifyPasswordObfuscated.exe disassembly (abbreviated):


RDATA SECTION

00445000 00445010 00445020 00445030 00445040 00445050 00445060 00445070 00445080

35323742383137323746324437443645 38303830383437433746373134373244 00373738323744324538313732374600 36383543353836413244344537303730 37323830383032443734374636453742 38313732373133420000000036383532 37463746374337463641324434453730 37303732383038303244373137323742 37363732373133420000000000000000

527B81727F2D7D6E 8080847C7F71472D .77827D2E81727F. 685C586A2D4E7070 7280802D747F6E7B 8172713B....6852 7F7F7C7F6A2D4E70 707280802D71727B 7672713B

Once all constants have been stored in an alternate encoding, the next step one could take to further protect the VerifyPassword.cpp program would be to obfuscate the condition in the code that tests for the correct password. Applying transformations to disguise key logic in a program is an activity related to the anti-reversing technique Obfuscating the Program. For purposes of demonstration we'll implement some obfuscations to the trial limitation check in the C++ version of the Password Vault application, which was introduced in Section 4, but first we discuss an additional application of the technique Obfuscating the Program that helps protect intellectual property when proprietary software is shipped as source code.

7.3 Protecting Source Code Through Obfuscation


When delivering a software application to clients, there may exist a requirement to ship the source code so that the application binary can be created on the client's computer using shop-standard build and audit procedures. If the source code contains

40

intellectual property that is worth protecting, one can perform transformations to the source code which make it difficult to read, but have no impact on the machine code that would ultimately be generated when the program is compiled. To demonstrate source code obfuscation, COBF [23], a free C/C++ source code obfuscator was configured and given VerifyPassword.cpp as input; the results of which are displayed in Table 7.5.
Table 7.5. COBF obfuscation results for VerifyPassword.cpp. COBF invocation: 01 C:\cobf_l.0 6\src\win32\release\cobf.exe 02 @C:\cobf_l.06\src\setup_cpp_tokens.inv -o cobfoutput -b -p C: 03 \cobf_l.0 6\etc\pp_eng_msvc.bat VerifyPassword.cpp COBF obfuscated source for VerifyPassword.cpp: 01 #include"cobf.h" 02 Is lp Ik;If lo(lf ln,ld*lj[]){11 Id*lc="\x6a\x75\x70\x21\x74 03 \x65\x72";lh la;Ib"\x4 5\x6e\x7 4\x65\x72\x2 0\x7 0\x61\x7 3\x7 3 04 \x7 7\x6f\x7 2\x64""\x3a\x20";li(lq,la) ; lm (la. lg (lc) ==0) {lb"\x5b 05 \x4f\x4b\x5d\x20\x41" "\x63\x63\x65\x7 3\x73\x20\x67\x7 2\x61\x6e 06 \x7 4\x65\x64\x2e"le; } lr {Ib"\x5b\x4 5\x72\x72\x6f \x72\x5d 07 \x2 0\x41\x63\x63\x65\x73\x7 3\x2 0\x64" "\x65\x6e\x6 9\x65 08 \x64\x2e"le; } } COBF generated header (cobf.h): 01 02 03 04 05 06 07 08 #define #define #define #define #define #define define #define Is lp Ik If lo Id 11 lh using namespace std int main char const string 09 10 11 12 13 14 15 #define #define #define #define #define #define #define lb li lq lm lg le lr cout getline cin if compare endl else

COBF replaces all user-defined method and variables in the immediate source file with meaningless identifiers. In addition, COBF replaces standard language keywords and library calls with meaningless identifiers, however these replacements must be undone before compilation; for example, the keyword "if cannot be left as "lm". 41

Therefore, COBF generates the cobf.h header file which includes the necessary substitutions to make the obfuscated soure compilable. Through this process, all userdefined method and variable names within the immediate file are lost, rendering the source code difficult to understand, even if one performs the substitutions prescribed in cobf.h. Since COBF generates obfuscated source as a continuous line, any formatting in the source code that served to make it more readable is lost. While the original formatting cannot be recovered, a code formatter such as Artistic Style can be used to format the code using ANSI formatting schemes so that methods and control structures can again be identified via visual inspection. Source code obfuscation is a fairly weak form of intellectual property protection, but it does serve a purpose in real-world scenarios where a given application needs to be built on the end-user's target computer instead of being pre-built and delivered on installation media.

7.4 Advanced Obfuscation of Machine Code


One of the features of an interactive debugger-disassembler like OllyDbg that is very helpful to a reverse engineer is the ability to trace the machine instructions that are executed when a particular operation or function of a program is tried. In the Password Vault application, introduced in Section 4, a reverse engineer could pause the program's execution in OllyDbg right before specifying the option to create a new password record. To see which instructions are executed when the trial limitation message is displayed, the reverser can choose to record a trace of all the instructions that are executed when execution is resumed. To make it difficult for a reverse engineer to understand the logic
42

of a program through tracing or stepping through instructions, we can employ control flow obfuscations, which introduce confusing, randomized, benign logic that serves to make live and static analysis (debugging and tracing) difficult. The often randomized and recursive nature of effective control flow obfuscations can make traces more difficult to understand and interactive debugging sessions less helpful: randomization makes the execution of the program appear different each time it's run, while recursion makes stepping through code more difficult because of deeply nested procedure calls. In [5], three types of control flow transformations are introduced: computation, aggregation, and ordering. Computation transformations reduce the readability of machine code and, in the case of opaque predicates, can make it difficult for a decompiler to generate equivalent high-level language source code. Aggregation transformations destroy the high-level language structure of a program. For example, if a programmer used the structured programming technique of functional decomposition, inlining the code of many functions into a single function in the machine code would make it impossible to recover the original program structure. Ordering transformations randomize the order of operations in a program to make it more difficult to follow the logic of a program during live or static analysis (debugging or tracing). To provide an example of how control flow obfuscations can be applied to protect a non-trivial program, we'll apply both a computation and ordering control flow obfuscation to the trial limitation check in the Password Vault application, and analyze their potential effectiveness, by gathering some statistics during execution of the obfuscated code.

43

7.5 Wintel Machine Code Anti-Reversing Exercise


Apply the anti-reversing techniques Eliminating Symbolic Information and Obfuscating the Program, both introduced in Sections 6 and 7, to the C/C++ source code of the Password Vault application with the goal of making it more difficult to disable the trial limitation. Rebuild the executable binary for the Password Vault application from the modified sources using the GNU compiler collection for Windows. Show that the Wintel machine code reversing and patching animated solution in Section 4.4 can no longer be carried out as demonstrated.

7.6 Solution to the Wintel Anti-Reversing Exercise


The solution to the Wintel machine code anti-reversing exercise is given through comparisons of the original and obfuscated source code of the Password Vault application. As each anti-reversing transformation is applied to the source code, important differences and additions are explained through a series of generated diff reports and memory dumps. Once the anti-reversing transformations have been applied, the impact they have on the machine code and how reversing the Password Vault application becomes more difficult is covered; these obfuscations make it difficult to find a good starting point and hinder live and static analysis. The obfuscated source code for the Password Vault application is located in the password' vault cpp obfuscated directory of the archive located at http://reversingproject.info/repository.php? fileID=4 1 2.

44

7.6.1 Encryption of String Literals


To eliminate the obvious starting point of setting an access breakpoint on the trial message, all of the messages issued by the application are stored as encrypted hexadecimal literals that are decrypted each time they are usedkeeping the decrypted versions out of memory as much as possible. Table 7.6 gives an example of the needed code changes to PasswordVaultConsoleUtil.cpp.
Table 7.6. Encrypted strings are decrypted each time they are displayed. 133 case createPasswordRecord: return "Create a Password Record"; ==> 137 case createPasswordRecord: DecryptMessageText("507F72 6E81722D6E2D5D6E80 8 0847C7F712D5F727 07C7F7 1", textBuffer); 186 case recordLimitReached: return "Thank you for trying Password Vault! You have reached the maximum number of records allowed in this trial version."; ==> 190 case recordLimitReached: DecryptMessageText("617 56E7B782D8 67C822D737C7F2D817F86767B742D5D6E8 080847C7F712D636E827 9812E2D6 67C822D756E83722D7F72 6E707572712D817572 2D7A6E857 67A82 7A2D7B8 27A6F72 7F2D7C7 32D7F72 707C7F718 02D6E7 97 97C84 72 7 12D7 67B2D817 57 6 8 02D817F7 6 6E7 92D8 372 7F8 07 67C7B3B", _textBuffer); 205 void PasswordVaultConsoleUtil::DecryptMessageText(const char *_cipherText, string *_plainTextBuffer) 206 { 208 string cipherText(_cipherText); 210 SubstitutionCipher cipher; 212 _plainTextBuffer->assign(cipher.decryptFromHex(cipherText)); 214 }

The net effect of encrypting the literals is shown in Fig. 7.1 where a dump of the .rdata section of the Password Vault program image no longer yields the clues it once did. Since the literals are no longer readable, one cannot simply locate and set a breakpoint on the trial limitation messageas was done in the solution to the Wintel machine code
45

reversing exercisecausing a reverser to choose an alternate strategy. Note that more than just the trial limitation message would need to be encrypted otherwise it would look quite suspicious in a memory dump alongside other non-encrypted strings!
30400300 00401000 005SC000 00550000 00591080 00592000 00599000 0059B000 00001008 00156600 00001000 00034333 00301003 00007003 00301003 33802033 Password Password Password Password Password Password Password Password .text .data . i-data PE header code data I nag I nag I nag Inag I nag Inag Inag Inag

R R R
R

RUIE

RUE
RUIE FttUE RUIE RUIE

/4

.bss .idata .rsrc

imports resources

R R R R

RUE RUE

73 22 64 50 69 76 2E 54 79 75 61 6D 72 68 6E

73 7B 20 61 6C 65 00 68 69 6C 63 29 64 69 2E

77 2P, 73 73 65 64 5B 61 6E 74 68 6E 73 73 00

6F>?2 7D122 75163 73:77 20122 20:73 57:61 6E;6B 67 20 21 20 65|64 75*6D 20:61 20:74 4D: 65

64 20 63 6F 7B 75 72 20 50 59 20 62 6C 72 73

20 77 65 72 2P, 63 6E 79 61 6F 74 65 6C 69 73

66! 6F 61,73 73:73 64:20 7D; 22 63:65 69! 6E 6F!75 73173 75:20 68|6S 72:20 6FI77 61.6C 61-67

72 20 66 76 20 73 67 20 77 68 20 6F 65 20 65

20 63 75 61 77 73 SD 66 6F 61 6D 66 64 76 20

75 73 68 61 6C 6C 75 6C 61 73 66 75 20 00 6F 72 72 64 76i65 61 78 20 72 20 69 65 72 4E 6F 38 37 32 38 37 38 35 32 37 32 38 37 33 42 37 mi 37 32 37 32 37 37 38 37 37 36 37 38 37 35 35

65 6E 79 74 20 6C 00 20 20 20 69 65 6E 73 74 30 43 46 30 31 30 44 44 36 46 33 32 42 37 38 46 44 31 44 32 31 35 41 46 45 36 31 46 41 42

72 67 2E 20 73 6C 00 74 56 72 6D 63 20 69 20 38 37 38 32 32 37 36 38 37 32 37 38 00 34 32 32 35 32 36 32 32 37 36 37 37 37 37 38 37 37

20 65 00 66 61 79 00 72 61 65 75 6F 74 6F 46

ssword for user "C*>" was change d successfuIly.. Password vauIt f ile "{#}" was sa ved successfully ..CUarn ing3 .... Thank you for tr y ing Password Va u It? Vou have re ached the nax inu n number- of reco rds a I Lowed in t h is trial versio n..Message Not F 34 44 37 30 30 32 30 45 32 34 31 30 36 32 36 31 45 33 43 46 31 41 32 30 39 44 36 36 30 31 37 38 38 37 38 37 38 38 32 36 32 37 34 44 37 37 38 36 38 37 37 38 37 37 37 38 36 37 38 32 43i 37 32i38 41 32 35136 32137 39 37 30 38 32 37 44 32 45 38 44 38 33 38 36 45 00 36 43 38 46 38 30 38 45|38 32132 32136 3Sj37 321 37 461 32 43 37 43 38 31 37 45 37 43 37 30 36 44! 35 46 30 46 45 30 39 34 39 46 30 30 32 37 31 32 36 30 32 44 45 32 41 44 46 34 35 39 42 45 33 37 37 32 37 37 38 37 38 38 32 38 37 46 37 32 37 38 37 37 37 32 32 37 37 37 37 32 33 37 37 31 ! 32 32 37 44 38 42:37 30 37 36 33 43|37 31 32 38-33 44:38 32; 37 39: 37 37: 42 35 36 4437 36-37 34:37 39, 38 35 36 30 37 44:37 44: 37 43 37 31i 38 32! 37 36:38 44; 38 42:00 34-37 43:38 44 46 34 34 32 42 46 44 37 30 30 39 37 45 33 42 43 31 45 35 41 42 33 30 31 30 33 00 32 32 37 32 36 37 38 00 37 37 38 36 37 38 36 37 37 37 37 32 38 37 36 38 32 32 32 32 37 00 32 37 33 44 45 32 30 00 31 33 41 45 38 36 37 42 43 34 46 45 33 32 45 32 44 44 44 44 32 00 44 42 8080847C7F712D73 7C7F2D8280727F2D 2F88378FI2F2D346E 802D70756E7B7472 712D808278707280 8073827979863B.. SD6E8080847C7F71 2D836E8279812D73 7679722D2F88378P, 2F2D846E8B2D806E 8372712D80827070 7280807382797986 3B.68646E7F7B767 B746R2D.617S6E7B 782D867C822D737C 7F2D817F86767B74 2D5D6E8080847C7F 712D636E8279812E 2D667C822D756E83 722D7F726E707572 712D817572207P.6E 857670827R2D7B82 7B6F727F2D7C732D 7F72707C7F71802D 6E79797C8472712D 767B2D817S76802D 817F766E792D8372 7F80767C7B3B 5R7280806E74722D 5B7C812D537C827B

*" Data Obruscanon


rtL

30138 46:32 38:33 44 37 44 38 33 38 45:38 33 36 39 37 44:38 32:37 30:38 36:38 36:41 44:38 44:38 44:36 44:36 36:37 44:37 44:38 36:37 46:37 32137 39i37 42:32 46!37 30!37 32:38 43:38

Figure 7.1. Result of obfuscating all string literals in the program.

46

7.6.2 Obfuscating the Numeric Representation of the Record Limit


Having obfuscated the string literals in the program image, we'll assume that a reverse engineer will need to select the alternate strategy of pausing the program's execution immediately before specifying the input that causes the trial limitation message to be displayed. Using this strategy, a reverser can either capture a trace of all the machine instructions that are executed when the trial limitation message is displayed, or debug the applicationstepping through each machine instruction until a sequence that seems responsible for enforcing the trial limitation is reached. Recall that in the solution to the Wintel machine code reversing exercise, an obvious instruction sequence that tested a memory location for a limit of five password records was found. By using an alternate but equivalent representation of the record limit we can make the record limit test a bit less obvious. The technique we employ here is to use a function of the record limit instead of the actual value; for example, instead of testing for a <= 5, where a is the record limit, we obscure the limit by testing if 2 a <= 25. Table 7.7 gives an example of the needed code changes to PasswordVault.cpp.
Table 7.7. Using a function of the record limit to obfuscate the condition. 176 178 180 181 void PasswordVault::doCreateNewRecord() #ifdef TRIALVERSION // Add limit on record count for reversing exercise if (passwordStore.getRecords() .size () >= TRIAL_RECORD_LIMIT) ==> 181 if ((pow(2.0, (double)passwordStore.getRecords().size()) >= pow(2.0, 5.0)))

The effects of the source code changes in Table 7.7 on the machine code are shown in Fig. 7.2. A function of the record limit is referenced during execution instead of 47

the limit itself. This type of obfuscation is as strong as the function used to obscure the actual condition is to unravel. Keep in mind that a reverse engineer will not have the non-obfuscated machine code for reference, so even a very weak function, like the one used in this solution, may be effective at wasting some of a reverser's time. The numeric function used here is very simple; more complex functions can be devised that would further decrease the readability of the machine code.

7.6.3 Control Flow Obfuscation for the Record Limit Check


We introduce some non-essential, recursive, and randomized logic to the password limit check in PasswordVault.cpp to make it more difficult for a reverser to perform static or live analysis. A design for obfuscated control flow logic which ultimately implements the trial limitation check is given in Fig. 7.3. Since no standards exist for control flow obfuscation, this algorithm was designed by the author using the cyclomatic complexity metric defined by McCabe [24] as a general guideline for creating a highly-complex control flow graph for the trial limitation check.

48

if (passwordStore.getRecords().size()
084078E0 83BD B0FFFFFF 04 884070E7 -76 21 00487BE9 C74424 08 03000008 084078F1 C74424 04 00000000 004078F9 C70424 36000000 08407180 E8 01B3FFFF 80407105 -E9 BR070000 0040718B 8D45 D8 004071SD S90424 00407110 C785 08FFFFFF FFFFFFFF 0040711A E8 3BBCFFFF

>= T R I f t L R E C O R D L I M I T )

CMP DWORD PTR SS:[EBP-1003,4 JBE SHORT Password.0040710A MOU DWORD PTR SS:CESP+S],3 MOU DWORD PTR SS:[ESP+4],0 MOU DWORD PTR SS:CESP],36 CALL Password.00401406 JMP Password.004078C4 LEA EAX,DWORD PTR SS:CEBP-28] MOU DWORD PTR SS:[ESP],EAX MOU DWORD PTR SS:CEBP-F8],-1 CALL Password.00402D5H

if ((pou(2.0,

(double)passuordStore. getRecoi-ds() .size()) >= p o w ( 2 . 0 , 5.0)))


004ti7iea DD0S 20E4SS00 00407196 D085 F8FEFFFF 0M40710C DAE9 0840710E DFE0 S0407110 9E 00407111 73 02 80407113 'EB 2B 80487115 C74424 08 03000000 804871 ID C74424 04 00000000 0040712S C70424 36000000 0040712C C785 08FFFFFF FFFFFFFF 00407136 E8 CBA2FFFF 0048713B 'E9 BA070000 00407140 8D4S D8 00487143 890424 00407146 C78S 08FFFFFF FFFFFFFF 084071SB E8 0EBCFFFF FLD QWORD PTR DS:CSSE420] FLD QWORD PTR SS:CEBP-108] FUCOMPP FSTSW BX SAHF JNB SHORT Password.004071 IS JMP SHORT Password.00407140 MOU DWORD PTR SS: [ ESP+3 ]. 3 MOU DWORD PTR SS:CESP+4],0 MOU DWORD PTR SS: C ESP 3, 36 MOU DWORD PTR SS:[EBP-F8:,-1 CfiLL Password.00401406 JMP Password.B04078FA LEfi EAX,DWORD PTR SS:CEBP-283 MOU DWORD PTR SS:tESP],EfiX MOU DWORD PTR SS:CEBP-F8],-1 CfiLL Password.00402D5R

Computation Obfuscation

Live analysis of the computation


...DD8S 20E45B00:1
00407186 0840718C 0040718E 80487118 80487111 80487113 DD8S F8FEFFFF DfiE9 DFE0 9E -73 02 -EB 2B

FLO JWQBD ETR OS: tSSE4203


FLD QWORD PTR SS:[EBP-1083 FUCOMPP FSTSW RX SfiHF JNB SHORT Password.004071IS JMP SHORT Password.00407140

DS:C00SEE420D=32.00000000000000 Stack SS:[0022FBR0]=32.00000000008000 00407100 08407106 8040710E 08407118 00407111 80487113 . DD05 20E45S00 . D08S F8FEFFFF FLD QWORD FLD QWORD FUCOMPP FSTSW RX SRHF JNB SHORT JMP SHORT PTR DS:[SSE4203 PTR SS:[EBP-1083

,,:mm :

. DFE0 . 9E .-73 82 .-EB 2B ST<1)=32.000000000000080000 ST=32.000000000000000000

Password.004071IS Password.00407140

The record limit of 5 is obscured by the use of the value 32.0 (2A5) when the operands are loaded and the condition is tested.

Figure 7.2. Record limit comperands are represented as exponents with a base of 2. The record limit check is abstracted out into the method isRecordLimitReached which returns whether or not the record limit is reached after having invoked the method isRecordLimitReached_0. The method isRecordLimitReached_0 invokes itself recursively a random number of times, growing the call stack by a minimum of 16 and a maximum of 64 frames. Each invocation of isRecordLimitReached 0 tests whether the record limit has been reached, locally storing the result, before randomly invoking one of

49

the methods isRecordLimitReached_l, isRecordLimitReached_3.

isRecordLimitReached_2, or

When the call stack is unraveled, isRecordLimitReached J)

finally returns whether or not the record limit is reached in the method isRecordLimitReached. Table 7.8 shows the required code changes to implement the control flow obfuscation. Note that a sum of random numbers returned from methods isRecordLimitReached_1, isRecordLimitReachedJ2, and isRecordLimitReached'_3 is stored in randCallSum, a private attribute of the class; this is to protect against a compiler optimizer discarding the calls because they would otherwise have no effect on the state of any variables in the program.

50

is Re cord Limit Reached Q

reached

max (16, abs(randQ)%64) i s Re cord Limit Re ached _O0

boo I reached = (2"records.getSizeO) >= (2*5)

i s Re cord Li mit Re ach ed _2 0

Figure 7.3. Obfuscated control flow logic for testing the password record limit. Table 7.8. Implementation of the control flow obfuscation in Fig. 7.3. PasswordVault.cpp: if (passwordStore.getRecords() .size () >= TRIAL_RECORD_LIMIT) ===> if (isRecordLimitReached()) 01 02 03 04 05 06 bool PasswordVault::isRecordLimitReached() { srand(time(NULL)); controlFlowAltRemain = max (4, abs(randO) % 64) return isRecordLimitReached 0();

51

bool PasswordVault::isRecordLimitReached_0() { while (controlFlowAltRemain > 0) { controlFlowAltRemain--; isRecordLimitReached 0() ; bool reached = (pow(2.0, (double)passwordStore.getRecords().size()) >= pow(2.0, 5.0)); 17: randCallSum = 0; switch (abs(randO) % 3) { case 0: randCallSum += isRecordLimitReached_l() ; break; case 1: randCallSum += isRecordLimitReached_2() ; break; case 2: randCallSum += isRecordLimitReached_3() ; break; return reached; unsigned int PasswordVault::isRecordLimitReached_l() return abs(rand()); unsigned int PasswordVault::isRecordLimitReached_2() return abs(rand());

unsigned int PasswordVault::isRecordLimitReached_3() return abs(rand());

52

7.6.4 Analysis of the Control Flow Obfuscation Using Run Traces


The goal of this analysis is to demonstrate that even though the Password Vault application is given identical input and delivers identical output on subsequent runs, OllyDbg run traces, which contain the executed sequence of assembly instructions, will be significantly different from each othermaking it difficult for a reverser to understand the trial limitation check through live or static analysis of the disassembly. Live analysis is hampered more by randomization than static analysis is because the control flow of the trial limitation check is randomized each time it is run; one can imagine the confusion that would arise if breakpoints are not always triggered, or triggered in an unpredictable order. OllyDbg run traces are captured using the run trace view once the execution of a program has been paused at the desired starting point. To have the trace logged to a file in addition to the view, select "log to file" on the context menu of the run trace view. Begin the trace by selecting "Trace into" on the "Debug" menu; the program will execute, but much more slowly than normal since each instruction must be inspected and added to the run trace view and optional log file. An OllyDbg trace will include all the instructions executed by the program and its operating system dependencies; fortunately the trace is columnar with each instruction qualified by the name of the module that executed it, so it is possible to post process the trace and extract only those instructions executed by a particular module of interest. For example, in the case of the Password Vault traces which we will analyze in this section, the Sed (stream-editor) utility was used 53

to filter the run tracesleaving only instructions executed by the "Password" module. To analyze the effectiveness of the ordering (control flow) obfuscation, statistics on the differences between three different run traces were gathered using a modification of Levenshtein Distance (LD), a generalization of Hamming Distance, to compute the edit-distancethe number of assembly instruction insertions, deletions, or substitutions needed to transform one trace into the other; we've modified LD to consider each instruction instead of each character in the run traces. Fig. 7.4 illustrates the significant differences that exist between the traces at the point of the obfuscated trial limitation check. The randomized control-flow obfuscation causes significant differences in subsequent executions of the trial limitation checkhopefully creating enough of a deterrent for a reverse engineer by hampering live and static analysis efforts. Table 7.9 contains the statistical data that was gathered for the analysis. A C++ implementation of Levenshtein Distance, written for this solution, can be downloaded from http://reversingproject.info/repository.php?fileID=7_6_l. Note that computing the edit-distance between two large files of any type can take many hours a modern PC. For reference, the average size of three traces analyzed in this section is 10MB, and to compute the edit-distance between two of them required an average of-20 hours of CPU time on an Intel Pentium 1.6GHz Dual-core processor. The LD implementation employed in this analysis uses a dynamic-programming approach that requires O(m) space; note that some reference implementations of LD require O(mn) space since they use a(m+ l ) x ( n + 1) matrix which is impractical for large files [25].
54

The ~20 hour execution time for the LD implementation is mainly because the dynamic programming algorithm is quite naive; perhaps an approximation algorithm would perform significantly better.
Analysis of Run Trace Levenshtein Edit Distances Code path: obfuscated thai limitation check * Edit Distance

Trace # 2 - T r a c e # 3 Trace #1 Trace #2 Trace #1 Trace #3 Figure 7.4. Edit-distances between three run traces of the trial limitation check.

55

Table 7.9. Statistical data gathered for randomized control-flow obfuscation.

Trace Comparison Trace #1 -> Trace #2 Trace #2 - Trace #3 Trace #1 -> Trace #3 Mean Edit Distance Trace Comparison Trace #1 -* Trace #2 Trace #2 - Trace #3 Trace #1 - Trace #3

Edit Distance 101414 67590 168892 112632 Standard Deviation 7932.32 31849.5 39781.83

8 Applying Anti-Reversing Techniques to Java Bytecode


It was demonstrated in the Java reversing and patching exercise of Section 5.2 that decompilation of Java bytecode to Java source code is possible with quite good results. While it is most often the case that we cannot recover the original Java source code from the bytecode, the results will be functionally equivalent. When new features are added to the Java language they won't always introduce new bytecode instructions; for example, support for generics is implemented by carrying additional information in the constants pool of the bytecode that describes the type of object a collection should contain; this information can then be used at execution time by the JVM to validate the type of each object in the collection. The strategy of having newer Java language

56

constructs result in compatible bytecode with optionally-utilized metadata provides the benefit of allowing legacy Java bytecode to run on newer JVMs, however if a decompiler doesn't know to look for the metadata, some information is lost; for example, the fact that a program used generics would not be recovered and all collections would be of type Object (with cast statements of course). Recall that in Section 4.1 the Boomerang decompiler failed to decompile the machine code for a simple C/C++ "Hello World" program, however in Section 5.1, the Jad decompiler produced correct Java source code for a slightly larger program. Given these results, one does need to be concerned with with protecting Java bytecode from decompilation if there is significant intellectual property in the program. The techniques used to protect machine code in the anti-reversing exercise solution, detailed in section 7.6, can also be applied to Java source code to produce bytecode that is obfuscated. Since Java bytecode is standardized and well-documented there are many free Java obfuscation tools available on the Internet such as SandMark [27], ProGuard [29], and RetroGuard [28] which perform transformations directly on the Java bytecode instead of on the Java source code itself. Obfuscating bytecode is inherently easier than obfuscating source code because bytecode has a significantly more strict and organized representation than source codemaking it much more easy to parse. For example, instead of parsing through Java source code looking for string constants to encrypt (protect), one can easily look in the constant pool section of the bytecode. The constant pool section of a Java Class File, unlike the .rdata section of Wintel machine code, contains a well-documented

57

table data structure that makes available the name and length of each constant; on the other hand, the .rdata section of Wintel machine code simply contains all the constants in the program in a contiguous, unstructured bytestream. The variable names, method names, and string literals, in the constant pool section of Java bytecode provide a wealth of information to a reverse engineer regarding the structure and operation of the bytecode and hence should be obfuscated to protect the software. Therefore, we now look at applying the technique Eliminating Symbolic Information in the context of Java bytecode.

8.1 Eliminating Symbolic Information in Java Bytecode


Variable, class, and method names, are all left intact when compiling Java source code to Java bytecode. This is a stark difference from machine code where variable and method names are not preserved. Sun Microsystem's Java compiler javac provides an option to leave out debugging information in Java bytecode: specifying javac -g:none will exclude information on line numbers, the source file name, and local variables. This option offers little to no help in fending off a reverse engineer since none of the variable names, methods names, or string literals are obfuscated. According to the documentation for Zelix Klassmaster [26], a Java bytecode obfuscation tool, a high-level of protection can be achieved for Java bytecode by applying three transformations: (1) Name Obfuscation, (2) String Encryption, and (3) Flow Obfuscation. Unfortunately, at the time of this writing, no free-of-charge software tool was found on the Internet that can perform all three of these transformations to Java bytecode. A couple of tools, namely ProGuard [29] and RetroGuard [28] are capable of applying transformation (1), and SandMark [27],
58

a Java bytecode watermarking and obfuscation research tool, is capable of applying transformation (2), although not easily. Experimentation with SandMark V3.4 was not promising since its "String Encoder" obfuscation function only worked on a trivial Java program; it failed when given more substantial input such as some of the classes that implement the Java version of the Password Vault application. It's clear from a survey of existing Java bytecode obfuscators that a full-function, robust, open-source bytecode obfuscator is sorely needed. Zelix Klassmaster, a commercial product capable of all the three transformations mentioned above, is said to be the best overall choice of Java bytecode obfsucator in [19]. A 30-day evaluation version of Zelix Klassmaster can be downloaded from the company's web site. Of course one can always make small-scale modifications to Java bytecode with a bytecode editor such as CafeBabe [30]. Incidentally, CafeBabe gets its catchy name from the fact that the hexadecimal value OxCAFEBABE comprises the first four bytes of every Java class file; this value is known as the "magic number" which identifies every valid Java class file. To demonstrate applying transformations to Java bytecode, we'll target the bytecode for program CheckLimitation.java whose source code is given in Table 8.1. For this demonstration, assume that a reverse engineer is interested in eliminating the limit on the number of passwords and that we are interested in protecting the software. Begin obfuscating CheckLimtiation.java by applying transformation (1) Name Obfuscation: rename all variables and methods in the bytecode so they no longer provide hints to a reverser when the bytecode is decompiled or edited. Using ProGuard, 59

Table 8.1. Unobfuscated source listing of CheckLimitation.java.


01 public class CheckLimitation { 02 03 private static int MAX_PASSWORDS = 5; 04 private ArrayList<String> passwords; 05 06 public CheckLimitation() 07 { 08 passwords = new ArrayList<String>(); 09 10 public boolean addPassword(String password) 11 { 12 if (passwords.size() >= MAX_PASSWORDS) 13 { 14 System.out.println("[Error] The maximum number of passwords has been exceeded!"); 15 return false; 16 else 17 18 passwords.add(password) ; 19 System.out.println("[Info] password ( " + password + ") added successfully."); 20: return true; 21: } 22: } 23: public static void main(String[] arguments) 24 25 26 CheckLimitation store = new CheckLimitation(); 27 boolean loop = true; 28 for (int i = 0; i < arguments.length && loop; i++) 29 if (!store.addPassword(arguments[i])) loop = false; 30 31 32

obfuscate the bytecode and then decompile it using Jad to observe the effectiveness of the obfuscation; the result of decompiling the obfuscated bytecode using Jad is given Table 8.2. As expected, all user-defined variable and method names have been changed to meaningless ones; of course the names of Java standard library methods must be left asis. ProGuard seems to use a different obfuscation scheme for local variables within a

60

Table 8.2. Jad decompilation of ProGuard obfuscated bytecode. public class CheckLimitation 02 03 private static int a = 5; 04 private ArrayList b; 05 06 public CheckLimitation () 07 { 08 b = new ArrayList(); 09 10 11 public boolean a(String s) 12 { 13 if (b.size () >= a) 14 15 System.out.println("[Error] The maximum number of passwords has been exceeded!"); 16 return false; 17 else 18 19 b.add(s); 20 System.out.println((new StringBuilder()).append("[Info] password(") .append(s) .append(") added successfully.") .toString ()); 21: return true; 22: } 23: } 24: public static void main(String args[]) 25 26 27 CheckLimitation checklimitation = new CheckLimitation(); 28 boolean flag = true; 29 for (int i = 0; i < args.length && flag; i++) 30 if(!checklimitation.a(args[i])) flag = false; 31 32 33
01

method; it's not clear why the variable "loop" in the main method has been changed to "flag" since it's still a very descriptive name. Next we further obfuscate the bytecode by applying transformation (2) String Encryption, and we do so by employing the "String Encoder" obfuscation in SandMark to protect the string literals in the program from being understood by a reverser. The "String 61

Encoder" function in SandMark implements an encryption strategy for literals in the bytecode that is similar to the one which was demonstrated at the source code level in the Wintel machine code anti-reversing background section: each string literal is stored in a weakly encrypted form and decrypted on-demand by a bundled decryption function. Table 8.3 contains the Jad decompilation result for the CheckLimitation.java bytecode that was first obfuscated using ProGuard and subsequently obfuscated using the "String Encoder" functionality in SandMark.
Table 8.3. Jad decompilation of SandMark (and ProGuard) obfuscated bytecode.
01 public class CheckLimitation { 02: 03: private static int a = 5; 04 private ArrayList b; 05: public CheckLimitation() 06: 07: { 08: b = new ArrayList(); 09: } 10: public boolean a(String argO) 11: 12: { 13: if(b.size() >= a) 14: 15: System.out.println(Obfuscator.DecodeString("\253\315\253\315\uFF9E\u2A3 Du5D69\u2AA5\u3884\u91CF\u5341\u5604\uDF5B\uA902\uB6C8\u0C8E\u67 61\ulF3 5\u35 9D\uBD96\uADA4\u94 6F\u8 5EE\uE8A0\u9274\u58 67\u2C9F\u3 07 7\u5E67\u2A 0B\u90D2\uB83 9\u58FC\uBE95\u0EBA\uDDF4\u313C\uB7 51\uFA9D\ul6 6C\u42A3\u6 DlD\uB25A\uA15E\u02 6E\u6ECE\u908C\u557B\u6ABD\uC5D5\u80 0C\uD38A\u3D97\u FB5E\uC4C2\uBBAC\u9ADC\u253E\u769E\u4D32\u4FB3\u0CC7")); 16 return false; 17 else 18 19 b.add(argO); 20 System.out.println((new StringBuilder()).append(Obfuscator.DecodeString("\253\315\253\315\uFF9E \u2A31\u5D7 5\u2ABl\u3 884\u91E0\u533C\u5 654\uDF6E\uA919\uB6DE\u0CD9\u67 6 3\ulF26\u3581\uBDDF\uADEl")).append(argO).append(Obfuscator.DecodeStrin g("\253\315\253\315\uFFEC\u2A58\u5D7A\u2AB3\u388F\u91D8\u5378\u5604\uDF 7C\uA91F\uB6CE\u0CCD\u67 69\ulF27\u35 96\uBD9 9\uADBC\u947 6\u85EF\uE8F9\u9

62

234") ) .toStringO ) ; 21 return true; 22 23 24 public static void main(String arg0[]) 25 26 27 CheckLimitation checklimitation = new CheckLimitation() 28 boolean flag = true; 29 for(int i = 0; i < argO.length && flag; i++) 30 if(!checklimitation.a(argO[i])) flag = false; 31 32

Note that each string literal is decrypted using the Obfuscator class which was generated by SandMark. Since Obfuscator is a public class, it must be generated into a separate file named Obfuscator.classmaking it very straight-forward for a reverser to isolate, decompile, and learn the encryption algorithm. The danger of giving away the code for the string decryption algorithm is that it could then be used to programmatically update the constants pool section of the bytecode to contain the plaintext versions of each string literal, essentially undoing the obfuscation. Ideally, we would like to prevent a reverser from being able to successfully decompile the obfuscated bytecode; this can be accomplished through control flow obfuscations which we explore next.

8.2 Preventing Decompilation of Java Bytecode


One of the most popular, and fragile, techniques for preventing decompilation involves the use of opaque predicates which introduce false ambiguities into the control flow of a programtricking a decompiler into traversing garbage bytes that are masquerading as the logic contained in an else clause. Opaque predicates are false branches, branches that appear to be conditional but are really not [5]. For example, the 63

conditions "if ( 1 == 1 )" and "if ( 1 == 2 )" implement opaque predicates because the first always evaluates to true, and the second always to false. The essential element in preventing decompilation with opaque predicates is to insert invalid instructions in the else branch of an always-true predicate (or the if-body of an always false predicate). Since the invalid instructions will never be reached during normal operation of the program there is no impact on the program's operation. The obfuscation only interferes with decompilation, where a naive decompiler will evaluate both "possibilities" of the opaque predicate and fail on attempting to decompile the invalid, unreachable instructions. Fig. 8.1 illustrates how opaque predicates would be used to protect bytecode from decompilation. Unfortunately, this technique, often used in protecting machine code from disassembly, cannot be used with Java bytecode because of the presence of the Java Bytecode Verifier in the JVM. Before executing bytecode, the JVM performs the following checks using single-pass static analysis to ensure that the bytecode has not been tampered with; to understand why this is beneficial, imagine bytecode being executed as it's received over a network connection. [31] documents the following checks made by the Java Bytecode Verifier: > Type correctness: arguments of an instruction, whether on the stack or in registers, should always be of the type expected by the instruction. > No stack overflow or underflow: instructions which remove items from the stack should never do so when the stack is empty (or does not contain at least the number of arguments that the instruction will pop off the stack). Likewise, 64

instructions should not attempt to put items on top of the stack when the stack is full (as calculated and declared for each method by the compiler). > Register initialization: Within a single method any use of a register must come after the initialization of that register (within the method). That is, there should be at least one store operation to that register before a load operation on that register. > Object initialization: Creation of object instances must always be followed by a call to one of the possible initialization methods for that object (these are the constructors) before it can be used. > Access control: Method calls, field accesses, and class references must always adhere to the Java visibility policies for that method, field, or reference. These policies are encoded in the modifiers (private, protected, public, etc.).
Opaque Predicate Template

doWorkO;

Figure 8.1. Usage of opaque predicates to prevent decompilation. Based on the high-level of bytecode integrity expected by the JVM, introducing 65

garbage or illegal instructions into bytecode is not feasible. However, this technique does remain viable for machine code, though there is some evidence that good disassemblers, such as IDA Pro, do check for rudimentary opaque predicates [5]. The authors of SandMark claim that the sole presence of opaque predicates in Java bytecode, without garbage bytes of course, can make decompilation more difficult. Therefore, SandMark implements several different algorithms for sprinkling opaque predicates throughout bytecode. For example, SandMark includes an experimental "irreducibility" obfuscation function which is briefly documented as "insert jumps into a method via opaque predicates so that the control flow graph is irreducible. This inhibits decompilation." Unfortunately this was not the case with the program DateTime.java shown in Table 8.4 as Jad was still able to decompile DateTime.class without any problems despite the changes made by SandMark's "irreducibility" obfuscation. The bytes of the unobfuscated and obfuscated class files were compared to verify that SandMark did make significant changes; perhaps SandMark does work for special cases, so more investigation is likely warranted. In any event, opaque predicates seem to be far more effective when inserted into machine code because of the absence of any type of verifier that validates all machine instructions in a native binary before allowing it to execute. SandMark's approach of using control flow obfuscations that leverage opaque predicates in an attempt to the confuse decompilers is not unique because Zelix Klassmaster, a commercial product, implements this approach as well. When Zelix Klassmaster V5.2.3a was given DateTime.class as input with both "aggressive" control

66

Table 8.4. Listing of DateTime.java Listing of DateTime.java (abbreviated): 01 public static void main(String arguments[]) 02 new DisplayDateTime().doDisplayDateTime(); 03 04 05 06 public void doDisplayDateTime() 07 08 Date date = new Date(); 09 System.out.println(String.format(DATE_TIME_MASK, date.toString())); 10: }

flow and "String Encryption" selected, some interesting results were observed in the corresponding Jad decompilation. Table 8.5 lists the Jad decompilation of Zelix's attempt at obfuscating DateTime. class. Zelix performed the same kind of name obfuscation seen with ProGuard, except it went a little too far and renamed the main method; this was corrected by manually adding an exception for methods named "main" in the tool. The results of the decompilation show that Zelix's control flow obfuscation and use of opaque predicates is somewhat effective for this particular example because even though Jad was able to decompile most of the logic in DateTime.class; Zelix's obfuscation caused Jad to lose the value of the constant DATE TIME MASK when using it on line 12, and generate a large block of static, invalid code starting at line 22. In the next two sections (8.3 and 8.4), a Java anti-reversing exercise with a complete animated solution is provided. In the solution, decompilation of Java bytecode is prevented through the use of a class encryption obfuscation implemented by SandMark. Issues regarding the use of this obfuscation technique are discussed in the animated solution.

67

Table 8.5. Jad decompilation of DateTime.class obfuscated by Zelix Klassmaster. Listing of Jad decompilation of DateTime.class (abbreviated): 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 public class a public static void main(String as[]) { (new a()) .a ();

public void a() { boolean flag = c; Date date = new Date(); System.out.println(String.format (a, new Object[] date.toString ()})); if(flag) b = !b;

private static final String a; public static boolean b; public static boolean c; static { "'?X@MA%O\005@@wY\001ZQw\\\016J\024#T\rK\024>N@\013Gy"; -1; goto _L1 _L5: a; break MISSING_BLOCK_LABEL_116; _L1: JVM INSTR swap ; toCharArray(); JVM INSTR dup ;

8.3 A Java Bytecode Code Anti-Reversing Exercise


Use Java bytecode anti-reversing tools such as ProGuard, SandMark, and CafeBabe on the Java version of the Password Vault application to apply the antireversing techniques Eliminating Symbolic Information and Obfuscating the Program with the goal of making it more difficult to disable the trial limitation. Instead of 68

attempting to implement a custom control flow obfuscation to inhibit static and dynamic analysis as was done in the solution to the Wintel machine code anti-reversing exercise, apply one or more of the control flow obfuscations available in SandMark and observe their impact by decompiling the obfuscated bytecode using Jad. Show that the Java bytecode reversing solution illustrated in the animated tutorial in Section 5.4 can no longer be carried out as demonstrated.

8.4 Animated Solution to the Java Bytecode Anti-Reversing Exercise


For instructional purposes, an animated solution to the exercise in Section 8.3 that demonstrates the use of anti-reversing tools introduced in Section 8 was created using Qarbon Viewlet Builder and can be viewed using Macromedia Flash Player. The tutorial begins with the Java Password Vault application, ProGuard, SandMark, Jad, CafeBabe, and Sun's Java JDK already installed on a Windows XP machine. Fig. 8.2 contains an example slide from the animated solution. The animated solution for the Java bytecode anti-reversing exercise can be downloaded from the following location: > Java Bytecode Anti-Reversing Animated Solution: http://reversingproject.info/repository.php?fileID=8_4_l Begin viewing the tutorial by extmctmgpassword_vaultjava_antireversing_exercise.zip to a local directory and either runningpassword_vaultjava_antireversing_exercise.exe which should launch the standalone version of Macromedia Flash Player, or by opening the fi\epassword_vaultjava_antireversing_exercise_viewlet_swf.html in a Web browser.

69

SandMaik 3.4.0 (Mystiqi Help f Decompile f Quick Protect f Static Birthmark | Dynamic Birthmark Dynamic Watermark Static Watermark Obfuscate Algorithm: j Class Encrypter Optimize

File

Diff j View

Input File Output File

C:PassworcfVaultTrialJavd'obfuscatertPassworifVaiiltjar iC:\PasswordVaultTrialJava\oWuscated\PasswordVaultEncrypted.jar

^ i

Browse

Encryption Key ||

7 ^ Specify an encryption key that will be used to encrypt and decrypt the ['".class] files in the Java archive. SandMark will bundle a new class in the output Java archive which overrides Java's default class loader to support loading the encrypted classes by first decrypting them.
Help

Obfuscate

Class Encrypter encrypts class files and causes them to be decrypted at runtime.

Figure 8.2. Sample slide from the Java anti-reversing animated tutorial.

9 Reengineering and Reuse of Legacy Software Applications


As stated in the introduction, the literature points to a future where the standard approach of "forward engineering" of software will be complimented with reverse engineering to rediscover the architecture and design as the actual implementation is being created. While any application that is greater than five years old can be considered "legacy", in this section we assume a more severe condition where enough time has passed such that an application has been enhanced and modified by several programmers, over several years, who have since moved on. Most computer science programs of study include object-oriented programming theory; this includes learning how to create 70

diagrams that illustrate the components of a system as well as their interactions during execution. The hope is that these diagrams will be literally translated in to program code, with a perfect correlation between the envisioned system and the implementation. For small projects, it might be fairly easy to check for consistency between the envisioned design, and what has been implemented, but this is not likely so for large projects. When reverse engineering is continuously used during software development, the information gained could be used to update the design diagrams at all levels of granularity [2]. The challenge here is for the computer programmer to interpret the information gathered from these reverse engineering tools. This will require the programmer to draw upon a skillset that ranges from low-level system concepts to high-level design. Unfortunately, the future offers little help in undoing the mistakes of the past. The problem of identifying concrete, reusable components within a software system is especially difficult because as [7] states, "engineers do not know how to design and build truly modular systems from scratch, let alone when starting from legacy code." In [7], Weide and Hollingsworth's main thesis is that while reverse engineering of legacy software is inherently intractable, some of us will inevitably find ourselves in a situation where no other option is available because the cost of rewriting a large, complex software system is prohibitive. In addition, should one choose to absorb the cost of rewriting a system from scratch, there are no known software development techniques that can guarantee a newly-designed system that will not need to be reengineered at some point down the road.

71

The question of whether to reengineer or reuse components of a software system most often arises in the context of large business or government organizations. Over time the processes and procedures of a business or organization will inevitably be reflected in the software systems that enable efficient, day-to-day operations [32]. Therefore, it is not possible to change processes and procedures without adjusting or enhancing the software systems that implement them. If good development practices were followed, a legacy software application is typically composed of three layers [32]: > Presentation Layer: components of a software system that accept input and generate output using various types of hardware devices. Input and output can be entered or analyzed by a human or another by another program. > Business Logic Layer: implementation of some subset of the processes and procedures of the business or organization that is relevant to the application. It is unusual for the business logic of one application to implement all of the processes and procedures. For example, the order processing and payroll applications are not likely to have much business logic in common. > Data Access Layer: this layer is responsible for servicing requests to store or retrieve data on behalf of the presentation and business logic layers. The nature of the code in this layer varies depending on the database technology being used. Technology choices range from simple sequential files to industrial-strength relational or hierarchical databases.

72

Fig. 9.1 illustrates the program architecture one would hope was used when when looking to update, reengineer, or reuse a legacy software application.

Frontend Presentation Layer Business Logic Layer Data Access Layer Backend

Figure 9.1. Layers of a well-structured legacy software application. Legacy applications that are not sufficiently componentized, such that their general organization resembles the three layers, are not good candidates for reengineering and reuse. More often than not, most software development projects in business are done under fairly aggressive time constraints, therefore it it not uncommon to find an interleaving of the layersbusiness logic in the presentation logic, and data access logic in the business logic. The most widely accepted technique to reuse legacy application components is that of Wrappering [32], where a new piece of code provides an interface to a legacy application component or layer without requiring code changes to it. This

73

technique is employed even when the complete source code of a legacy application is available for several reasons: (1) the number of lines of code in any one component or layer is extensive and poorly documentedmaking the cost of understanding the code well enough to make changes too high, (2) modifying the legacy application to be reusable without a wrapper would require locating all of the application's dependencies so that it can be recompiled and tested (3) application modernization, where a nontraditional interface to the application such as XML in the case of Web or RESTful services is desired. Creating a wrapper to a legacy machine code application can be quite challenging especially if all of the source code for the application has been lost. Unless enough of an application's source code remains such that it's possible to identify the names of reusable entry points (procedures) and their I/O data structures, attempting to reuse the application is haphazard at best. While it is possible to learn the names of entry points that have been explicitly exported by an application in the case of a DLL, the names don't indicate the layout of the expected I/O data structures. Probably the best way to discover the entry points and I/O data structures in legacy machine code is to read the source code of other applications which depend on it. For example, if a program a calls procedure 9 of program (3 passing an I/O data structure 5, and a produces correct results, there is good reason to believe that procedure 0 in program p can be reused by a third program p using signature 8. The COBOL programming language is most often associated with legacy 74

software applications. Typically, COBOL programs have a single entry point, which makes the process of identifying reusable methods all but unnecessary because, instead of declaring multiple entry points, it is general practice for legacy COBOL programs to include functional discriminators in their I/O data structure(s) that indicate the desired action(s) to be taken by the program on behalf of the caller. For example, a field called "TRANSACTION-TYPE" with the possible values "DEP", "WTH", and "BAL" would serve the same purpose in a COBOL program as the methods "doDeposit(8)", "doWithdrawl(S)", and "getBalance(8)" would serve in a Java program.
BankAccount doDeposit(...) doWithdrawlf...) getBalance(...)

01

BANK-ACCOUNT-INTERFACE. 02 TRANSACTION-TYPE-CODE P I C XXX. 83 DEPOSIT VALUE 'DEP ' . _-' 88 WITHDRAWL VALUE ' W T H 1 . " 88 BALANCE VALUE 'BAL ' . 02 AC COUNT-NUMBER P I C X ( 3 2 ) .

Figure 9.2. Mapping legacy functional discriminators to an object-oriented design.

Fig. 9.2 illustrates how a functional discriminator in a legacy COBOL data structure maps to modern programming strategies such as object-oriented design. In a real-world situation, we would be looking to reuse legacy components whose machine code is the result of thousands of lines of high-level language statements (COBOL) that implement a particular business process. Instead of going through the error-prone process of rewriting the legacy component, which is likely decades old, we wish to reuse and reengineer it so that it is easily consumed by modern programs and

75

interfaces. Since our focus is more on reuse and reengineering of legacy code at a basic level, it's not necessary to encumber ourselves with a very large program in order to learn strategies for reuse and reengineering. Therefore, for purposes of demonstration, an example COBOL program SMPLCALC.cbl, which implements a simple calculator for integer-only arithmetic, was written to simulate a potentially useful component found in the business logic layer of a legacy business application. The source code for SMPLCALC.cbl is given Table 9.1; the program has single entry point that operates on the I/O data structure SMPCALC-INTERFACE.
Table 9.1. Sample business logic component to reuse and reengineer.
01

02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

****************************************************************** ** Simple COBOL program that performs integer arithmetic ** ****************************************************************** IDENTIFICATION DIVISION. PROGRAM-ID. 'SMPLCALC. DATA DIVISION. WORKING-STORAGE SECTION. 77 MSG-NUMERIC-OVERFLOW PIC X(25) VALUE 'Numeric overflow occurred'. 77 MSG-SUCCESSFUL PIC X(22) VALUE 'Completed successfully'. LINKAGE SECTION. * Input/Output data structure 01 SMPLCALC-INTERFACE. 02 SI-OPERAND-1 PIC S9(9) COMP-5. 02 SI-OPERAND-2 PIC S9(9) COMP-5. 02 SI-OPERATION PIC X. 88 DO-ADD VALUE '+'. 8 8 DO-SUB VALUE '-'. 88 DO-MUL VALUE '*'. 02 SI-RESULT PIC S9(18) COMP-3. 02 SI-RESULT-MESSAGE PIC X(128). PROCEDURE DIVISION USING BY REFERENCE SMPLCALC-INTERFACE. MAINLINE SECTION. * Perform requested arithmetic INITIALIZE SI-RESULT SI-RESULT-MESSAGE EVALUATE TRUE WHEN DO-ADD

76

30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58

COMPUTE SI-RESULT = SI-OPERAND-1 + SI-OPERAND-2 ON SIZE ERROR PERFORM HANDLE-SIZE-ERROR END-COMPUTE WHEN DO-SUB COMPUTE SI-RESULT = SI-OPERAND-1 - SI-OPERAND-2 ON SIZE ERROR PERFORM HANDLE-SIZE-ERROR END-COMPUTE WHEN DO-MUL COMPUTE SI-RESULT = SI-OPERAND-1 * SI-OPERAND-2 ON SIZE ERROR PERFORM HANDLE-SIZE-ERROR END-COMPUTE END-EVALUATE * Successful return MOVE MSG-SUCCESSFUL TO SI-RESULT-MESSAGE MOVE 2 TO RETURN-CODE GOBACK

** Handle numeric overflow and end the program HANDLE-SIZE-ERROR. MOVE MSG-NUMERIC-OVERFLOW TO SI-RESULT-MESSAGE MOVE 16 TO RETURN-CODE GOBACK END PROGRAM 'SMPLCALC'.

**

Looking at the source code for the COBOL program SMPLCALC.cbl, we can easily determine the entry point name and the data layout of the I/O data structure. However, even knowing the full details of the application's interface does not solve the problem of making it easily reusable from Java or C because of the differences in the language data type systems. For example, Packed Decimal (Computational-3) is a numeric type that is commonly found in COBOL mainframe programs, but is not directly supported in the Java and C/C++ languages. Even floating-point numbers can be problematic because some COBOL compilers, including IBM's, do not use the standard

77

IEEE floating point representation; they instead use decimal floating point [44]. Without detailing all the differences between the COBOL, Java, and C/C++ data models, it suffices to say that writing custom code to convert between COBOL's data model and the language we wish to invoke it from is error-prone and tedious and there are better alternatives. The problem of mating disparate data models so that new programs, written in modern languages, can interact with legacy software systems, is far from new. There are many commercial tools on the market that can import a COBOL data structure and generate Java helper classes that a programmer can use to build to meet the legacy binary interface using familiar getters and setters. A great many of these tools, including IBM's Rational Application Developer (RAD) [33], leverage Sun Microsystems J2EE Connector Architecture (JCA) [34] to provide a tightly coupled integration between a Java application running in a J2EE container (server) and an enterprise application (likely written in COBOL or PL/I) running on a mainframe. The JCA architecture requires a good deal of middleware to exist between a calling Java application running in the J2EE container and the COBOL application running on a legacy software system. While this middleware is powerful because of its capability to marshall Java data into COBOL and PL/I data, it cannot easily be reused for a local scenario where no server runtimes are involved. Fig. 9.3 illustrates how the JCA architecture is used by commercial products to enable legacy business applications to be accessed from Java applications running on distributed J2EE application servers.

78

J2EE Server
J Sim DleCpIri ilatnr

inframe

Transaction Server

l >

Sm pi Calclnte trace Helper 1k

c
A

J A
^ , SMPLCALC

IT T 1

Java to COBOL Marshaller

^ >

JSimpleCalculator: Java application that provides a new front-end to the SMPLCALC COBOL application. SmplCalcInterfaceHelper: helper class for building the interface COBOL data structure, can be generated by a commercial product such as RAD. Java to COBOL Marshaller: class library that implements marshalling of Java data types to/from COBOL data types, likely comes with a commercial J2EE server such as WebSphere Application Server (WAS).

>

Figure 9.3. Example JCA implementation for accessing a legacy application. A popular alternative to using the JCA architecture to reengineer and reuse legacy applications is to implement a Service Oriented Architecture (SOA) [38]. When migrating a legacy software system to an SOA, application programs that are candidates for reuse are identified. Typically, candidate applications should be well structured such that the business logic can be isolated, encapsulated, and made into reusable components. These SOA components become capable of communicating without the tight and fragile coupling of traditional binary interfaces because they are wrappered with a platformneutral interface such as XML and Web services. Once a business or organization has created a collection of reusable components from stable and well tested code, it becomes possible to quickly assemble new applications without having to rewrite and test the 79

underlying business logic. When XML is used as envisioned, all data, both of type character and numeric are represented as printable text, completely divorced from any platform-specific representation or encoding. The net effect is that two entities or programs can interact without having to know the data structures that comprise each other's binary interface. Of course, the XML that is exchanged cannot be arbitrary, so industry standards such as XML Schema (XSD) [39], [40] and Web Services Definition Language (WSDL) [41], [42] were developed. XML Schema is used to formally describe XML documents, while WSDL is used to describe services and the operations they support. Operations in Web services are akin to public methods in the object-oriented programming paradigm. A Web service is considered to be WS-I compliant [43], or generally interoperable, if it meets many criteria, not the least of which is using XML documents for the input and output of each operation. There are many criteria defined by WS-I that apply to a Web service definition, but this particular facet, where XML is the interoperable interface of choice, sets the stage for a meaningful exercise where the focus is on the activity of making a component from a COBOL program that is reusable from Java using XML in a light-weight, local environment. In recent history, the ability to parse and generate XML documents has been added to the COBOL language in many implementations including the Micro Focus and IBM COBOL compilers and runtimes [37], [44]. XML parsing in COBOL is accomplished through the use of the XML PARSE statement, which performs an event80

driven parse of an XML document. In a event-driven parse, the initiator registers a handler which the XML parser invokes with each XML construct found in the document. For example, the start and end of an XML element would be reported as two separate events. XML generation in COBOL is accomplished through the use of the XML GENERATE statement, which, given a COBOL data structure and an output buffer, will generate XML that has the same hierarchical organization as the data structure [37, 44]. By default, the XML GENERATE statement will form XML element and attribute names using the name of each member in the COBOL data structure. This can be less than ideal in circumstances where data structure members have cryptic names that don't conform to the spirit of XML where each XML element and attribute is given a name that describes its content. Fortunately, Micro Focus COBOL provides the capability to assign custom XML element and attribute names to each data structure member, which allows for defining an XML Schema that has meaningful element and attribute names [37]. In the exercise which accompanies this section, we are asked to create a languageneutral XML interface to the "legacy" SMPCALC.cbl application program and invoke it from a Java program which incidentally makes it reusable to other Java programs. To describe an XML interface to the legacy COBOL program so that other programs may consume it, an XML Schema must be created; this can be done with a tool that can generate XML Schema from a COBOL data declaration, or by hand using an XML editor. Once an XML interface has been described using XML Schema, it is necessary to implement XML marshalling layers between the calling Java program and the legacy

81

COBOL program. In the example exercise, the XML marshalling layer for each program is implemented in the target language itself. So that the Java program can generate and consume XML based on the XML Schema that describes the interface to the COBOL program, we employ the Java Architecture for XML Binding (JAXB) [35]. JAXB facilitates the conversion of Java objects to XML and vice versa. Sun's Java JDK includes a command-line utility xjc which generates Java marshalling code from an XML Schemamaking it quite easy to write a Java program which consumes and generates XML based on an XML Schema. While generation of XML is nicely handled in COBOL by the XML GENERATE statement, consuming XML involves coding an event handler for the XML PARSE statement. Of course, complete code for both the Java and COBOL XML marshalling layers is included in the solution to the exercise, so if COBOL is a foreign language to you, there's no need for concern. Once the XML marshalling layers are in place, there's one more loose end that needs to be tied up; and that is to figure out how to pass XML documents between the two layers. Since we are in a local scenario, TCP/IP is not an option, therefore a thin Java Native Interface (JNI) layer is needed through which the Java and COBOL marshalling layers can exchange XML; note that the COBOL XML marshalling layer invokes the legacy COBOL application. Fig. 9.4 illustrates the program architecture for the exercise.

82

Java

ISirnpleCalculator JSimpleCalculator longdoAdd(int, int) longdoSub(int, int) JAXB longdoMul(int, int) XMLi

1
XML ^PARSE JNI ^XML XML GEN. &
88 88 88

SMPLCALC XML
DO-ADD DO-SUB DO-MUL

| COBOL Figure 9.4. Architecture for legacy application reengineering and reuse from Java.

In order to try out the code in this section and complete the exercise that accompanies it, a COBOL compiler and runtime environment are needed. The COBOL programs in this section, and in the solution to the exercise which accompanies it, were written, compiled, and run using a student version of Micro Focus Net Express [37]. At the time of this writing, no reasonably functional open source COBOL compiler was available that could compile, link, and run even the most simple COBOL program given in this section; this may have to do with the fact that COBOL remains a very lucrative enterprise for many businesses, so there is little interest in giving away implementations to the open source community. For example, the COBOL for GCC project has not made significant progress yet on the code generation part of the compiler [36]. When and if an open source COBOL compiler gets off the ground, it will be interesting to see what features of the commercial COBOL compilers are implemented.

83

9.1 Legacy Software Reengineering and Reuse Exercise


Provide a command-line (or graphical) interactive Java front-end to the legacy COBOL application SMPLCALC.cbl by implementing the program architecture illustrated in Fig. 9.4. Before starting the exercise, download and extract the following archive file located here http://reversmgproject.info/repositoiy php?fileID=9_l_l. Follow these steps to complete the exercise: 1) Locate the interface data structure for SMPLCALC.cbl in the copybook (source include file) SMPLCALC.cpy. There is only one data structure in the copybook. 2) Create an XML Schema which represents all of the data in the SMPLCALCINTERFACE COBOL data structure. Instead of writing this by hand, you can use the Micro Focus Net Express CBL2XML wizard [37]. 3) Write a Java interface ISimpleCalculator.java for three computation types supported by SMPLCALC.cbl using appropriate method signatures: a) long doAdd(int, int) throws java.lang.ArithmeticException. b) long doSubtract(int, int) throws Java.lang.ArithmeticException c) long doMultipy(int, int) throws Java.lang.ArithmeticException 4) Write a Java class JSimpleCalculator.java that implements the interface defined in ISimpleCalculator.java and provides a user interface for: a) Specifying which computation (add, sub, mul) is desired. b) Specifying the operands to the computation. c) Displaying the result of the computation (can be an error). 5) Use the Java command-line utility xjc, in combination with the XML Schema created in Step 2, to generate Java to XML marshalling code (JAXB). Update JSimpleCalculator.java to call this marshalling code.

84

6) Write a small C/C++ JNI program JavalCblXmlBridge.cpp which exports a method "Java2SmplCalc" which: a) Invokes XML2CALC.cbl (see Step 7), passing the XML document received from JSimpleCalculator.java. b) Returns the XML document generated by XML2CALC.cbl (see Step 7) on return from SMPLCALC.cbl to JSimpleCalculator.java. 7) Write a COBOL program XML2CALC.cbl which: a) Marshalls XML received from the Java2CblXmlBridge.cpp, based on the XML Schema created in Step 2, into SMPLCALC-INTERFACE. b) Invokes SMPLCALC.cbl, passing SMPLCALC-INTERFACE by reference. c) Marshalls SMPLCALC-INTERFACE back to XML before returning to Java2CblXmlBridge. cpp. 8) Compile XML2CALC.cbl and link it with the machine/object code for SMPLCALC.cbl (SMPLCALC.obj). a) To simulate a situation where only partial source code for an application is available, do not recompile SMPLCALC.cbl; use the object file (machine code) that comes with this exercise instead. 9) Create a DLL that can be loaded an used by JSimpleCalculator.java by compiling and linking Java2CblXmlBridge.cpp with the object code for XML2CALC.cbl. 10) Update JSimpleCalculator.java to use the XAC-generated marshalling code to send/receive XML through the JNI method defined in Step 8 and display the results of the computations performed downstream by SMPLCALC.cbl.

85

9.2 Legacy Software Reengineering and Reuse Exercise Solution


This section gives a solution to the exercise given in Section 9.1. The details of the solution are organized according to the steps of the exercise. Software requirements to build and test the solution include: Sun's Java JDK SE V6, Microsoft Visual C++ Studio Express 2008, and Micro Focus Net Express v5.1 (COBOL). Most of the source listings in this section are abbreviated, and some of the steps are skipped. The complete source and binaries for he solution can be downloaded from http://reversingproject.info/repository.php?fileID=9_2_l.

! 1) Locate the interface data structure for SMPLCALC.cbl in the copybook (source include file) SMPLCALC.cpy. There is only one data structure in the copybook. The interface data structure for SMPLCALC.cbl is located in SMPLCALC.cpy and is named SMPLCALC-INTERFACE (see Table 9.2). COBOL data structures begin with a level 01 declaration and are usually hierarchical but can be elementary.
Table 9.2. Interface data structure SMPLCALC-INTERFACE in SMPLCALC.cpy. 01 02 03 04 05 06 07 08 09 10 * Input/Output data structure 01 SMPLCALC-INTERFACE. 02 SI-OPERAND-1 PIC S9(9) COMP-5. 02 SI-OPERAND-2 PIC S9(9) COMP-5. 02 SI-OPERATION PIC X. 88 DO-ADD VALUE '+'. 88 DO-SUB VALUE '-'. 8 8 DO-MUL VALUE '*'. 02 SI-RESULT PIC S9(18) COMP-5. 02 SI-RESULT-MESSAGE PIC X(128).

86

2) Create an XML Schema which represents all of the data in the SMPLCALCINTERFACE COBOL data structure. Instead of writing this by hand, you can use the Micro Focus Net Express CBL2XML wizard [3 7]. The CBL2XML wizard in Micro Focus Net Express conveniently generates an XML Schema from a COBOL data structure. The result of using SMPLCALC.cpy as input to the CBL2XML wizard is given in Table 9.3.
Table 9.3. XML Schema generated from the COBOL data structure. <?xml version="l.0" encoding="UTF-8"?> <schema xmlns="http://www.w3.org/2 0 01/XMLSchema" elementFormDefault="qualified"> <element name="SMPLCALC-INTERFACE"> <complexType> <sequence> <element name="SI-OPERAND-l"> <simpleType> <restriction base="integer"> <totalDigits value="9" /> </restriction> </simpleType> </element> <element name="SI-OPERAND-2"> <simpleType> <restriction base="integer"> <totalDigits value="9" /> </restriction> </simpleType> </element> <element name="SI-OPERATION"> <simpleType> <restriction base="string"> <enumeration value="+" /> <enumeration value="-" /> <enumeration value="*" /> </restriction> </simpleType> </element> <element name="SI-RESULT"> <simpleType> <restriction base="integer"> <totalDigits value="18" /> </restriction> </simpleType> </element>

87

27 28 29 30 31 32 33 34 35 36 37

<element name="SI-RESULT-MESSAGE"> <simpleType> <restriction base="string"> <maxLength value="128" /> </restriction> </simpleType> </element> </sequence> </complexType> </element> </schema>

4) Write a Java class JSimpleCalculator.java that implements the interface defined in ISimpleCalculator.java and provides a user interface for: a) Specifying which computation (add, sub, mul) is desired. b) Specifying the operands to the computation. c) Displaying the result of the computation (can be an error). There is a great deal of flexibility in this part of the exercise. Some examples of the types of user interfaces that can be implemented include: command-line interactive (console-based), graphical, Java servlet (Web-based). A command-line interactive interface was implemented for the solution. A screen capture of the interface is given Fig. 9.5. Notice that a debugging mode is available to trace the various steps in the process of exchanging XML between the Java and COBOL XML marshalling layers.

88

**************************************************** ** Program: Java Front-end to COBOL Calculator ** ** Purpose: Demonstrate reengineering and reuse ** ** of a COBOL program from Java by ** ** establishing an XML bridge leveraging ** ** JAXB, JNI, and COBOL XML support. -* ** Author: Teodoro Cipresso ** ** tcipress@hotmail.com ** **************************************************** Select a task from the following menu: (1) (2) (3) (4) (5) Addition Subtraction Multiplication Toggle Debug ON Quit Program

Specify selection: 3 Specify integer operand #1: 12 Specify integer operand #2: 12 [***] COBOL multiplication result: 144

Figure 9.5. Console-based Java interface to the legacy COBOL program.

5) Use the Java command-line utility xjc, in combination with the XML Schema created in Step 2, to generate Java to XML marshalling code (JAXB). Update JSimpleCalculator.java to call this marshalling code. The xjc command-line utility generates two types of artifacts for each global (top level) element in an XML Schema: (1) Java classes that expose getters and setters for the data contained in instances of the XML Schema (XML documents), (2) Java classes that serve as metadata for the JAXB XML marshalling engine. In the solution archive file, the two classes generated by JAXB are: SmplCalcJaxbFactory.java (getters and setters) and SmplCalcJaxbMarshaller.java (JAXB XML marshalling metadata). Note these are not the default class names generated by xjc.

89

To cleanly integrate the JAXB marshalling with JSimpleCalculator.java, SmplCalcJaxbMarshaller.java, which encapsulates the interaction with the JAXB, was created. Table 9.4 gives an abbreviated listing of this class.
Table 9.4. Partial listing of SmplCalcJaxbMarshaller.java interaction with JAXB. private static JAXBContext jaxbContext = null; 02 private static Marshaller marshaller = null; 03 private static Unmarshaller unmarshaller = null; 04 05 static 06 { 07 try 08 { 09 jaxbContext = JAXBContext.newlnstance(SMPLCALCINTERFACE.class) ; 10 marshaller = jaxbContext.createMarshaller(); 11 unmarshaller = jaxbContext.createUnmarshaller(); 12 } catch (JAXBException _je) {...} 13 14 15 public static String serializeXML(SMPLCALCINTERFACE request) 16 { ByteArrayOutputStream xmlBytes = new ByteArrayOutputStream(); 17 try 18 { 19 20 marshaller.marshal(request, xmlBytes); 21 } catch (JAXBException _je) {...} 22 String xmlDoc = new String(xmlBytes.toByteArray()); 23 return xmlDoc; 24 25 26 public static SMPLCALCINTERFACE loadXML(String xmlDoc) 27 { 28 SMPLCALCINTERFACE response = null; 29 ByteArraylnputStream xmlBytes = new ByteArraylnputStream(xmlDoc.getBytes ()); 30 try 31 { 32 response = (SMPLCALCINTERFACE)unmarshaller.unmarshal(xmlBytes); 33 } catch (JAXBException _je) {...} 34 return response; 35
01

Next we need to update the add, subtract, and multiply methods in JSimpleCalculator.java to use SmplCalcJaxbMarshaller.java to generate and consume 90

XML in preparation to use the JNI XML bridge to the legacy COBOL application. Table 9.5 contains an abbreviated listing of the updated to JsimpleCalculator.java. Note that the call to method smplCalcXmllnterface is commented out. This is a call to the JNI XML bridge which will be implemented in a later step.
Table 9.5. Updates to JSimpleCalculator.java in support of JAXB marshalling.
01 public long doAddfint _lstOp, int _2ndOp) 02 { SMPLCALCINTERFACE addResult = invokeXmllnterface("+", IstOp, 03 2ndOp); return addResult.getSIRESULT().longValue(); 04 05 06 07 public SMPLCALCINTERFACE invokeXmllnterface(String calcType, int _lstOp, int _2ndOp) 08: { 09: SMPLCALCINTERFACE inputData = new SmplCalcJaxbFactory(). createSMPLCALCINTERFACE(); 10: inputData.setSIOPERATION(calcType); 11: inputData.setSIOPERANDl(Biglnteger.valueOf(_lstOp) ) ; 12: inputData.setSIOPERAND2(Biglnteger.valueOf(_2ndOp)); 13: inputData.setSIRESULTMESSAGE(""); 14: inputData.setSIRESULT(Biglnteger.valueOf(0)); 15: String inputXml = SmplCalcJaxbMarshaller.serializeXML(inputData);

16:
17:

// TODO JNI: String

outputXml

smplCalcXmllnterface(inputXml);

SMPLCALCINTERFACE outputData = SmplCalcJaxbMarshaller.

loadXML (outputXml) ;
18: return outputData; 19: }

6) Write a small C/C++ JNI program Java2CblXmlBridge.cpp which exports a method "Java2SmplCalc" which: a) Invokes XML2CALC.cbl (see Step 7), passing the XML document receivedfrom JSimpleCalculator.java. b) Returns the XML document generated by XMLlCALC.cbl (see Step 7) on return from SMPLCALC.cbl to JSimpleCalculator.java Sun's Java SDK includes the command-line utility javah that generates

appropriate C/C++ header files for a native method declaration in a Java class. The 91

generated header file will contain a function prototype that reflects the fully qualified name and signature of the method. Using the function prototype, it is the responsibility of the programmer to write a C/C++ method that conforms to it and interacts properly with the JVM. Please note that garbage collection does not apply to any memory allocated by the native code, so be sure to free it. To generate the JNI header file, we must first declare a native method in JsimpleCalculator.java that we wish to implement in C/C++. In addition, we must also indicate the name of the DLL Java will need to load in order to call it. Table 9.6 contains the needed additions to JsimpleCalculator.java to declare the native method. Note that on the System.loadLibrary call, the file extension of the DLL file is not specified.
Table 9.6. Example native method declaration for the JNI XML bridge. 01: public class JSimpleCalculator implements ISimpleCalculator 02: { 03: native String smplCalcXmllnterface(String xmldoc); 04: static 05: { 06: System.loadLibrary("Java2CblXmlBridge") ; 07: } 08: ... 09: }

When using the javah command-line utility, keep in mind that it operates on *.class files instead of *.java files; this is because the Java reflection APIs are used to get the qualified name and signature of the native method declaration instead of having to parse the source file. To generate a C/C++ header file from the JSimpleCalculator.class file, issue the command "javah -jni info.reversingproject.jsimplecalculator.JSimpleCalculator. " Table 9.7 gives the source 92

for the JNI program Java2CblXmlBridge.cpp, which implements the JNI method described in the generated header file.
Table 9.7. Example implementation of the Java to COBOL JNI XML bridge. 01: #include "package_JSimpleCalculator.h" 02: #include "cobcall.h" /* * Class : info_reversingproject_jsimplecalculator_JSimpleCalculator * Method: smplCalcXmllnterface * Signature: (Ljava/lang/String;)Ljava/lang/String; */ 03: jstring JNICALL Java_info_reversingproject_j simplecalculator__JSimpleCalculator_smplCalc Xmllnterface (JNIEnv *env, jobject parent_obect, jstring xml_doc) 04 { 05 // Get input XML document passed from Java 06 jboolean iscopy; 07 jstring output_xml; 08 char *xml_buffer = NULL; 09 char *xml_buffer_ptr = NULL; 10 const char *xml_input = (*env)->GetStringUTFChars(env, xml_doc, Siscopy); 11 int xml_len = strlen(xml_input); 12 // Allocate XML I/O buffer and copy input XML 13 xml_buffer = (char*)malloc(32767); 14 memset(xml_buffer, 0x00, 32767); // initialize 15 memcpy(xml_buffer, xml_input, xml_len); 16 // Free JNI memory used for MBCS to SBCS conversion 17 (*env)->ReleaseStringUTFChars(env, xml_doc, &iscopy); 18 // call COBOL to XML marshalling layer, passing XML I/O buffer 19 cobinitO; // Initialize Micro Focus COBOL runtime 20 XML2CALC(&xml_len, xml_buffer); // Call COBOL 21 // Null terminate XML returned from COBOL 22 xml_buffer_ptr = xml_buffer; 23 xml_buffer_ptr += xml_len; 24 *(xml_buffer_ptr) = 0x00; 25 // Allocate UTF version of XML to return to Java 26 output_xml = (*env)->NewStringUTF(env, xml_buffer); 27 // Free XML I/O buffer 28 free(xml_buffer); 29 // Return XML generated by COBOL as Java String 30 return output_xml; 31

93

7) Write a COBOL program XML2CALC.cbl which: a) Marshalls XML received from the Java2CblXm.lBridge.cpp, based on the XML Schema created in Step 2, into SMPLCALC-INTERFACE. b) Invokes SMPLCALC.cbl, passing SMPLCALC-INTERFACE by reference. c) Marshalls SMPLCALC-INTERFACE back into XML document before returning to Java2CblXmlBridge.cpp. Using the recently added XML support in COBOL [37, 44], parsing and generation of XML is fairly straight-forward. Two statements in the COBOL language, XML PARSE and XML GENERATE, are used to implement the program XML2CALC.cbl. Note that the XML GENERATE statement only allows assignment of non-default XML element names to data structure members when reading or writing from an XML file. Since we are working with XML in a stream, the XML Schema defined in the solution to Step 2 uses the default XML element names generated by the Micro Focus Net Express CBL2XML wizard. Table 9.8 gives the source code for XML2CALC.cbl, the XML layer to the legacy COBOL application.

94

Table 9.8. Implementation of a COBOL XML layer to the legacy application. $set preprocess(prexml) o(foo.pp) warn endp
K i t * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

** Wrapper program that provides an XML interface to SMPLCALC IDENTIFICATION DIVISION. PROGRAM-ID. 'XML2 CALC' . DATA DIVISION. WORKING-STORAGE SECTION.
* Input/Output data structure

**

i t * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

01 SMPLCALC-INTERFACE. 02 SI-OPERAND-1 PIC S9(9) COMP-5. 02 SI-OPERAND-2 PIC S9(9) COMP-5. 02 SI-OPERATION PIC X. 8 8 DO-ADD VALUE '+'. 8 8 DO-SUB VALUE 8 8 DO-MUL VALUE '*'. 02 SI-RESULT PIC S9(18) COMP-5. 02 SI-RESULT-MESSAGE PIC X(128).
* XML parsing state

01 XML-PARSE-STATE. 02 CURR-ELE-NAME PIC X(256). 02 CURR-ELE-CONT PIC X(256). LINKAGE SECTION. 01 XML-DOC-LEN PIC S9(9) COMP-5. 01 XML-DOC-TXT PIC X(32767). PROCEDURE DIVISION USING XML-DOC-LEN XML-DOC-TXT. MAINLINE SECTION.
* Parse XML into SMPLCALC-INTERFACE

XML PARSE XML-DOC-TXT(1:XML-DOC-LEN) PROCESSING PROCEDURE XML-HANDLER END-XML


* Invoke legacy COBOL application SMPLCALC

CALL 'SMPLCALC USING SMPLCALC-INTERFACE


* Generate XML from SMPLCALC-INTERFACE

XML GENERATE XML-DOC-TXT FROM SMPLCALC-INTERFACE COUNT IN XML-DOC-LEN END-XML


* Return to client program

GOBACK *+ * | XML event handler for marshalling XML into COBOL data * + XML-HANDLER. EVALUATE XML-EVENT WHEN 'START-OF-ELEMENT' MOVE XML-TEXT TO CURR-ELE-NAME WHEN 'CONTENT-CHARACTERS' EVALUATE CURR-ELE-NAME
+

I +

95

50 51 52 53 54 55 56 57 58 59 60 61

WHEN 'SI-OPERAND-1' MOVE FUNCTION NUMVAL(XML-TEXT) TO SI-OPERAND-1 WHEN 'SI-OPERAND-2' MOVE FUNCTION NUMVAL(XML-TEXT) TO SI-OPERAND-2 WHEN 'SI-OPERATION' MOVE XML-TEXT TO SI-OPERATION END-EVALUATE WHEN 'END-OF-ELEMENT' INITIALIZE CURR-ELE-NAME END-EVALUATE END PROGRAM 'XML2CALC.

\ 10) Update JSimpleCalculator.java to use the X/C-generated marshalling code to send/receive XML through the JNI method defined in Step 8 and display the results | of the computations performed downstream by SMPLCALC.cbl.

To begin using the JNI XML bridge, create or uncomment a line in your code that corresponds to the bolded line in Table 9.5. Essentially, code a call to method Java2CblXmlBridge.smplCalcXmlInterface(inputXmlDoc), passing the JAXB generated XML document, to invoke the legacy COBOL application SMPLCALC.cbl through JNI and the XML layers. Table 9.9 lists the results of running the complete solution code for the exercise with debug tracing turned on.
Table 9.9. Example run of the solution code with debug statements turned on. 01 02 03 04 05 06 07 08 09 10 11 12 **************************************************** ** Program: Java Front-end to COBOL Calculator ** ** Purpose: Demonstrate reengineering and reuse ** ** of a COBOL program from Java by ** ** establishing an XML bridge leveraging ** ** JAXB, JNI, and COBOL XML support. ** ** Author: Teodoro Cipresso ** ** tcipress@hotmail.com ** **************************************************** Select a task from the following menu:

96

(1) Addition 13 (2) Subtraction 14 (3) Multiplication 15 (4) Toggle Debug OFF 16 17 (5) Quit Program 18 19 Specify selection: 3 20 21 Specify integer operand #1: 16 22 23 Specify integer operand #2: 32 24 [D] JSimpleCalculator.doMultiply(16, 32) 25 26 [D] JSimpleCalculator.invokeXmllnterface(*, 16, 32) 27 28 [D] SmplCal c JaxbMar sha H e r .serial izeXML () 29 30 31: [D] SmplCalcJaxbMarshaller.serializeXML().xmlDoc[<?xml version="l.0" encoding="UTF-8" standalone="yes" 7 X S M P L C A L C INTERFACEXSI-OPERAND-l>16</SI-OPERAND-lXSI-OPERAND-2>32</SI-OPERAND2><SI-OPERATION>*</SI-OPERATIONXSI-RESULT>0</SI-RESULTXSI-RESULTMESSAGEX/SI-RESULT-MESSAGEX/SMPLCALC-INTERFACED 32: 33: [D] JSimpleCalculator.invokeXmllnterface(): Before call to Java2CblXmlBridge 34: 35: [D] JSimpleCalculator.invokeXmllnterface(): After call to Java2CblXmlBridge 36: 37: [D] SmplCalcJaxbMarshaller.loadXMLO .xmlDoc[<SMPLCALCINTERFACEXSI-OPERAND-l>16</SI-OPERAND-lXSI-OPERAND-2>32</SI-OPERAND2><SI-OPERATION>*</SI-OPERATIONXSI-RESULT>512</SI-RESULTXSI-RESULTMESSAGE>Completed success fully</SI-RESULT-MESSAGEX/SMPLCALCINTERFACE>] 38: 39: [***] COBOL multiplication result: 512

97

10 Identifying, Monitoring, and Reporting Malware


Malware describes a category of software that for one reason or another does not fit the description of a program that always operates in a way that benefits the user [5J. Of course, those of us who have ever used software might contend that this definition of malware will cause programs that we use every day to be categorized as malware. For example, the word processor used to write this paragraph has crashed more than once during the writing of this paper, and, in that regard, it's not acting in a way that benefits the user. To tighten the definition of malware, let's qualify it a bit: the malicious or annoying behaviors of malware are intentional, not the result of one or more bugs. There are currently five types of malware that affect computer systems [5] [21]: > Viruses: a virus is malware that requires some deliberate action to help it spread. For example, a user downloading and installing an infected program that in turn infects emails sent by the user. > Worms: a worm is similar to a virus but can spread by itself over computer networks. Worms have superseded viruses as the popular choice of hackers. > Trojan horses: a Trojan horse is software that has hidden and unadvertised functionality that occurs during normal use. > Backdoor: a backdoor is a vulnerability purposely embedded in software that allows an attacker to connect to the users machine with malicious intent. > Rabbit: a rabbit is a program that exhausts system resources. Types of resources that can be exhausted include memory, disk space, CPU time. 98

To experiment with most of the types of malware listed here is dangerous. Therefore, if one decides to try one's hand at analyzing real-life malware, using the machine code and bytecode reversing techniques demonstrated in this paper, one should do so in a carefully prepared environment. One should not install any malware on a computer that must remain in operating condition. Worms and backdoors can be especially dangerous because they can propagate to other systems on computer networks. Be aware that using virtualization tools such as VMware to create secondary operating system images on which to install malware can still result in the infection of the primary operating system, especially if the VMware-hosted image has connectivity enabled. The goal of this section is to help you become familiar with using software tools to identify, monitor, report, and securely delete software that you suspect to be malicious. Since it's not practical to ask that you install a virus, worm, backdoor, or rabbit on your machine, we are left with the possibility of a guaranteed benign software Trojan. It's important to note here that malware usually isn't of just one type; for example, 3 of the top 10 malicious codes families reported in 2008 were Trojans with a backdoor component [45]. It turns that focusing on software Trojans is appropriate because as Symantec's 2009 Global Internet Security Threat Report [45] states, "Trojans made up 68 percent of the volume of the top 50 malicious code samples reported in 2008", and "Five of the top 10 staged downloaders in 2008 were Trojans." For the vast majority of us, the story of the Trojan horse from antiquity is quite familiar. Essentially, the Greeks, in a 10-year siege against the city of Troy, devised a 99

brilliant plan of putting 40 of their best soldiers into the body of a large wooden horse while the rest of the army sailed away out of sight. The Trojans, assuming that the Greeks had given up, pulled the horse into their city as a trophy of their victory. As night fell over the city of Troy, the Greek army sailed back to shore. Meanwhile, the soldiers in the Trojan horse silenced some guards and opened the gatesallowing the Greek army to flood in and take the city by surprise. So what does all this have to do with software? Not too surprising, a Trojan software program is one that is not entirely what it seems. For example, imagine a program is offered for free on the Internet that claims to be able to convert audio files between different formats. The program fits the needs of many, and is definitely the right price, so it has a large install base. What users of the program are not told is that while the program is performing its advertised functions, it will perform other annoying or malicious tasks in the background such as: scanning the system for sensitive information and uploading it to a rogue site, affecting the stability and performance of the system by doing repeated expensive operations. In 1996, Mark Russinovich founded a company called "Winternals Software" where he was the chief software architect on a comprehensive suite of tools for diagnosing, debugging, and repairing Windows systems and applications [46]. Mark's company has since been purchased by Microsoft and his suite of tools have been rebranded "Windows Sysinteraals" and are offered for free on Microsoft Technet. An example of one of the more powerful tools in the Sysinternals suite is the Process 100

Monitor. The Process Monitor can capture detailed information about any running process in a Windows system including: filesystem, registry, and network activity. Just the Process Monitor alone is helpful in analyzing the behavior of an application when making the determination of whether or not it is malicious. As an aside, Mark's story is an interesting one because he is recognized as a true expert on the internals of Windows even though he did not participate in its developmenta true testament to what can be learned about software through reverse engineering. At the time of this writing, the Sysinteraals suite contained 66 different utilities, but we'll focus on the most useful one in this context of analyzing the behavior of malware: Process Monitor. In the exercise that accompanies this section, it is recommended that you use Process Monitor to complete it. If you have the opportunity to experiment with other tools in the Sysinternals suite, you are encouraged to do so. The following description of Process Monitor is given on the Windows Sysinternals web site [46]: "Process Monitor is an advanced monitoring tool for Windows that shows real-time file system, Registry and process/thread activity. It combines the features of two legacy Sysinternals utilities, Filemon and Regmon, and adds an extensive list of enhancements including rich and non-destructive filtering, comprehensive event properties such session IDs and user names, reliable process information, full thread stacks with integrated symbol support for each operation, simultaneous logging to a file, and much more. Its uniquely powerful features will make Process Monitor a core utility in your system troubleshooting and malware hunting toolkit. " Fig. 10.1 contains a capture of a Process Monitor session where the filesystem activity of the Password Vault application is recorded. When using Process Monitor, you can selectively monitor registry, filesystem, network, and thread activity.

101

File

Edit

Event

Filter

Tools

Options

Help

? H

i ^ m E> i v
PIP 5072 5072 5072 5072 5072 5072 5072 5072 5072 5072 5072 5072 5072 5072 5072

i1

Process Name 1 PasswordVault. exe IPasswordVault.exe 6 3 PasswordVault. exe IPasswordVault.exe jPasswordVault.exe 6 3 PasswordVault. exe JPasswordVault.exe JPasswordVault.exe JPasswordVault.exe JPasswordVault.exe 1 Pass wordVault. exe ] Pass wordVault. exe JPasswordVault.exe JPasswordVault.exe IPasswordVault.exe

|; Operation 0JRP_MJ .CREATE C:\PasswordVaultTrialCpp\user01 ;dat' y*FAST 10. .QUE RY_... C APasswordVauItT rialCpp\user01d a t | EMRP_MJ. .READ CAPasswordVaultTrialCpp\user01 dat p. 0URP_MJ_ CLEANUP CAPasswordVaultTrialCpp\user01 dat|: ''&.. yjjRP_MJ .Q UE RY_... C APasswordVauItT rialCpp m 0URP_MJ CR EAT E C: \PasswordVaultT rialCpp\user01datss*: 0JRP_MJ .CREATE C APasswordVauItT rialCpp 0URP_MJ_ CLEANUP C APasswordVauItT rialCpp 0URP_MJ_ .CLOSE C APasswordVauItT rialCpp &.IRP_MJ_ WRITE C APasswordVauItT rialCpp\user01 dat BURP_MJ .CLEANUP CAPasswordVauItTrialCpp\user01 dat 0URP_MJ CLO SE C APasswordVauItT rialCpp\user01 dat @URP_MJ READ C: /: 0JRP_MJ CLEANUr gklRP_MJ CL0SE

Save the vault file.

| [Showing 49 of 36,344 events {0.13%)

Backed by page file

JU
^4

Figure 10.1. Process Monitor session for the Password Vault application. Most of the malicious operations carried out by Trojans can be detected using Process Monitor, including those that contain Backdoors. Of course, Process Monitor itself doesn't identify malware, it simply reports what a process is doing. With a little bit of ingenuity, one can identify activities that don't seem to fit with the advertised functionality of a program. For example, a program that accesses registry keys, files, or network locations that are unrelated to it, is probably malicious. It's common practice these days for users to download free software from the Internet, and because we've been convinced that open-source software, which is sometimes confused with free software, should have the fewest number of vulnerabilities, we do it without much afterthought. Incidentally, the data on the number of vulnerabilities found in popular Internet browsers

102

does not support this belief. [45] reports that "Mozilla browsers were affected by 99 new vulnerabilities in 2008, more than any other browser; there were 47 new vulnerabilities identified in Internet Explorer, 40 in Apple Safari, 35 in Opera, and 11 in Google Chrome." It seems counter-intuitive that an open-source browser would have twice as many security holes than a closed-source browser like Internet Explorer. Mozilla is not malware, but it's interesting to note that in the case of software, open-source doesn't guarantee security. Becoming familiar with the Windows Sysinternals suite can help you evaluate whether the software on your Windows machine is acting in your best interest. If you suspect a particular program to be malware, it can be submitted online to a service called ThreatExpert [47]. ThreatExpert is a Web-based tool that supports submission of software executables that are to be evaluated against an on-line malware database. The tool analyzes the instruction sequences in submitted executables and attempts to match them against those of known malware. Matching against existing malware is just one part of ThreatExpert's automated engine; the service actually tries to execute suspected malware in an isolated environment in order to perform heuristic analysis of its actions. An example of a report generated by ThreatExpert for a particularly dangerous piece of malware is shown in Fig. 10.2. The figure contains only the top-level summary of the report whereas the full report contains much more detail, such as filesystem, memory, registry, network and other activity. Note that all of the malicious behaviors of the submitted executable could have been learned by

103

m ThreatExpert
Submission Summary:
a Submission details: Submission received: 2 May 2009, 1 3 : 5 3 : 2 5 Processing time: 6 min 33 sec S u b m i t t e d sample: File MDS: 0xD5D9730AF3DE7006C9940791E96B20CE File S H A - 1 : Alias: Virus,Win32.Parite.b [Kaspersky Lab] Virus,Win32,Parite [Ikarus] Summary o f t h e findings: What's been found Severity Level
Bssssssms]

OxC4AD816CC3AD6206735E24903DC58729AAB6B388

Filesize: 4 0 6 , 7 7 1 b y t e s

A n e t w o r k - a w a r e worm t h a t uses known exploit(s) in order to replicate across vulnerable n e t w o r k s . M S 0 4 - 0 1 1 : LSASS Overflow exploit - replication across TCP 445 (common for Sasser, Bobax, Kibuv, Kongo, Gaobot, S p y b o t , Randex, o t h e r IRC Bots). : Replication across networks by exploiting weakly r e s t r i c t e d shares ! (common for Randex family of worms). 1 Communication w i t h a remote IRC server. I Downloads/requests other files from I n t e r n e t , I Creates a s t a r t u p registry e n t r y . ; There were some s y s t e m executable files modified, which might 1 indicate t h e presence o f a PE-file infector.

;:i@gg@iSe@i';

filflQQJl

I
| Contains c h a r a c t e r i s t i c s of an identified security risk.

[9@HSS3@SEI9i

Figure 10.2. Example ThreatExpert report summary for submitted malware. monitoring it using Process Monitor, though it would have taken much more time. To facilitate the exercise which accompanies this section, a benign Java software

104

Trojan named "Alarm Clock" was written. The Alarm Clock program is a multithreaded, console-based application that allows you to interact with it while it continually checks whether or not to sound the alarm. Obviously, the Alarm Clock program does a bit more than its advertised function, and the goal of the exercise is to help build familiarity with the Windows Systinternals tool suite through attempting to figure out what the additional actions taken by the program are. Keep in mind that malware will not necessarily accomplish its goals as quickly possible, it may spread out or pace malicious activity in order to use fewer system resourceshelping it stay under the radar of the user. The user interface of the Alarm Clock application is shown in Fig. 10.3.
+ I (1) (2) (3) (4) + |

Alarm Clock VI.0 Display the current date and time. Display the alarm date and time. Set the alarm date and time. Quit.

>> Type an option number and press Enter: 1 [INFO] The current time is (05/02/09 13:49:48). + I + (1) (2) (3) (4) + | +

Alarm Clock VI.0 Display the current date and time. Display the alarm date and time. Set the alarm date and time. Quit.

>> Type an option number and press Enter: 3 >> Specify the alarm date and time...(mm/dd/yy HH:MM:SS). The current date and time is (05/02/09 13:49:53). >> Type the alarm date and time to set ==> 05/03/09 08:00:00 [INFO] Alarm set is successful.

Figure 10.3. Console-based Ul for the Alarm Clock example software Trojan.

105

10.1 Malware Identification and Monitoring Exercise


Using the Windows Sysinternals suite of diagnostic tools, identify the behaviors of the Alarm Clock application that make it a software Trojan. Note any filesystem, memory, registry, or other activity that is unrelated to the program's advertised functionality. The Alarm Clock application is available at the following location: > Alarm Clock Java Application Windows installer: http://reversingproject.info/repository. php?fileID=10_l_l Note that even though the Alarm Clock application is written in Java, the bytecode has been aggressively obfuscated to discourage the use of decompilation as a strategy for learning the application's behavior.

10.2 Malware Identification and Monitoring Exercise Solution


The Alarm Clock application is a benign software Trojan that in addition to being a rudimentary alarm clock, collects information about the Windows installation, and randomly scans for computers on the Internet or Intranet that will respond to an ICMP ping. The application logs all of the information it gathers into several files in a directory off of the root filesystem, or off of the current directory (if the root filesystem is not writeable). The specific information gathered by the application is as follows: > Registry data on the Windows installation including the license key. > Registry data on the currently installed programs. > The locations of Microsoft Office, OpenOffice, PDF, and text documents in the

106

"Documents and Settings" folder. > IP addresses of random Internet/Intranet hosts that respond to an ICMP ping.

Conclusion
Unless something is done to include a required amount of reverse engineering instruction in computer science and software engineering programs of study, new engineers will remain ill-equipped to work with legacy software systems as well as be unable to ensure that software is secure and safe to deploy. Most large companies have existing software systems that have been the underpinning of their business for years. It's highly difficult, not to mention cost-prohibitive, to rip and replace mission-critical software systems in response to the emergence of a new technology. As a result, organizations are always looking for candidates that can help them understand what they have and how it can be evolved to interact with the latest technologies. Students and practicing engineers need reverse engineering skills to be able to help organizations, both large and small, understand their current technology stack and recommend an integration strategy for new technologies. Software security issues, such as how the latest virus or worm infects computer systems, also require extensive reverse engineering knowledge. Since students and engineers need to learn reverse engineering, instructors need to be able to teach it to them. At the present time, even experienced computer science and software engineering instructors may not have enough knowledge of reverse engineering to teach a course on it. Compounding the problem is the fact that materials for teaching a course on reverse engineering may be difficult to find in a format that is compatible with 107

classroom delivery. Several books exist on reverse engineering that cater to industry professionals or those interested in self-study. However, in a university setting, instructors engage students in ordered learning through exercises, quizzes, and exams. Since SRE is not a standard part of the computer science curriculum, instructors will be mostly on their own to create a course that they feel gives an adequate education on the subject. Since the uses of software reverse engineering have been well documented in the literature, it is certainly feasible to provide education on the topic, though coming up with good exericses is challenging. The importance of making this education available was emphasized by El-Ramly at the 28th International Conference on Software Engineering when he stated "Reengineering skills are survival skills for those who have to carry out software renovation and modernization projects" [48]. The integration of reverse engineering techniques as part of learning in traditional computer science courses has been tried at the University of Missouri-Rolla [3]. When students were polled, 77% indicated that applying reverse engineering techniques to their normal programming assignments reinforced concepts taught during lectures [3]. Furthermore, 82% of students wanted reverse engineering to be blended in future courses, especially those that dealt with design [3]. Given these promising trials, universities should continue to work toward establishing standard content for software reverse engineering and software maintenance courses.

108

References
H. A. Miiller, J. H. Jahnke, D. B. Smith, M. Storey, S. R. Tilley, and K. Wong, "Reverse engineering: a roadmap," in Proc. Conf. Future of Software Engineering, Limerick, Ireland, 2000, pp. 47-60. G. Canfora and M. Di Penta, "New Frontiers of Reverse Engineering," in Proc. Future of Software Engineering, Minneapolis, MN, 2007, pp. 326-341. M. R. Ali, "Why teach reverse engineering?" ACM SIGSOFT SEN, v.30, n.4, pp. 1-4, Jul 2005. A. V. Deursen, J. Favre, R. Koschke, and J. Rilling, "Experiences in Teaching Software Evolution and Program Comprehension," in Proc. 11th IEEE Int. Workshop on Program Comprehension, Washington, DC, 2003, pp. 2834-284. E. Eliam, Secrets of Reverse Engineering, Indianapolis, IN: Wiley, 2005. L. Cunningham. (2008, Jul 9). COBOL Reborn [Online]. Available: http://it.toolbox.com/blogs/oracle-guide/cobol-reborn-25896 B. W Weide, W D. Heym, J. E. Hollingsworth, "Reverse engineering of legacy code exposed," in Proc. 17th Int. Conf. Software Engineering, Seattle, Washington, WA, 1995, pp. 327-331. Wikipedia contributors. (2008, Sept 9). Compiler [Online]. Available: http://en.wikipedia.org/w/index.php?title=Compiler&oldid=237244781 B. Gough,^ introduction to GCCfor the GNU Compilers gcc and g++, Bristol, United Kingdom: Network Theory Limited, 2005. K. Irvine, Assembly Language: For Intel-Based Computers, Upper Saddle River, NJ: Prentice Hall, 2007. Boomerang Decompiler Project. (2006), Boomerang: a general, open source, retargetable decompiler of machine code programs (Version 0.3.2) [Online]. Available: http://boomerang.sourceforge.net Backer Street Software. (2007). REC: Reverse Engineering Compiler (Version 2.1) [Online]. Available: http://www.backerstreet.com/rec/rec.htm O. Yuschuk. (2000). OllyDbg: 32-bit assembler level analysing debugger for Microsoft Windows (Version 1.1) [Online]. Available: http://www.ollydbg.de Wikipedia contributors. (2008, Oct 2008). Machine code [Online]. Available: http://en.wikipedia.org/w/index.php?title=Machine_code&oldid=246690032 P, Haggar. (2001, Jul 1). Java bytecode: Understanding bytecode makes you a better programmer [Online]. Available: http://www.ibm.com/developerworks/ibm/library/it-haggar_bytecode

109

[16] P. Kouznetsov. (2001). J ad: J ad is a Java decompiler, i.e. program that reads one or more Java class files and converts them into Java source files which can be compiled again (Version 1.5.8g) [Online]. Available: http://www.kpdus.com/jad.html [17] Wei Dai. (2008). Crypto++ Library, Crypto+ + Library is a free C+ + class library of cryptographic schemes (Version 5.5.2) [Online]. Available: http ://ww w. cryptopp. com [18] G.M.Weinberg, The Psychology of Computer Programming, New York, New York: Dorset House Publishing, 1998. [19] A. Kalinovsky, Covert Java: Techniques for Decompiling, Patching, and Reverse Engineering, Indianapolis, IN: Sam's Publishing, 2004. [20] A. Sinkov, Elementary Cryptanalysis: A Mathematical Approach. Washington, DC: The Mathematical Association of America, 1980. [21] M. Stamp, Information Security: Principles and Practice, Hoboken, NJ: John Wiley & Sons, 2006. [22] Wikipedia contributors. (2009, Feb 9). ROT13 [Online]. Availble: http://en.wikipedia.org/w/index.php?title=ROT13&oldid=269492700 [23] B. Baier. (2006). COBF: the Freeware C/C++ Sourcecode Obfuscator (Version 1.06) [Online]. Available: http://home.arcor.de/bernhard.baier/cobf [24] T.J. McCabe, "A Complexity Measure," IEEE Trans. Softw. Eng, vol. 2, no. 4, pp. 308-320, July 1976. Available: http://www.literateprogramming.com/mccabe.pdf [25] Wikipedia contributors. (2008, Sept 26). Levenshtein distance [Online]. Available: http://en.wikipedia.Org/w/index.php? title=Levenshtein_distance&oldid=273450805 [26] Zelix Pty Ltd. (2009). Zelix Klassmaster: Java Bytecode Obfuscator (Version 5.2) [Online]. Available: http://www.zelix.com/klassmaster/features.html [27] The University of Arizona, Department of Computer Science. (2004). SandMark: A Tool for the Study of Software Protection Algorithms (Version 3.4) [Online]. Available: http://sandmark.cs.arizona.edu [28] Retrologic Systems. (2007). RetroGuardfor Java Obfuscation (Version 2.3.1) [Online]. Available: http://www.retrologic.com/retroguard-main.html [29] E. Lafortune. (2008). ProGuard v4.3: a Free Java bytecode Shrinker, Optimizer, Obfuscator, andPreverifier (Version 4.3) [Online]. Available: http://proguard.sourceforge.net [30] A. G. Shvets. (1999). CafeBabe: Graphical Classfile Disassembler, Editor, Stripper, Migrator, Compactor and Obfuscator (Version 1.2.7.a) [Online]. Available: http://www.geocities.com/CapeCanaveral/Hall/2334/programs.html

110

[31] M. R. Batchelder, "Java Bytecode Obfuscation", M.S. Thesis, Dept. Comp Sci., McGill Univ., Montreal, Canada, 2007. Available: http://digitool.library.mcgill.ca: 1801/webclient/StreamGate? folder_id=0&dvs=1236657408333~988 H. M. Sneed, "Encapsualtion of legacy software: A technique for reusing legacy software components", in Ann. Software Engineering, v.9, n.4, pp.293-313, 2000. IBM, (2008). IBM Rational Application Developer for WebSphere Software (Version 7.5.1) [Online]. Available: http://www01 .ibm.com/software/awdtools/developer/application Sun Microsystems. (2005, May 11). J2EE Connector Architecture [Online]. Available: http://java.sun.com/j2ee/connector Wikipedia contributors. (2009, Mar 24). Java Architecture for XML Binding [Online]. Available: http://en.wikipedia.Org/w/index.php? title=Java_Architecture_for_XML_Binding&oldid=279402856 Free Software Foundation. (2000). COBOL For GCC: a project to produce a free COBOL compiler compliant with the COBOL 85 Standard, integrated into the GNU Compiler Collection (GCC) (Version 0.1.2) [Online]. Available: http://cobolforgcc.sourceforge.net Micro Focus Ltd (2008). Net Express Personal Edition: a complete environment for quickly building and modernizing COBOL enterprise components and business applications (Version 5.1) [Online]. Available: http://www.microfocus.com/Resources/Communities/Academic World Wide Web Consortium contributors. (2004, Feb 11). Web Services Architecture [Online] .Available: http://www.w3.org/TR/2004/NOTE-ws-arch20040211 World Wide Web Consortium contributors. (2004, Oct 28). XML Schema Part 1: Structures (2nd ed.) [Online]. Available: http://www.w3.org/TR/xmlschema-l World Wide Web Consortium contributors. (2004, Oct 28). XML Schema Part 2: Datatypes (2nd ed.) [Online]. Available: http://www.w3.org/TR/xmlschema-2 World Wide Web Consortium contributors. (2004, Jun 26). Web Services Description Language (WSDL) Part 1: Core Language (Version 2.0) [Online]. Available: http://www.w3.org/TR/wsdl20 World Wide Web Consortium contributors. (2004, Jun 26). Web Services Description Language (WSDL) Part 2: Adjuncts (Version 2.0) [Online]. Available: http://www.w3.org/TR/wsdl20-adjuncts Web Services Interoperability Organization. (2007, Oct 24). Basic Profile (Version 1.2) [Online]. Available: http://www.ws-i.org/Profiles/BasicProfilel_2(WGAD).html

111

[44] IBM. (2007). Enterprise COBOL for z/OS: Language Reference V4R1. (1st ed.) [Online]. Available: http://publibfp.boulder.ibm.com/epubs/pdf/igy31r40.pdf [45] Symantec Corp. (2009 Apr). Symantec Global Internet Security Threat Report (1st ed.) [Online]. Volume 14(1). Available: http://eval.symantec.com/mktginfo/enterprise/white_papers/bwhitepaper_internet_security_threat_report_xiv_04-2009.en-us.pdf [46] Microsoft TechNet. (2009, May 7). Windows Sysinternals: utilities to help manage, troubleshoot and diagnose Windows systems and applications. [Online]. Available: http://technet.microsoft.com/en-us/sysinternals/default.aspx [47] ThreatExpert Ltd. (2009) ThreatExpert: ThreatExpert is an advanced automated threat analysis system designed to analyze and report the behavior of computer viruses, worms, trojans, adware, spyware, and other security related risks in a fully automated mode. [Online]. Available: http://www.threatexpert.com [48] M. El-Ramly, "Experience in teaching a software reengineering course," in Proc. 28th Int. Conf on Software Engineering. Shanghai, China, 2006, pp. 699-702.

112

You might also like