Decompiling is often used in conjunction with recovering lost source code, or in reverse-engineering code when we do not have access to the source. Here we describe a novel use: places where accurate position reporting, even in the presence of optimized code or where source code is not available, could be helpful. Examples include tracebacks, debuggers, core dumps and so on. Also new is using decompilation to assist debugging at run-time. We show how more conventional methods of source-code reporting are vague and ambiguous. Although effecting a pervasive change in a compiler is arduous and error-prone, a decompiler can be developed somewhat independently. However, for the vast number of programming languages, decompilers do not exist. This is true even for dynamic interpreted languages where there is little going on in the way of "compilation." Traditionally, decompilers are labor intensive and ad hoc, and their construction might also be unfamiliar. So this paper shows how to ameliorate these problems by describing a pipeline for writing decompilers for dynamic languages that we hope will become standard in the same way that the pipeline from the Dragon Book [9] has. Our pipeline differs somewhat from the standard compiler pipeline and from earlier decompiler pipelines [13, 16]. The differences may lead to further avenues of research and practice. We use a grammar-directed parsing of instructions with an ambiguous grammar to create an AST that, at the top levels, resembles that of the source-code AST. This is helpful in error reporting. An ambiguous grammar can give an exponential number of derivations on certain likely input sequences, and we describe techniques to deal with this issue. Finally, we describe issues around developing and maintaining a Python bytecode decompiler, which handles the Python language spanning 15 releases over some 20 years. We hope to encourage the incorporation of decompilation into tracebacks, debuggers, core dumps, and so on.
Reporting a line number as a position in a program can be vague. Consider the following Python code:

    x = prev[prev[p]]
    ...

If we get the following Python error at runtime:

    IndexError: list index out of range

a line number, and possibly a method name or path, will be reported. But which index caused the problem? Or consider possible runtime errors in these lines:

    x = a / b / c  # which divide?
    # which index?
    [x[0] for i in d[j] if got[i] == e[i]]
    return fib(x) + fib(y)  # which fib?
    # code created at runtime
    exec("a=%s; b=%s" % (y, z))

As seen in the last example, there are problems with functions like eval() which evaluate either a string containing a program-language fragment, or some intermediate representation of a program such as an Abstract Syntax Tree or a code object. In these cases, the program is created at runtime, either from strings or from method calls which create AST objects and, either way, there is no file containing source code as a stream of characters.

The Pos type in Go's AST is more informative in that it encodes the line number, file name and column number all very compactly as an opaque index [11]. Even better, though, would be to extend this to an interval, or a beginning and ending position. Go's implementation cleverly takes advantage of a sparseness property: the number of interesting locations in a program, whether a single point or a pair of points, is much smaller than the program's total number of bytes. Therefore it is practical for a single integer to index into a small table whose entries have a lot of information.

Long experience has shown that trying to get compilers and interpreters to track location correspondences through the compilation process and then save them in the runtime system is an uphill battle. The Go implementers resist extending a Pos type to cover interval positions because of their fear that the line/column approach already slows down compilation.
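To make the idea of an opaque position index concrete, here is a minimal Python sketch of a table in which a single small integer stands for a full source interval. It is not Go's implementation and not the method developed in this paper: the names Span, PositionTable and the file name "<example>" are hypothetical, and the intervals are taken from Python's standard ast module (the end_col_offset attribute assumes Python 3.8 or later), which presumes the source text is available. The sketch only illustrates that the two subscripts in x = prev[prev[p]] share a line number yet have distinct intervals.

```python
import ast
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical table entry: a file name plus a column interval on one line."""
    filename: str
    line: int
    start_col: int
    end_col: int


class PositionTable:
    """Hypothetical table mapping small integers to Span entries.

    Only "interesting" locations are stored, so the table stays small
    relative to the size of the program text.
    """

    def __init__(self):
        self._spans = []   # index -> Span
        self._index = {}   # Span -> index, so equal spans share one index

    def intern(self, span):
        """Return the opaque integer for span, adding it if it is new."""
        pos = self._index.get(span)
        if pos is None:
            pos = len(self._spans)
            self._spans.append(span)
            self._index[span] = pos
        return pos

    def lookup(self, pos):
        """Recover the full Span from its opaque integer."""
        return self._spans[pos]


source = "x = prev[prev[p]]"
table = PositionTable()
positions = []

# Record one interval for every subscript expression in the line.
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Subscript):
        span = Span("<example>", node.lineno, node.col_offset, node.end_col_offset)
        positions.append(table.intern(span))

for pos in positions:
    span = table.lookup(pos)
    print(pos, span.line, source[span.start_col:span.end_col])
```

Each subscript gets its own integer, and looking that integer up yields the exact text of the sub-expression rather than just the shared line number. A runtime that attached such an index to each instruction could say which index failed; the cost of tracking it through compilation is the concern raised above.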