Dynamic Re-Compilation of Binary RISC Code For CISC Architectures
Diploma Thesis
• RISC (PowerPC) machine code is translated into CISC (i386) code. Special problems caused by this combination, such as register allocation and condition code conversion, are addressed.
• The first translation pass is optimized for maximum translation speed and designed
to be only slightly slower than one interpretive run.
• The system optimizes common cases found in compiled user mode code, such as
certain stack operations.
• Recompiler technology that needs only about 25% more time per instruction for translation than a single interpretation, and produces code that is only 3 to 4 times slower than native code.
• Recompiler technology that needs only about double the time per instruction for translation compared to a single interpretation, and produces code that takes only about twice as long to execute as native code.
• Recompiler technology that, without using intermediate code, optimizes whole functions and allocates registers dynamically, achieving speed close to that of native code.
I would like to thank Georg Acher (supervision), Christian Hessmann (discussions, proofreading), Melissa Mears (discussions, proofreading), Axel Auweter (discussions, support code), Daniel Lehmann (discussions, proofreading), Tobias Bratfisch (support code), Sebastian Biallas (discussions), Costis (discussions), Alexander Mayer (proofreading) and Wolfgang Zehtner (proofreading) for their support.
Contents
1 Motivation
2.4.3 Registers
2.4.4 Instruction Set and Addressing Modes
2.4.5 Instruction Encoding
2.4.6 Endianness
2.4.7 Stack Frames and Calling Conventions
2.4.8 Unique PowerPC Characteristics
3 Design
3.1 Differences between PowerPC and i386
3.1.1 Modern RISC and CISC CPUs
3.1.2 Problems of RISC/CISC Recompilation
3.2 Objectives and Design Fundamentals
3.2.1 Possible Objectives
3.2.2 How to Reconcile all Aims
3.3 Design Details
3.3.1 The Hotspot Principle Revisited
3.3.2 Register Mapping
3.3.2.1 Simple Candidates
3.3.2.2 Details of Static Mapping
3.3.3 Condition Code Mapping
3.3.3.1 Parity Flag
3.3.3.2 Conversion to Signed Using a Table
3.3.3.3 Conversion to PowerPC Format
3.3.3.4 Intermediate i386 Flags
3.3.3.5 Memory Access Optimization
3.3.3.6 The Final Code
3.3.3.7 Compatibility Issues
3.3.4 Endianness
3.3.4.1 Do Nothing
3.3.4.2 Byte Swap
3.3.4.3 Swapped Memory
3.3.4.4 Conclusion
3.3.5 Instruction Recompiler
3.3.5.1 Dispatcher
3.3.5.2 Decoder
3.3.5.3 i386 Converter
3.3.5.4 Instruction Encoder
3.3.5.5 Speed Considerations
3.3.6 Basic Block Logic
3.3.6.1 Basic Blocks
3.3.6.2 Basic Block Cache
3.3.6.3 Control Flow Instructions
3.3.6.4 Basic Block Linking
3.3.6.5 Interpreter Fallback
3.3.7 Environment
3.3.7.1 Loader
3.3.7.2 Memory Environment
3.3.7.3 Disassembler
3.3.7.4 Execution
3.3.8 Pass 2 Design
3.3.8.1 Register Mapping Problem
3.3.8.2 Condition Code Optimization
3.3.8.3 Link Register Inefficiency
3.3.8.4 Intermediate Code
3.3.8.5 Pass 2 Design Overview
3.3.9 Dynamic Register Allocation
3.3.9.1 Design Overview
3.3.9.2 Finding Functions
3.3.9.3 Gathering use/def/pred/succ Information
3.3.9.4 Register Allocation
3.3.9.5 Signature Reconstruction
3.3.9.6 Retranslation
3.3.9.7 Address Backpatching
3.3.9.8 Condition Codes
3.3.10 Function Call Conversion
3.3.10.1 "call" and "ret"
3.3.10.2 Stack Frame Size
3.3.10.3 "lr" Sequence Elimination
6 Future
6.1 What is Missing
6.2 Further Ideas
6.2.1 Translation Concepts
6.2.2 Evaluation of Simplification Ideas
6.2.3 Dispatcher
6.2.4 Peephole Optimization
6.2.5 Instruction Encoder
6.2.6 Calculated Jumps and Function Pointers
6.2.7 Pass 1 Stack Frame Conversion
6.2.8 Register Allocation
6.2.8.1 Adaptive Static Allocation
6.2.8.2 Keeping Lives of a Source Register Apart
6.2.8.3 Better Signature Reconstruction
6.2.8.4 Global Register Allocation
6.2.8.5 Condition Codes
6.2.8.6 Other Architectures
6.2.9 Other Applications
6.2.9.1 System Emulation
6.2.9.2 Hardware Implementation
6.3 SoftPear
List of Figures
List of Tables
Chapter 1
Motivation
There has always been and there will always be different and incompatible computer sys-
tems. Although the home computer wars of the 1980s and early 1990s with their extreme
variety of platforms are over, there are still a handful of desktop and server architectures
today, with different and incompatible CPUs. Whenever there is incompatibility, there are
solutions to make platforms compatible:
• Commodore 128: Two CPUs made this computer compatible with the Commodore
64 as well as the CP/M operating system.
• Java: Applications written for the Java Virtual Machine can run on many otherwise
incompatible computers.
Interoperability can be achieved through hardware or software. The software solution either
means porting the source code, or, if no source code is available, emulation.
Regarding the purpose of emulation, most of the available emulators fall into one of the
following categories:
1. Commercial emulators, mostly intended for migration: These emulators are typi-
cally highly optimized and target a machine that is superior to the emulated machine
(DR Emulator [1]) or even constructed for optimal emulation performance (Crusoe
[6]). Therefore, they avoid problems of the general case.
2. Hobby emulators, mostly of classic systems (UAE [14]): Most of these emulators
target the i386 CPU, but they are hardly optimized for performance, as sufficient
emulation speed is easily achieved because the emulated system is typically clearly
inferior.
3. Portability solutions (Java [10], .NET [11]): These emulators are highly optimized.
Their input language is an artificial intermediate language that includes a lot of meta-
data and makes analysis easy. They are more similar to a back end of a compiler than
to a conventional emulator.
This thesis is supposed to close the gap between categories 1 and 2: Commercial emulators often
achieve very high speeds, which is easy because they target a superior CPU architecture.
Hobby emulators sometimes target an inferior CPU architecture, but they rarely have a
need to be fast.
The objective of this thesis is to develop concepts for a recompiler of a RISC architecture
that targets a CISC CPU: PowerPC machine code is supposed to be translated into i386
machine code. Both are architectures that are widely used today, and the newest members
of each family achieve a comparable performance. So PowerPC to i386 recompilation
means translating code to an equally fast system that has an inferior ISA.
The RISC/CISC combination is not supposed to be the single characteristic that distin-
guishes this work from other recompilers. These four concepts are the basic ideas of what
shall be developed:
1. RISC to CISC: The recompiler will be designed to translate code to an inferior
instruction set architecture. It must deal with differences such as the number of
registers and three-register logic, as well as with complex instruction encoding. Low-level solutions are supposed to be found to produce highly optimized code.
2. Recompiler/Recompiler Hotspot Method: Dynamic recompilers almost always
use the hotspot system that combines recompilation with interpretation. In the scope
of this thesis, a hotspot recompiler will be developed that never interprets but has two
levels of recompilation.
3. Simple and Fast Recompiler: Most recompilers are either optimized for simplicity
or for producing excellent code. The first pass of this hotspot system will consist
of a very fast recompiler optimized for high speed dispatching, decoding and code
generation, which still produces fairly good code.
4. Compiled User Mode Code: Many recompilers emulate complete computers in-
cluding the operating system. This solution will be designed for user mode code
only, and optimized for code that has been produced by a compiler.
Although some of the ideas involve optimization and speed, the important point is to evaluate algorithms that promise good performance rather than to optimize implementations.
The solutions developed in this thesis can be useful for different applications:
• Softpear: Softpear [28] is an Open Source project working on a solution to run Mac
OS X applications on i386 CPUs. This will be achieved by using the i386 version of
the base operating system Darwin and emulating user mode applications.
• Software Controlled Radio: An addition to Digital Audio Broadcasting that is currently being researched transmits audio codecs in a RISC-like intermediate language. The RISC to i386 recompiler could be used to target the i386 CPU.
Additionally, the source code or the ideas might be used in other PowerPC to i386 emula-
tors, like PearPC [15] or a Nintendo GameCube emulator.
Apart from concrete applications, recompilation, and especially RISC to CISC recompilation, has some value to research. Although CISC CPUs were written off more than a decade ago, the majority of desktop computers are driven by i386 CISC CPUs, and the AMD64/EM64T extensions will lengthen the life of CISC even more. This is caused by the fact that there is a gigantic base of software available for this platform, which would be lost when switching CPU architectures. Recompilers can solve this problem: they can help users leave the i386 platform, so recompiler research is important.
But as long as CISC is so important, it is also interesting as a target for recompilation.
Research seems to always have concentrated on migration from CISC to RISC, but inter-
operability between the two is also an important goal.
This thesis is divided into six chapters. The following chapter gives an overview of the
technologies that form the basis of the work on this topic: recompilation in general and
the i386 and PowerPC CPUs. Chapter three describes the basic concept as well as all
details of the design of the recompiler that has been developed. Interesting implementation
details are described in chapter four. Chapter five evaluates the code quality as well as the
speed of the generated code. Limitations of the current design and ideas for improvements
are discussed in chapter six. The Appendix contains speed measurement code as well as
detailed statistics that are referred to throughout this thesis.
Chapter 2
Introduction to Emulation, RISC and CISC
• Historical Preservation: Old machines often have some historic value, and there
might be only few or even no working machines left. Emulation makes sure that
these systems can still be explored and learnt from today. Emulators are available for
many classic machines, like the DEC PDP series and the Altair 8800 [12].
• Migration: Work flow in a company may for decades depend on an application that
is only available for a single architecture. This application may not have been ported
to other (later) computer systems, because the original producer has gone out of busi-
ness, the source code of the application has been lost, or because the application has
been implemented in assembly language. If the old architecture ceases to exist and
the old machines stop working, emulation makes it possible to continue running the
application on current machines, maybe even faster than on the old machine. In
general, migrating from one architecture to another can be simplified by emulation,
because the user can have all the benefits of the new system without sacrificing com-
patibility with the old system.
Apple’s DR Emulator [1] is one of the most prominent examples of this kind of emu-
lators. It allowed easy migration from Motorola M68K based computers to PowerPC
machines, by translating old code and even parts of the operating system on the fly.
Intel provides an emulator of i386 (IA-32) Windows applications for IA-64 CPUs.
18 CHAPTER 2. INTRODUCTION TO EMULATION, RISC AND CISC
This can be done by either making the environment of one system available on an-
other system, or by creating a new system that is somewhat close to several existing
systems, and writing emulators of this artificial system for physical systems. The
Java Virtual Machine is such an artificial machine: A Java runtime environment is
an emulator for a system that does not exist. The Java VM has not been designed
to be implemented in hardware, but to be emulated on common desktop and server
architectures. Examples of emulators of one current system on another one for the
sake of interoperability are VirtualPC/Mac (IBM PC on Apple Macintosh), Digital
FX!32 (i386/IA-32 Windows applications on Alpha) and WINE (Windows on Linux,
actually only an API converter)1 .
Most well-known emulators fall into the ”migration” and ”interoperability” categories:
Emulators exist for many classic computer systems and gaming consoles since the 1970s
up to the current generation, mainly for providing classic games or games intended for re-
cent gaming consoles on modern desktop computers. While these are often hobby projects
(Basilisk II, UAE, VICE, SNES9X, Nemu, Dolwin, cxbx), compatibility solutions between
current systems are typically commercial software (VirtualPC, VMware, Win4Lin).
1 It is a matter of perspective whether to regard certain emulators as falling into the migration or interop-
erability categories.
2.1. INTRODUCTION TO EMULATION AND RECOMPILATION 19
• Data Path Accuracy is the highest level of accuracy. It simulates all internals of the
hardware exactly, including the data flow inside the CPU. This kind of emulator represents the software version of a hardware implementation of a system. This level of
accuracy is mostly needed for hardware development, i.e. the emulator is a prototype
of the actual device.
• Cycle Accuracy means emulating the hardware including the ”cycle exact” timing
of all components. This type of emulator only implements the exact specification of the hardware, but not its implementation. Many 1980s computer games (typically on
8 and 16 bit computers or gaming consoles) require this level of accuracy: Because it was easier then than it is today to rely on the exact timing of, for instance, CPU instructions, games could easily exploit timing-critical side effects of the machine for certain effects.
For development of real time operating systems, cycle accurate emulators are needed
as well. For example, VICE [13] is an emulator that emulates a Commodore 64 and
its hardware cycle-accurately.
• Basic-Block Accuracy does not include the execution of individual instructions. In-
stead, blocks of instructions are emulated as a whole, typically by translating them
into an equivalent sequence of native instructions. This level of accuracy does not
allow single stepping through the code, and advanced tricks like self modification
do not work without dedicated support for them. The Amiga emulator UAE [14] is
an example of such an emulator: Motorola M68K code is translated block-wise into
i386 code.
The Nintendo 64 emulator UltraHLE was the first widely-known emulator to apply this technique. Some emulators of computer systems, such as the IBM PC emulators VirtualPC [2] and VMware, as well as the
Amiga emulator UAE can optionally do high-level emulation, if the user installs a set
of device drivers that directly divert all requests to the host machine without going
through hardware emulation.
The lower the level of emulation, the more possible uses there are, but also the more com-
puting power is needed. A data path accurate emulator, if implemented correctly, will
without exception run all software that has been designed for this system, but the speed
of the host machine must be many thousand times faster than the speed of the emulated
system in order to achieve its original speed. Emulators that implement high-level or very-
high-level accuracy cannot be used for all these purposes, and often even only run a small
subset of all existing applications, but they can achieve realtime speed on machines that are
only marginally faster.
For example, for realtime emulation of a Commodore 64 system with its two 1 MHz 8
bit CPUs, a 500 MHz 32 bit CPU is required (VICE emulator, cycle accuracy), but the
same host computer would also run many games of the 93.75 MHz 32/64 bit Nintendo 64
(UltraHLE, high-level accuracy).
• Emulate the complete system including all hardware. The original operating system
will run on top of the emulator, and the applications will run on top of this operating
system.
• Emulate the CPU only and provide the interface between applications and the operat-
ing system. Operating system calls will be high-level emulated, and the applications
run directly on top of the emulator.
There are two special cases regarding CPU emulation: If the CPU is the same on the
host and the emulated machine, it need not be emulated. Also, if the source code of the
application is available, CPU emulation may not be necessary, as the application can be
compiled for the CPU of the host computer. In both cases, only the rest of the machine, i.e.
either the machine’s support hardware2 or the operating system’s API has to be provided to
the application.
2 In the following paragraphs, this support hardware, which includes devices like video, audio and generic
There is also a special case about hardware emulation: If the operating system the applica-
tion runs on is available for the host platform, i.e. only the CPUs are different, it is enough
to emulate the CPU and hand all operating system calls down to the native operating system
of the host.
Table 2.2: Possible solutions for user mode and API emulation

                same API             diff API
    same CPU    -                    API conv. (Wine)
    diff CPU    CPU emul. (FX!32)    CPU emul. & API conv. (Darwine)
If the CPU is the same but the API is different, the code can be run natively. Only the library and kernel calls have to be trapped and translated for the host operating system. Wine [22] is a solution that allows running i386 Windows executables on various i386 Unix systems. It consists of a loader for Windows executables, as well as reimplementations of Windows operating system and library code, most of which make heavy use of the underlying Unix libraries.
If both the CPU and the API are different, the CPU is emulated and the API has to be
simulated as in the previous case. Darwine [23] is a version of Wine that is supposed to
allow i386 Windows applications to run on PowerPC Mac OS X.
If an operating system is available for multiple incompatible CPU types, it is possible to
create an emulator that only emulates the CPU, and passes all system and library calls
directly to the host operating system. DEC/Compaq FX!32 [7] is a compatibility layer that
allows running i386 Windows applications on the Windows version for Alpha CPUs. A
similar solution, called the ”IA-32 Execution Layer”, is provided by Intel to allow i386 Windows applications to run on Windows for IA-64 [9].
Another advantage of a CPU/API emulator is its integration into the host operating system.
A hardware emulator would run as a single application on top of the host operating system,
with all emulated applications running inside the emulator, so working with both native
and emulated applications can be quite awkward for the user. Emulated and native applica-
tions also cannot access the same file system, share the same identity on the network or do
interprocess communication across emulation boundaries. But many hardware emulators
add additional functionality to circumvent these problems by introducing paths of commu-
nication between the two systems. VirtualPC/Mac and the Mac OS Classic Environment,
for example, apply a technique to route the information about running applications and
their windows to the host operating system. ”Classic” applications even appear as separate
applications on the Mac OS X desktop. Host file system and network access are typically
done by installing special drivers into the emulated OS that connect the emulated and the
host interfaces. Of course, again, direct translation of the simulated API to the host API
can be a lot faster than these routes across both systems.
Finally, hardware emulators require a complete copy of the emulated operating system, i.e.
a license has to be bought. CPU/API emulators do not need this license.
Although the advantages of the CPU/API emulator outnumber those of the hardware emu-
lator, it is not the ideal concept for all applications. There are many examples of emulators
based on either of these concepts. For Windows on Linux, there is Wine, but there also is VMware. For Windows on Mac, there is VirtualPC/Mac, but there also is Darwine. And: For Mac on i386 Linux, there is PearPC - but there also is SoftPear.
The heart of every emulator is the CPU emulation, so it must be well designed to achieve maximum accuracy, flexibility or speed, depending on what is needed most. There are two kinds of CPU emulators: interpreters and recompilers.
2.1.3.1 Interpretation
An interpreting emulator is the most obvious type of emulator. It executes one instruction
after another, as the program runs, i.e. it simulates the instruction cycle of the CPU: The
opcode gets fetched, dispatched, operands get decoded, and the instruction gets executed.
After that, the interpreter jumps to the beginning of the interpretation loop. The following
C code illustrates this:
while (1) {
    instruction = *ip;
    opcode = instruction >> 26;          /* primary opcode field */
    switch (opcode) {
    case 0:
        rs = (instruction >> 21) & 31;
        ra = (instruction >> 16) & 31;
        rb = (instruction >> 11) & 31;
        regs[ra] = regs[rs] | regs[rb];  /* "or": ra = rs | rb */
        break;
    case 1:
        ...
    }
    ip++;
}
Interpreters need little memory, are easy to write, easy to understand and because of their
simplicity, even implementations in a high-level language can reach a speed close to the
maximum possible for an interpreter. Typically being written in high-level languages, they
are also very portable. They can reach data path, cycle and instruction level accuracy and
are therefore mostly used for the verification of a CPU during its development cycle, for
cross-development as well as for emulating 8 bit computers and gaming consoles.
Unfortunately, interpreters are inherently slow. The following code is supposed to illustrate
the code flow on the host machine (i386) to interpret a single (PowerPC) instruction (”or
r5, r5, r3”; note that this code is in AT&T syntax, as described in section 2.3.3):
In this example, a PowerPC ”or” instruction is interpreted. i386 instructions 1 and 2 fetch
the 32 bit PowerPC instruction. In instructions 3 to 6, the first opcode field (”31” in this
example) gets extracted. Since the most likely value is ”31”, the code is optimized for this
case, and the branch in instruction 6 is not taken. Instructions 7 to 9 extract the second
opcode field (”444” in this example) and jump to the implementation of the PowerPC ”or”
instruction, using a jump table. The register numbers get extracted in instructions 10 to 18,
in instructions 19 to 21, the actual operation is done, and instruction 22 jumps back to the beginning of the interpreter loop.
This is highly optimized assembly code, and there is probably no way to speed it up sig-
nificantly4 . A single source instruction got expanded to 22 destination instructions in this
case, and many other source instructions would even require more destination instructions
to interpret. But this does not mean that interpretation is about 22 times slower than direct
execution - it is even worse.
Modern processors depend on pipelining. While one instruction is being executed, the next
one is already decoded, and another one is being fetched. Branches and jumps with an
undetermined target are a problem: The instructions after the control flow instruction can-
not be fetched or decoded as long as the target of the control flow instruction is unknown.
If the target is known early enough or the control flow instruction prefers a certain target
(branch prediction), the performance penalty is reduced - but in the case of an interpreter,
the branches and jumps (instructions 6 and 9 in the example) do not have an early-known
target nor do they prefer one certain target.
An AMD Duron (1200 MHz) for example spends about 13 clock cycles in one iteration of
the interpreter loop above, if the indirect jump always hits the same target. If it does not,
one iteration takes about 28 clock cycles (see Appendix A). Given that the source instruction on its native hardware would normally complete in about half a clock cycle on average,
the interpreter is slower than the original by a factor of more than 50. If an i386 and a
PowerPC CPU, both at 1 GHz, have comparable performance, emulating a PowerPC on
the i386 using an interpreter would result in an emulated machine that has the performance
in the class of a 20 MHz PowerPC.
Measurements [56] done by the QEMU authors indicate that Bochs, an interpretive i386
emulator, is about 270 times slower than native code on the same machine. This additional
difference is most likely caused by the fact that CPU flags have to be calculated after every arithmetic or logic instruction on the i386, and that Bochs has to emulate the MMU as well.
2.1.3.2 Recompilation
”The idea of a processor that does Dynamic Binary Translation did not
seem very appealing to venture capitalists. That’s when we came up with the
term ’Code Morphing Software’ as a marketing tool to get more people inter-
ested in funding the Crusoe processor development.” - David Ditzel, Founder
& CEO, Transmeta
Interpreters are slow because of the dispatcher jumps and because so many instructions are
needed for interpreting a single source instruction. These facts have been well-known from
interpreters for high-level programming languages as well: Compilers move the decoding
of the source language from execution time into compile time, increasing execution speed.
4 Reordering the instructions might be a good idea, though.
[Figure: a library of target code sequences, one for each source instruction, such as lwz, stw, stmw, lmw, stwu, add, sub, and, or and xor]
Threaded Code Recompilers using the ”threaded code” [48] [49] concept are the easiest to implement, as they are closest to interpreters. An interpreter loop consists of
the fetch/dispatch sequence and many short sequences that implement the specific source
instructions. A recompiler that makes use of threaded code translates every source instruction into a call of the function implementing that instruction. Typically the decoding of
the instruction is also moved into the recompiler, i.e. the implementations of instructions
do not have to do the decoding, but they are called with the already decoded parameters.
The example from above, which illustrates the i386 code executed when interpreting a single PowerPC instruction, would look like this when changed to a ”threaded code” recompiler:
The main function loads the parameters of the source instruction (in this case 3, 5 and 5 for
”or r5, r5, r3”) into registers and calls the function that implements the source instruction
”or” (instructions 5 to 8), which does the work and returns. This example only requires
8 target instructions, compared to 22 in the case of the interpreter, and it eliminates the
pipelining problem.
The generated code can be improved easily by inlining the code of the subroutines, instead
of calling them:
This way, the source instruction is translated into only 6 target instructions, but the gen-
erated code is still ”threaded code”, because it is composed of fixed sequences. An AMD
Duron 1200 spends about 3 clock cycles on the code above (see Appendix A) — this is
10 times faster than the interpretive version from above, but still about 5 times slower than
native code.
QEMU [16] is an example of a recompiler that makes use of threaded code. QEMU has
multiple source and multiple target architectures, so the architecture had to be as flexible
as possible. The source instructions are implemented as C functions like in an interpreter,
and get compiled into machine code during compilation of the emulator. At run time, these
sequences, which are then embedded into the emulator’s code, will be put together to form
the recompiled code. This architecture makes it easy to add new target architectures, and
source architectures can be added by implementing all source instructions in
C, as would have to be done for an interpreter. QEMU is about 4 times slower than native
code [56].
Some video games have been ”ported” to new platforms by translating them using threaded
code: Squaresoft’s Final Fantasy 4 to 6 rereleases for the Sony PlayStation were statically
recompiled (see below) versions of the originals for the Nintendo SNES. A rather
simple application translated the complete game code from WDC 65816 machine code to
MIPS machine code. The same is true for Capcom’s Megaman 1 to 8 on the PlayStation 2
and Nintendo Gamecube, released as ”Megaman Anniversary Collection”: Megaman 1-6
had been written for the Nintendo NES, 7 for the Nintendo SNES, and 8 for the PlayStation [66].
All in all, a ”threaded code” recompiler can be implemented with little more effort than an
interpreter, but it is radically faster than an interpreter.
Real Recompilation Threaded code is still quite inefficient code. The optimal translation
of ”or r5, r5, r3” into i386 code would look like this:
1 or %eax, %ebx
Given the i386 register EAX corresponds to the PowerPC register r5, and EBX corresponds
to r3 in this context, this i386 code is equivalent to the PowerPC code. This instruction is
just as fast as native code - because it is native code.
[Figure: the recompiler translates the source instructions into target code (recompilation), which is then run (execution)]
A real recompiler does not just put together fixed sequences of target code from a library,
but works a lot more like a compiler of a high-level language. There is no common concept
that is shared by all recompilers. Some recompilers translate instruction by instruction,
others translate the source code into intermediate code, optimize the intermediate code,
and finally translate it into target code. The following sections discuss some common
recompiler concepts.
Although recompilers are radically faster than interpreters, they typically need more mem-
ory. Especially dynamic recompilers (see below) have this problem, as they have to keep
both the source and the target code in memory, as well as lots of meta information. There-
fore, on small systems (e.g. Java on cellphones), interpretation is often preferred to recom-
pilation.
But static recompilation also has some severe problems. The translation from high-level
source code to machine code discarded information. Recompilers have to recon-
struct a lot of information from the machine code, but some information may be impossible
to reconstruct. Distinguishing between code and data is nontrivial: On a von Neumann ma-
chine, code and data coexist in the same address space and can therefore be intermixed. On
the ARM platform, machine code often loads 32 bit data constants from the code segment,
as PC-relative accesses are handy on the ARM.
[...]
ldr r0, constant // ldr r0, [pc, #4]
ldr r1, [r2]
bx lr
constant:
.long 12345
On most platforms, C switch statements are translated into jumps using a jump table
(PowerPC code):
[index to table in r3]
mfspr r2,lr ; save lr
bcl 20,31,label ; lr = label
label: mfspr r10,lr ; r10 = label
mtspr lr,r2 ; restore lr
cmplwi r3,entries
bgt default ; c > entries -> default
addis r6,r10,0x0 ; r6 = label
rlwinm r0,r3,2,0,29 ; r0 = c << 2
addi r5,r6,table-label ; r5 = table
lwzx r4,r5,r0 ; get entry from table
add r3,r4,r5 ; address of code
mtspr ctr,r3 ; ctr = address
bctr ; jump to code
table:
.long case0 - table
.long case1 - table
.long case2 - table
[...]
case0:
[...]
case1:
[...]
By just translating instruction by instruction, the recompiler would treat the jump table as
machine code. This example also shows something else: A static recompiler cannot easily
find out all code in the source program: A tree search through the program (functions
are nodes in the tree, calls are edges), starting at the entry point, would not reach the
destinations of the jump table in the example above.
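This search can be sketched concretely (addresses and control flow are invented for the example): a reachability search that only follows statically visible branches never sees the targets hidden in a jump table.

```python
# Toy illustration: a graph search over direct control flow cannot
# discover targets that are only reached through a jump table.

direct_flow = {              # address -> statically visible successors
    0x100: [0x104],
    0x104: [0x108, 0x200],   # conditional branch
    0x108: [],               # indirect jump (bctr) -- targets unknown
    0x200: [0x204],
    0x204: [],
}
jump_table = [0x300, 0x320, 0x340]   # actual targets of the bctr at 0x108

def reachable(entry):
    seen, todo = set(), [entry]
    while todo:
        addr = todo.pop()
        if addr in seen:
            continue
        seen.add(addr)
        todo.extend(direct_flow.get(addr, []))
    return seen

found = reachable(0x100)
print(all(t not in found for t in jump_table))  # → True: cases missed
```

The search terminates having visited every directly reachable address, yet none of the jump table cases is ever discovered.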
Self-modifying code is regarded as a very bad habit that can break many things - it also makes
it impossible for static recompilers to generate a correct translation, as the original code
alters its behavior by changing parts of itself. Self-modifying code might seem rare today,
but something else can be regarded as self-modification as well: Operating systems load
new code into memory, dynamic recompilers generate code at runtime, and some applica-
tions decrypt or decode machine code before execution. None of these can therefore be
statically recompiled.
UQBT, the University of Queensland Binary Translator [24], is a very powerful static re-
compiler with multiple source and target languages. It is one of the most well-known re-
search projects on binary translation. The ”staticrecompilers” group around Graham Toal
and Neil Bradley [26][27] is developing a recompiler to statically translate classic 8 bit
arcade games into C code, which is then compiled into native code using a standard C
compiler. Finding out all code within the executable is done by playing the game from
the beginning to the end, finding all secrets, and logging what code is reached. Using their
recompiler, the game ”Ms. Pacman”, which had been designed for a 4 MHz Z80 CPU, runs
450 times faster on a 2.4 GHz i386 than on the original hardware, which is about twice the
speed achieved by an optimized interpreter implemented in assembly [57].
Microsoft’s Java competitor, the .NET system, also includes a static recompiler. While
.NET executables are typically dynamically recompiled (i.e. at runtime), ”ngen.exe”, which
ships with the .NET runtime environment, statically translates MSIL (”Microsoft interme-
diate language”) code into native code and replaces the original executable with the trans-
lated version. Complete translation is always possible, because things that can break static
recompilation are impossible in MSIL. The security/safety restrictions of both the high
level language (C# etc.) and MSIL prohibit things such as jump tables of unknown size,
self-modifying code and code/data ambiguity. Since compilation from high-level source
code into MSIL does not discard any data (other than comments and variable names),
ngen.exe might be regarded as the back end of a compiler rather than a real recompiler,
though.
The disadvantage of dynamic recompilers is that they only have limited time to do the
translation, as it will be done just before or during execution of the application. The better
the optimization that is done, the faster the resulting code will be - but the longer translation
takes. As translation and execution time add up to the effective time the user has to wait,
extensive optimization will probably not be worth it. A compromise between translation
speed and speed of the generated code has to be found to maximize the effective execution
speed.
Because of this potential performance problem, there are several techniques to mitigate
it. As there is no need to translate the complete program at the beginning, most dynamic
recompilers only translate a block of code at a time, execute it, then translate the following
block, and so on. This ensures that only blocks that are actually reached get translated;
functionality of an application that is never used by the user will not be translated. These
blocks of code are typically ”basic blocks”. A basic block is a sequence of code that is
atomic in terms of control flow, i.e. it can only be entered at the beginning, and exited
only at the end. So it can contain at most one control flow instruction (jump, branch,
call...), which must be the last instruction of the block, and no instruction in another
(known) block jumps to any instruction of this block other than the first one5 . In the
following example, the borders between basic blocks have been marked:
--------------------------------------
00001eb0 cmpwi r3,0x1
00001eb4 or r0,r3,r3
00001eb8 ble 0x1ee4
--------------------------------------
00001ebc addic. r3,r3,0xffff
00001ec0 li r9,0x0
00001ec4 li r0,0x1
00001ec8 beq 0x1ee0
--------------------------------------
00001ecc mtspr ctr,r3
--------------------------------------
00001ed0 add r2,r9,r0
00001ed4 or r9,r0,r0
00001ed8 or r0,r2,r2
00001edc bdnz 0x1ed0
5 In practice though, many recompilers use bigger blocks than basic blocks, for optimization purposes, although
this can make recompilation slower and lead to code being translated more than once.
--------------------------------------
00001ee0 or r0,r2,r2
--------------------------------------
00001ee4 or r3,r0,r0
00001ee8 blr
--------------------------------------
Every block in the example above is atomic in terms of code flow; there is no instruction
that gets jumped to that is not the first instruction of a block. The following code flow
diagram illustrates this:
[Figure: code flow diagram of the basic blocks in the example above]
A dynamic recompiler can translate basic blocks as an atomic unit. The state between
instructions in a basic block has no significance. A basic block is the biggest unit with
this kind of atomicity, but with some constraints, bigger blocks are possible: For example,
instead of cutting blocks at the first control flow instruction, it is possible to cut them at the
first control flow instruction that points outside the current block.
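Finding these boundaries can be sketched with the classic ”leader” rule: the entry point, every branch target, and every instruction following a branch start a new block. A minimal sketch over the example listing above (the instruction encoding is simplified to tuples for illustration):

```python
# Minimal basic-block splitter for a toy instruction list (a sketch; a
# real recompiler works on decoded machine instructions).
# Each instruction: (address, is_branch, branch_target or None).
code = [
    (0x1eb0, False, None),
    (0x1eb4, False, None),
    (0x1eb8, True, 0x1ee4),   # ble 0x1ee4
    (0x1ebc, False, None),
    (0x1ec0, False, None),
    (0x1ec4, False, None),
    (0x1ec8, True, 0x1ee0),   # beq 0x1ee0
    (0x1ecc, False, None),
    (0x1ed0, False, None),
    (0x1ed4, False, None),
    (0x1ed8, False, None),
    (0x1edc, True, 0x1ed0),   # bdnz 0x1ed0
    (0x1ee0, False, None),
    (0x1ee4, False, None),
    (0x1ee8, True, None),     # blr (indirect return, no static target)
]

leaders = {code[0][0]}                      # the entry point starts a block
for i, (addr, is_branch, target) in enumerate(code):
    if is_branch:
        if target is not None:
            leaders.add(target)             # a branch target starts a block
        if i + 1 < len(code):
            leaders.add(code[i + 1][0])     # the fall-through starts a block
print(sorted(hex(a) for a in leaders))
```

The computed leaders (0x1eb0, 0x1ebc, 0x1ecc, 0x1ed0, 0x1ee0, 0x1ee4) reproduce exactly the block boundaries marked in the listing above.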
There is certainly no need to translate the same block twice, so a dynamic recompiler
has to cache all blocks that have already been translated. After one basic block has been
executed, the recompiler has to look up whether the next one is already recompiled. If it is, it gets
executed; if it is not, it gets translated and then executed.
Dynamic recompilers are a lot more common than static ones. The most prominent example
is probably the Java Virtual Machine in many of its implementations (including Sun’s).
Apple’s ”DR” (Dynamic Recompiler), which simplified the migration of Motorola M68K
to PowerPC processors in Mac OS after 1994, and still ships with the latest version of Mac
OS X, can even run system components and drivers correctly. The open source scene has
produced a lot of dynamic recompilers for different types of source and target CPUs, many
of which have been used in a number of different emulators. Bernie Meyer’s Motorola
M68K to i386 recompiler of the Amiga Emulator UAE, for example, has also been used in
the Macintosh M68K emulator Basilisk II [17] and the Atari ST emulator Hatari [18]. On
a 166 MHz Pentium MMX, UAE/JIT achieves about the same performance as a 50 MHz
68060 [58].
Recompiled code is faster than interpreted code, and dynamic recompilation is more com-
patible than static recompilation, so dynamic recompilation seems to be the most sensible
choice for emulators. But dynamic recompilation is not necessarily faster than interpre-
tation: It moves work out of loops, but the work that has to be done per instruction
increases: Translating an instruction and executing it clearly takes longer than interpreting
it, because the translator also has to generate machine code for the host platform. Therefore
recompilers are slower than interpreters if every instruction is only executed once.
The principle of locality states that the CPU spends 90 % of its time with 10 % of the code -
and 10 % of its time with 90 % of the code. As a consequence, only 10 % of the code really
needs to execute fast, the rest of the code might only be executed a few times or even just
once, so recompilation would not amortize. Moving the dispatching and decoding work
away from the execution of an instruction should only be done if it is moved outside a loop.
It is practically impossible to find out which instructions will be executed most often without
actually executing the code, so one simple algorithm is to interpret code until it has been
executed a certain number of times, and then recompile it. Code that never reaches this
threshold will never be recompiled. This is called the ”hotspot” concept.
[20, p. 27] demonstrates this with the following formulas:
Interpreter: n · ci
Dynamic Recompiler: cr + n · ce
Hotspot Method: if n > t then t · ci + cr + (n − t) · ce, else n · ci

ci: cost of a single interpretation of the code
cr: cost of recompilation of the code
ce: cost of running the translated code
n: number of times the code is to be emulated
t: number of times the code is interpreted before it gets translated
[Figure 2.10: average emulation time per emulation over the number of times the code is emulated, for interpreter, dynamic recompiler, hotspot method and native execution; ce = 2, ci = 25, cr = 100, t = 1]
The average cost of one execution is calculated by dividing the result by n. Figure 2.10
illustrates the formulas.
In this case, dynamic recompilation is faster than interpretation if the code is run at least five
times, but code that is only executed once has a very high cost. The hotspot method
amortizes after six runs, but in case the code is only executed once it merely costs one
interpretive run. After only 30 runs, the hotspot method is just 14 % slower than
dynamic recompilation; after 1000 iterations, the difference is down to 1 %.
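These numbers can be checked directly by evaluating the formulas with the parameters of figure 2.10:

```python
# Numeric check of the cost formulas, using the parameters of
# figure 2.10: ce = 2, ci = 25, cr = 100, t = 1.
ce, ci, cr, t = 2, 25, 100, 1

def interpreter(n):
    return n * ci

def recompiler(n):
    return cr + n * ce

def hotspot(n):
    return t * ci + cr + (n - t) * ce if n > t else n * ci

for n in (1, 5, 6, 30, 1000):
    print(n, interpreter(n), recompiler(n), hotspot(n))
```

For n = 5 the recompiler (110) already beats the interpreter (125); the hotspot method overtakes the interpreter at n = 6 (135 vs. 150), while costing only a single interpretation (25) for code that runs once. At n = 30 it is about 14 % slower than pure recompilation (183 vs. 160), at n = 1000 about 1 % (2123 vs. 2100).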
Sun’s Java Virtual Machine is an example of a hotspot compiler[10]. It interprets Java
bytecode for a while to detect hotspots, and compiles methods later based on the data
gathered. The approach of FX!32, a solution that makes it possible to transparently run
i386 Windows code on the Alpha version of Windows, is somewhat different: When an
application is first executed, it is exclusively interpreted, and runtime behavior information
is gathered. A background process can then use the data to translate the application offline,
without having any time constraints. This is particularly effective with GUI applications,
because they spend a considerable amount of time waiting for user input.
For optimization purposes, modern high-level language compilers try to make optimal use
of the CPU’s registers: they map as many local variables of a function as possible to registers,
in order to minimize the number of memory accesses. As registers are limited, and a
function can have more local variables than the machine has registers, a decision needs to
be made which variables should be mapped to registers. Also, just selecting a subset of the
variables and mapping every one of them to its own register is inefficient: A variable might not be in
use during the complete function, so its register can be reused for another variable while
the original variable is not in use. More generally, variables whose lives do not
overlap can be mapped to a single register.
1 a := 0
2 L1: b := a + 1
3 c := c + b
4 a := b * 2
5 if a < 10 goto L1
6 return c
At first sight, it looks like the function needs three registers for the three variables a, b and
c. But in this case, two registers are enough, because the lives of a and b do not overlap.
A variable is live at a certain point of the function if there is a possibility that the value
it is containing will be used in the future, else it is dead. More formally, it is live if a
path (including jumps and conditional branches) exists from this point of the function to
an instruction which reads the variable, and the variable does not get assigned a new value
along this path. As an assignment overwrites the contents of a variable, the old value is no
longer relevant, so it cannot have been live before. Table 2.3 shows the lives of the variables
in the example above.
a is live in 2 and 5, because it gets read in these instructions. Instruction 4 overwrites a,
so a is dead in instruction 4, but also in instruction 3, because the next access is the write
access in 4, and there will be no read accesses earlier than instruction 4. a is also dead in 1,
because it gets written there, and in 6, because it doesn’t get read any more after instruction
5.
Since the lives of a and b do not overlap, b can be mapped to the same register as a, which is demonstrated here by replacing b with a:
1 a := 0
2 L1: a := a + 1
3 c := c + a
4 a := a * 2
5 if a < 10 goto L1
6 return c
Now all this has to be translated into an algorithm. A control flow graph (CFG) helps
describe a function for this purpose. A CFG is a directed graph whose nodes are the indi-
vidual statements of the function. Edges represent potential flow of control. The CFG of
the example is shown in figure 2.11.
The two arrays succ[] and pred[] contain the sets of potential successors and predecessors
of each node. For example, succ[3] = {4}, succ[5] = {2,6}, pred[6] = {5} and pred[2]
= {1,5}. For the formal definition of liveness, two more arrays are needed: def[] contains
the set of variables written (”defined”) by a specific instruction, and use[] contains the set
of variables read (”used”) by a specific instruction. For instance, use[3] = {b,c} and def[3]
= {c} in this example. Table 2.4 is the complete succ/pred/use/def table.
The formal definition of liveness is as follows: A variable v is live at a certain node n if a
path exists from n to a node m so that v ∈ use[m] and for each n ≤ k < m: v ∉ def[k]. The
liveness analysis algorithm outputs the arrays of sets in[] (variable is live before the instruction)
and out[] (variable is live after the instruction). The following rules help design the algorithm:
1. v ∈ use[n] ⇒ v ∈ in[n]
If a variable is used by an instruction, it is alive just before the instruction.
2. v ∈ out[n] ∧ v ∉ def[n] ⇒ v ∈ in[n]
If a variable is alive just after an instruction and the instruction does not overwrite it, it is also alive just before the instruction.
3. v ∈ in[s] ∧ s ∈ succ[n] ⇒ v ∈ out[n]
If a variable is alive just before any successor of an instruction, it is alive just after the instruction.
Table 2.4: succ[], pred[], use[] and def[] for the example program

n   succ   pred   use   def
1   2      -      -     a
2   3      1,5    a     b
3   4      2      b,c   c
4   5      3      b     a
5   2,6    4      a     -
6   -      5      c     -
foreach n: {
    in[n] = {}
    out[n] = {}
}
repeat {
    foreach n: {
        in'[n] = in[n]
        out'[n] = out[n]
        in[n] = use[n] ∪ (out[n] \ def[n])    // (rules 1 and 2)
        out[n] = ⋃ s∈succ[n] in[s]            // (rule 3)
    }
} until in' = in ∧ out' = out
The algorithm repeatedly looks for new elements of in[] and out[] and terminates when there
have been no new additions. Within one iteration, the three liveness rules are
applied to all nodes. The result for the example is summarized in table 2.5.
A variable v is live at instruction n if v ∈ in[n]. The complexity of this algorithm is between
O(n) and O(n²), n being the number of instructions of the function [47].
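The fixpoint algorithm can be implemented directly; the following sketch runs it on the six-instruction example program (succ, use and def taken from table 2.4; liv_in and liv_out stand for in[] and out[]):

```python
# Liveness analysis by fixpoint iteration for the example program.
succ = {1: {2}, 2: {3}, 3: {4}, 4: {5}, 5: {2, 6}, 6: set()}
use  = {1: set(), 2: {'a'}, 3: {'b', 'c'}, 4: {'b'}, 5: {'a'}, 6: {'c'}}
defs = {1: {'a'}, 2: {'b'}, 3: {'c'}, 4: {'a'}, 5: set(), 6: set()}

liv_in  = {n: set() for n in succ}
liv_out = {n: set() for n in succ}

changed = True
while changed:                     # iterate until a fixpoint is reached
    changed = False
    for n in succ:
        new_in  = use[n] | (liv_out[n] - defs[n])           # rules 1 and 2
        new_out = set().union(*(liv_in[s] for s in succ[n]))  # rule 3
        if new_in != liv_in[n] or new_out != liv_out[n]:
            changed = True
            liv_in[n], liv_out[n] = new_in, new_out

print(liv_in)
print(liv_out)
```

The fixpoint confirms the informal analysis: a is live before instructions 2 and 5 only. It also shows that c is live even at the function entry, because instruction 3 reads c before any instruction writes it.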
foreach n: {
    foreach a ∈ def[n]: {
        foreach b ∈ out[n]: {
            connect(a, b)
        }
    }
}
For every instruction n, every variable that is written to (a ∈ def[n]) is connected to every
variable that is alive just after the instruction (b ∈ out[n]). One exception optimizes the
interference graph: If an instruction is a move command ”a = b”, there is no need to add
the edge (a,b). The interference graph of the example is shown in figure 2.12.
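Given the liveness results for the example program, building the interference graph takes only a few lines (the move exception is not needed here, as the example contains no move instruction):

```python
# Interference graph construction from defs[] and out[] of the
# example program (out[] as computed by the liveness analysis).
defs = {1: {'a'}, 2: {'b'}, 3: {'c'}, 4: {'a'}, 5: set(), 6: set()}
out  = {1: {'a', 'c'}, 2: {'b', 'c'}, 3: {'b', 'c'},
        4: {'a', 'c'}, 5: {'a', 'c'}, 6: set()}

edges = set()
for n in defs:
    for a in defs[n]:          # every variable written by n ...
        for b in out[n]:       # ... interferes with everything live after n
            if a != b:         # no self edges
                edges.add(frozenset((a, b)))

print(edges)   # edges a-c and b-c: a and b never interfere
```

The only edges are a-c and b-c, confirming that a and b never interfere and can therefore share a register.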
Assigning n registers to the nodes of the interference graph now means coloring the graph
with n colors so that no two adjacent nodes have the same color. The algorithm to achieve
this has two phases. In the simplification phase, all nodes are removed one by one and
pushed onto a stack. In every iteration, any node with fewer than n neighbors is selected
and removed. If the graph has no node with fewer than n neighbors, any other node is
selected and removed. This node is a spill candidate, i.e. it is possible that this variable
cannot be assigned to a register and must reside in memory, so a variable that is infrequently
used should be selected.
In the selection phase, the stack is reduced one by one, rebuilding the graph.
Every node that is reinserted into the graph is assigned a color, i.e. a register. If a node
cannot be colored, it is not reinserted but spilled to memory instead. The complexity of
the register allocation algorithm alone is between O(n) and O(n²), n being the number of
variables. In practice, the speed of all algorithms together very much depends on the speed
of the liveness analysis, because the number of instructions is typically a lot higher than the
number of variables.
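The two phases can be sketched compactly, run with n = 2 colors on the interference graph of the example (edges a-c and b-c):

```python
# Simplify/select graph coloring on the example's interference graph.
neighbors = {'a': {'c'}, 'b': {'c'}, 'c': {'a', 'b'}}
n_colors = 2

# Simplification phase: repeatedly remove a node with < n neighbors,
# pushing it onto a stack; otherwise pick a spill candidate.
graph = {v: set(ns) for v, ns in neighbors.items()}
stack = []
while graph:
    v = next((v for v in graph if len(graph[v]) < n_colors), None)
    if v is None:
        v = next(iter(graph))       # spill candidate (not needed here)
    stack.append(v)
    for w in graph[v]:
        graph[w].discard(v)
    del graph[v]

# Selection phase: pop nodes, rebuild the graph, assign the lowest
# color not used by any already-colored neighbor.
color = {}
while stack:
    v = stack.pop()
    used = {color[w] for w in neighbors[v] if w in color}
    free = next((c for c in range(n_colors) if c not in used), None)
    if free is None:
        print("spill", v)           # would have to reside in memory
    else:
        color[v] = free

print(color)   # a and b share a register; c gets the other one
```

With two colors, a and b end up in the same register and c in the other, exactly as argued for the example program; no variable has to be spilled.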
2.2 Endianness
”It is computed that eleven Thousand Persons have, at several Times, suf-
fered Death, rather than submit to break their Eggs at the smaller End. Many
hundred large Volumes have been published upon this Controversy.” - Jonathan
Swift, Gulliver’s Travels
Endianness (”byte order”) describes the order of bytes (or bits) if multiple bytes (or bits)
are to be stored, loaded or transferred. There is no ”natural” endianness, instead, it is a
matter of convention.
A CPU whose bus transfer is limited to (aligned) machine words has no endianness6. Otherwise, there need to be conventions:
• If the CPU wants to access a value that has twice the width of a machine word, will
the low or the high word be stored at the lower address in memory?
• If the CPU wants to access parts of a machine word (for example, a byte) in memory,
does reading from the address of the word read the lowermost or the uppermost bits?
The little endian convention stores the least significant data first. When dividing a 32 bit
value into 8 bit fields, the 32 bit value 0x12345678 would be stored as 0x78, 0x56, 0x34,
0x12. Although most desktop computers in use today use the little endian system, this is only
because most desktop computers are i386 systems. The DEC PDP-11 (the original UNIX
machine), the DEC VAX (VMS), the MOS 6502+ (8/16 Bit Commodore, Apple, Atari,
Nintendo) and all Intel main branch CPUs like the 8080 (CP/M), 8086 and the i386 line
(i386, i486, Pentium, ...) are little endian designs.
The advantage of the little endian byte order is that operations like additions on large data types
can be split into smaller operations, handling them byte by byte, starting at the address of
the value in memory and moving up to higher addresses. This example in 8 bit 6502 code adds two 32
bit values in memory:
clc
lda 10
adc 20
sta 30
lda 11
adc 21
sta 31
lda 12
adc 22
sta 32
6 If values that exceed the width of a machine word have to be processed, the application defines the endianness.
2.3. INTEL I386 41
lda 13
adc 23
sta 33
This is especially useful when working with very large data types in a loop. The 32 bit addition
on the 8 bit CPU in the example could also have been done in a loop, counting up.
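The 6502 code above can be modeled in Python: because the least significant byte comes first, the loop can simply walk from the value's address upwards, propagating the carry (memory layout and addresses are made up for the example):

```python
# Byte-by-byte 32 bit addition over little endian memory,
# mirroring the 6502 example (low byte first, carry propagated up).
def add32_le(mem, a, b, dst):
    carry = 0
    for i in range(4):                       # low byte first
        s = mem[a + i] + mem[b + i] + carry
        mem[dst + i] = s & 0xFF
        carry = s >> 8

mem = bytearray(64)
mem[0x10:0x14] = (0x01020304).to_bytes(4, 'little')
mem[0x20:0x24] = (0x00000100).to_bytes(4, 'little')
add32_le(mem, 0x10, 0x20, 0x30)
print(hex(int.from_bytes(mem[0x30:0x34], 'little')))  # → 0x1020404
```

The loop never needs to know the operand width in advance; making it an 8-byte addition only changes the iteration count.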
Furthermore, with little endian encoding, casting between different sizes is faster: In order
to cast a 32 bit value in memory to 16 bit, it is enough to read a 16 bit value from the same
address, as the least significant byte of the value is stored at a fixed position.
Big endian CPUs store the most significant data first. When dividing a 32 bit value into 8
bit fields, the 32 bit value 0x12345678 would be stored as 0x12, 0x34, 0x56, 0x78. Many
CPUs of the past and the present are big endian, such as the Motorola 6800 and 68000
(M68K) series (32 bit Amiga, Atari, Apple, Sega, Sun), the HP PA-RISC (UNIX) and the
SPARC (Sun). Although most modern CPUs can be switched between both conventions,
they are typically operated in big endian mode (MIPS, PowerPC)7 .
The main advantages of the big endian encoding are that memory dumps are easier to read
for humans, that data can be converted into ASCII text more easily, and that the sign of a value
can be determined more easily, as the most significant byte is stored at a fixed position.
There is no noticeable performance difference between both conventions. Little endian
might seem a little more natural in a mathematical sense, but practically it is only a matter
of taste. It does not matter whether a machine is big or little endian - but endianness
can be a severe problem when machines of different byte order have to interact with each
other, especially when data is transferred from one machine to another machine that has a
different endianness, or when a program written for one machine is supposed to be run on
a machine with a different endianness. In the first case, all 32 bit values would have the
wrong byte order and the bytes would have to be swapped in order to retrieve the correct
value. ASCII strings, for example, would be unaffected, as they are read and written
byte by byte, so they need not be converted. Unfortunately there is no general way to tell what to
convert and what not to convert when changing the endianness of data.
In case the target CPU does not support the byte order of the source CPU, an emulator has
to address the endianness problem manually. This is typically done by byte swapping all
data that is read or written.
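With Python's struct module standing in for the emulator's load/store layer, byte swapping amounts to choosing the foreign byte order on every access (here a big endian source CPU emulated on a little endian host):

```python
import struct

# Sketch: all 32 bit accesses of the emulated (big endian) CPU go
# through helpers that enforce the source CPU's byte order.
mem = bytearray(8)

def store32_be(addr, value):
    struct.pack_into('>I', mem, addr, value)    # '>' = big endian

def load32_be(addr):
    return struct.unpack_from('>I', mem, addr)[0]

store32_be(0, 0x12345678)
print(mem[:4].hex())        # → 12345678 (most significant byte first)
print(hex(load32_be(0)))    # → 0x12345678
```

The value round-trips correctly regardless of the host's native byte order, at the price of a swap on every access.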
2.3 Intel i386
The Intel i386 line is a family of 32 bit CISC CPUs that was first released in 1985
as a backwards compatible successor to the popular Intel 8086 line. There is no consistent
name for this architecture. Intel, who has used the name ”Pentium” for almost all CPUs of
this line since 1992, started referring to it as IA-32 (Intel Architecture, 32 bit) in the
1990s; AMD [68] and VIA [69], who produce compatibles, refer to it as the ”x86
line”. Elsewhere it is also called the i386 or the 80386 line.
7 Probably because UNIX systems are typically big endian, and because some network technologies like
TCP/IP transmit multi-byte values in big endian order (”network byte order”).
The term ”i386” will be used in the following chapters, because IA-32 and
Pentium are both Intel marketing terms, and because ”x86” also includes the more primitive
16 bit versions - this work only deals with the 32 bit instruction set of the i386 line.
2.3.1 CISC
The i386 is a very typical example of a CISC processor. The term CISC means ”Complex
Instruction Set Computing” and was actually first used in the 1980s to differentiate
the new RISC designs from the traditional 1970s designs. These CPUs have two major
characteristics:
2. Few registers: What makes computers expensive today are high-performance CPUs
- in the 1970s, it was memory. While today computers can waste virtually as much
memory as they want to, memory was very limited back then, so everything had
to be done to make maximum use of it. In a world of 4 KB of RAM, this also
meant that machine code had to be very compact9. As a consequence, the encoding
was quite complicated, with some special encodings assigned to very different
instructions in order to save bits, and the number of registers was very limited. 16 registers,
for example, would have meant that 4 bits were needed to encode a register number
- if all instructions with only one register as an operand were supposed to occupy
only one byte, then there would only be 4 bits left to encode the actual operation. As
a compromise, many CPUs had more registers, but allowed certain operations only
on some of them. For example, indexed addressing was only possible with four
8 Compiler technology had already been quite advanced by the 1970s, but complex compilers were not
yet available for microprocessors, which were the low end computing line at that time, as opposed to IBM
System/360-style mainframes.
9 In 1977 for instance, Microsoft ported their famous 8K BASIC interpreter for the Intel 8080 onto the
MOS 6502, which has a less compact instruction coding. The 6502 version did not fit into 8 KB, but occupied
9 KB, which was disastrous at that time, because computers were typically equipped with 8 KB of ROM.
(BX/BP/SI/DI) of eight registers on the 8086, and the Motorola M68K had two sets
of eight registers, one set for arithmetic, one for memory accesses and indexing.
2.3.2 History
The 8086 line of CPUs evolved over more than 25 years, and has its roots even further
in the past. There was no single design group and no formal definition of a new direction.
Instead, every extension was based on what was technologically possible at the time, taking
care not to break backwards compatibility or go too far into new directions.
The following summarized history of the 8086 line of CPUs is supposed to illustrate the
chaotic design of these CPUs [30] [31]:
• 1974: The 8080, which is similar to the 8008 (whose design has been specified by
Intel’s customer Computer Terminal Corporation) from two years earlier, is released.
It has a 16 bit address bus (64 KB) and eight 8 bit registers A, F (flags), B, C, D, E,
H, L, of which the B/C, D/E and H/L combinations could be used as 16 bit registers.
The 8080 and its immediate successors and clones (8085, Zilog Z80) are very popular
CPUs and are a requirement for the most successful 8 bit operating system, Digital
Research CP/M.
• 1978: The 8086 is released. It has a 20 bit address bus (1 MB) and eight 16 bit (non-
general-purpose) registers AX, BX, CX, DX (”X” means ”extended”), SI, DI, BP, SP.
The former four registers can be split into eight 8 bit registers (AL, AH, BL, BH, CL,
CH, DL, DH). It is not binary compatible with the 8080, but it is compatible on the
assembly level: By mapping A to AL (lower half of AX), HL to BX, BC to CX and
DE to DX, 8080 assembly code could be transformed into 8086 assembly code easily
by an automated program [59]. Also the memory model is compatible, as the 8086
addresses its memory in 64 KB ”segments”. The 8086 was supposed to simplify the
migration from 8 bit CP/M to 16 bit CP/M-86. The 8086 had its breakthrough in
1982 with the IBM PC, running MS-DOS, a CP/M-86 clone by Microsoft.
• 1980: The 8087 floating point extension unit is released, which has been designed
by an independent team in Israel instead of the 8086 team in California. As the 8087
is only an extension to the 8086, instructions are fetched by the 8086 and passed to
the 8087. Since the protocol between the two chips requires the amount of transferred data to be minimal,
instructions for the 8087 cannot be long enough to contain two registers as operands,
so the designers chose to create a floating point instruction set that needs one operand
at most, by equipping the 8087 with a stack consisting of 8 registers. Binary instruc-
tions can either work on the two uppermost values on the stack, or with the value on
the top-of-stack, and another value taken from any position in the stack. Stack over-
flows are supposed to be handled in software, using exceptions, to simulate a stack
of an arbitrary size. Unfortunately, a design flaw 10 made this idea, which certainly
doesn’t excel in speed, ultimately uneconomic.
10 Instead of returning an address pointing directly to the faulty instruction, so it could be executed again,
• 1982: The 80286 is released, which sports a 24 bit address bus (16 MB), a more
sophisticated memory model and simple memory protection. The CPU can switch
between ”real mode” (8086 compatible) and ”protected mode” (new memory model
and protection). Two design flaws in the 80286 are patched by IBM in its IBM AT
computers: As the address bus is now 24 bits wide, real-mode addresses above 1 MB
do not wrap around to the bottom of RAM, as some 8086 applications required. This
is fixed by IBM’s A20 gate, which can be instructed to hold the CPU’s address
line 20 down to zero, simulating the wrap-around. Intel later moved this feature into the CPU, and
it is still present in today’s i386 CPUs. The other flaw made it impossible to switch
from protected mode into real mode11 . Many old (real mode) applications wanted
to access memory above 1 MB by switching into protected mode, copying memory
and switching back, but this was only made possible when IBM added functionality
that would set a flag in the NVRAM, soft-reset the CPU, detect in the startup ROM
whether the flag had been set, and continue execution in the original code if it was
the case. This is fixed in the i386 with a direct way to exit protected mode, though
the reset method is still supported by ROM for compatibility.
• 1985: The i386 is released. It has eight 32 bit registers, EAX, EBX, ECX, EDX, ESI,
EDI, EBP, ESP (”E” means ”extended”, once again, so ”EAX” is the ”extended accumulator
extended”), which are a lot more general purpose than the 8086’s registers,
improved and orthogonalized instructions and addressing modes, a (theoretical) 32
bit address bus (4 GB) and virtual (paged) memory. The i386 can still access the 16
(”AX”) and 8 bit parts (”AL”, ”AH”) of all registers and still has the 8086 real mode
and the 80286 protected mode next to the new 32 bit protected mode and the virtual
8086 mode. However, a design flaw of the 32 bit protected mode made the i386
unsuited for many modern operating systems like Windows NT, because when exe-
cuting in kernel mode, write-protection on memory pages is not enforced12 . Other
than switching temporarily to 32 bit protected mode to copy between the low 1 MB
and the extra memory above 1 MB, and some games running on so-called ”DOS
Extenders”, the i386 extensions were hardly used for 10 years, until the release of
Windows 95. Yet, the i386 extensions have been the most important so far to the
8086 line.
• Between 1985 and 1997, although the i386 line of CPUs was drastically optimized
internally with each new CPU, there were no significant extensions to the i386 on
the ISA level. The i486 (1989) and the Pentium (1992) introduced only a few new
instructions (dealing with semaphores13 and endianness), and the Pentium Pro (1995)
11 Actually,it was later discovered that a secret Intel instruction, known as ”loadall”, was capable of exiting
protected mode into real mode. However, by the time this knowledge reached the public, the i386 had already
been released.
12 Newer operating systems use the ”copy-on-write” memory strategy frequently, by marking pages as read-
only to detect writing, but with this misguided i386 ”feature”, the operating system had to manually check
write permission in system calls, a huge bottleneck. The i486 corrected this by adding a new mode, disabled
by default for compatibility with the i386, that causes write-protection to be fully enforced even in kernel
mode.
13 The Pentium’s ”cmpxchg8b” brought along with it a security bug (”F00F”) that nearly forced Intel to
recall the Pentium a second time.
Let’s summarize: The latest 64 bit CPUs are extensions of a 32 bit architecture (by Intel)
based on a 16 bit architecture (by Intel) that had been designed to be somewhat compati-
ble with an 8 bit architecture (initially designed by Computer Terminal Corporation). The
floating point design was based on properties of the 8086 interface and slowed down appli-
cations until the advent of SSE2 in 2001, which superseded the old stack. The latest i386
CPUs use the 8086 instruction encoding, enhanced with prefixes to differentiate between
16/32/64 bit memory and data, and to access the upper eight registers. Even the Athlon
64 and the Pentium 4/EM64T still start up in 16 bit real mode - also the PC-based gaming
console Microsoft Xbox executes its first few cycles in 16 bit code. Nevertheless, the i386
is the most successful architecture on the desktop, mainly because of its DOS/Windows
success.
14 The new eight 64 bit registers are, though bad for performance, physically mapped to the 8 floating point
registers, so that operating systems do not have to be patched to save more registers on a context switch.
15 neither ”EEAX” nor ”EAXX”
The 8008/8080/8086 heritage is important in order to understand some of the quirks of the
i386, but in the following chapters, only the 32 bit i386 architecture without the floating
point or SIMD units will be of importance as its instruction set is the common denomina-
tor of the i386 architecture and it is used by virtually all currently available applications
designed for the 8086 line.
2.3.4 Registers
The i386 has eight 32 bit registers which are more or less general purpose: EAX, ECX,
EDX, EBX, ESP, EBP, ESI, EDI16 . Table 2.7 summarizes the original purposes of the
specific registers on the 8086.
16 The order is not alphabetical, as this is their internal numbering: The registers were most probably not
designed to be A, B, C and D; instead, four registers have been designed and names have been assigned to
them later. Clever naming led to A, B, C and D, which did not conform to the original order.
On the i386, these registers are pretty much general-purpose. Some deprecated instruc-
tions like ”enter”/”leave” and ”loop” use specific registers, but all instructions that are still
typically used today work with any of the registers. There are two exceptions to this:
”mul”/”div” still always imply EAX/EDX, and some complex ”scale indexed” addressing
modes do not work with the stack pointer ESP. Although ESP can be used in any other
instruction or addressing mode (it can contain the source value for a multiplication or be
shifted right, for example, which is not typically useful for a stack pointer), it is not practi-
cal to use it for any other purpose. In some modes of the CPU, exceptions and interrupts for
example will always push the flags onto the stack using ESP as a stack pointer, and some
operating systems also assume ESP to point to a stack.
The lower 16 bits of each of the eight 32 bit registers can be accessed by omitting the ”E”
from the register name. For EAX, EBX, ECX and EDX, it is also possible to access the
upper and the lower 8 bits of the lower 16 bits separately: AH, AL, BH, BL, CH, CL, DH,
DL. (There is no way to access the upper 16 bits of a register directly or to split them as
is possible with the lower 16 bits.) So the i386 can effectively work natively with 8, 16
and 32 bit values.
There are two more special purpose user mode registers: The 32 bit program counter EIP
(instruction pointer), which cannot be addressed in any other way than by explicit control
flow instructions, and the 32 bit EFLAGS register, of which only four bits (sign flag,
overflow flag, zero flag, carry flag) are typically used in user mode17. EFLAGS can only
be accessed by placing it onto or taking it from the stack, or by copying the lower 8 bits to
or from the AH register.
17 EFLAGS also stores the current mode of the CPU, which makes virtualisation of an i386 pretty tricky,
2.3.5.1 Characteristics
Depending on the type of the instruction, it can have zero, one, two or three operands, but
”imul” is the only instruction that supports three operands. This means that there can be
no operation with two sources and one destination (like a = b + c), instead, one operand is
both one source and the destination (like a += b). The same applies to unary operations:
Shifts, for example, work on a single register, overwriting the old value.
Instructions with one operand can have, depending on the instruction, an immediate value,
or a register or a memory address (”r/m”) as an operand. Instructions with two operands can
have register/register, immediate/register, immediate/memory, register/memory or mem-
ory/register as operands. This is all combinations except for memory/memory. All these
addressing modes are encoded using the ”r/m” encoding.
b(register1,register2,a)
register1 and register2 are any 32 bit registers (register2 cannot be ESP), the scaling factor
a is either 1, 2, 4 or 8, and the displacement b can be 8 or 32 bits wide. This addresses
the memory location calculated by register1 + register2 ∗ a + b. All components of this
format are optional (of course at least register1, register2 or b must be present), so seven
addressing modes are possible, as shown in table 2.8.
Some of these addressing modes are very powerful. The following example is supposed to
illustrate this: A program stores its data at a (flexible) memory base address, which is stored
in register EBX. A data structure of this program is an array of 32 bit integer variables. The
offset of this array from the beginning of the flexible memory region has been hard-coded
by the compiler, it is 0x10000. The program now wants to increment the value in the array
whose index is stored in ECX. The following assembly instruction does exactly this:
incl 0x10000(%ebx,%ecx,4)
This instruction has to do a shift and an addition of three values to calculate the address,
but nevertheless all modern i386 CPUs calculate the address in a single step, that is, the
instruction is just as fast as incl 0x10000 or incl (%ebx).
This loads the ”address” EBX+ECX into EAX - so this effectively adds EBX and ECX and
stores the result in EAX. This is another form of lea abuse.
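The example that followed here is not reproduced in this excerpt. A typical instance of this kind of lea abuse (an illustrative sketch, not the original listing) is the multiplication by five that the text refers back to later:

```asm
lea (%eax,%eax,4),%eax   # EAX = EAX + EAX*4 = EAX*5, in a single instruction
```

The addressing hardware computes base + index*scale anyway, so multiplications by 3, 5 or 9 can be encoded without touching the multiplier unit at all.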
2.3.8 Endianness
The i386 is one of quite a few architectures that store values occupying more than 8 bits
in little endian order, and it has no option to operate in big endian mode.
9|LOCAL DATA     |
 |---------------|
8|PARAMETERS     |
...
7|PARAMETERS     | = 8(EBP)
6|return address |
5|saved EBP      |<- EBP
4|LOCAL DATA     | = -4(EBP)
...
3|LOCAL DATA     |
2|saved registers|
...
1|saved registers|<- ESP
 +---------------+
push $2
push $1
call function
add $8, %esp
This sequence pushes two parameters, ”1” and ”2” onto the stack in reverse order, calls the
function and removes the parameters from the stack afterwards by adding 8 to the stack
pointer. The function would look like this:
push %ebp
mov %esp, %ebp
sub $0x10, %esp
push %ebx
push %esi
...
mov 8(%ebp), %eax
mov 12(%ebp), %ebx
mov %eax, -4(%ebp)
mov %ebx, -8(%ebp)
...
pop %esi
pop %ebx
add $0x10, %esp
pop %ebp
ret
First EBP gets saved and the ESP gets copied into EBP. 0x10 is subtracted from the stack
pointer to make room for local variables on the stack. As the function might change
EBX and ESI, it pushes these registers. The body of the function can now access the
first operand at ”8(%ebp)” and the second one at ”12(%ebp)”. Space for local variables is
from ”-4(%ebp)” down to ”-0x10(%ebp)”. At the end, the function restores EBX and ESI and
removes the stack frame by adjusting the stack pointer and restoring the base pointer.
2.4 PowerPC
”The PowerPC is a hybrid RISC/CISC” - Cristina Cifuentes [43]
The PowerPC architecture is a formal definition for 32 and 64 bit CPUs, released in 1991
by Apple, IBM and Motorola.
2.4.1 RISC
The PowerPC architecture is an example of a RISC (”Reduced Instruction Set Computing”)
CPU. RISC is a concept developed at IBM, Stanford and UC Berkeley in the late 1970s
and early 1980s to overcome the typical deficiencies of CPUs of the 1970s [60] [52].
RISC has two main characteristics:
1. Fewer and simpler instructions: Compilers usually generate code that only uses a
small part of the instruction set of a CISC processor. Bart Trzynadlowski’s statistics [61]
show that the five most frequent instructions18 account for more than 70% of all
instructions executed on a Motorola M68K. The idea of the ”quantitative approach
to computer design” [31] is to make the common case fast. Complex instructions are
removed from the instruction set, which makes the microcode interpreter obsolete
and makes decoding easier, even more so as all instructions have the same width. All
remaining instructions can be executed in one clock cycle, which makes pipelining
and superscalar execution easy. The design is a lot simpler, work has been moved
from hardware to software.
2. More registers, load/store architecture: The lack of a microcode interpreter and a
complex instruction decoder frees up transistors that can be used otherwise: better
pipelining logic, multiple ALUs and, most importantly, many registers. If the CPU
has more registers, more operations can be done on registers, so less memory ac-
cesses are needed, which can be a significant performance improvement. The term
”load/store architecture” refers to the fact that memory accesses are only possible
through load and store instructions; all other operations only work on registers.
The time required for a program to execute is calculated using this formula:

time/program = time/cycle · cycles/instruction · instructions/program
CISC CPUs keep the number of instructions per program low, while the time/cycle is high.
Because RISC CPUs do not have complex instructions, the number of instructions per
program is higher, but this is compensated by a lower number of cycles per instruction. In
practice, 32 bit PowerPC code is less than 10% larger than 32 bit i386 compiled code19,
but the fact that today’s compilers generate RISC-like code also for CISC CPUs must be
taken into account, so CISC code could be a lot smaller.
18 tst, beq, bne, move, cmp, cmpi
2.4.2 History
In 1990, IBM releases the RS/6000 (RS stands for ”RISC System”; the line was later renamed
to ”IBM pSeries”) UNIX workstation based on their new ”POWER” (”Performance Optimization
With Enhanced RISC”) multi-chip CPU, which they would like to see being used in the
desktop market as well [62]. At the same time, Apple is looking for a
new CPU for their Macintosh computer line, as they want to switch away from the Motorola
M68K CISC architecture which has not been able to keep up with other architectures on
the market. Motorola has released the M88K RISC CPU in 1988, but the market showed
little interest.
So Apple, IBM and Motorola (”AIM”) collaborate to design a CPU architecture. IBM
contributes the instruction set and the design of the POWER CPU, Motorola contributes
the hardware interface as well as the memory management and cache system of the M88K,
and Apple contributes compiler construction knowledge used to refine the instruction set.
In 1991, the ”PowerPC Architecture Definition” (for 32 and 64 bit processors) is released.
”Architecture Definition” means the exact software specification of the minimum function-
ality of a PowerPC CPU, independent of a possible implementation, as it has been done
with the SPARC and MIPS architectures before. In 1992, the MPC601, the first PowerPC
CPU, is released, which is a hybrid between the POWER and PowerPC architectures that
implements some instructions that were unique to the POWER architecture (the remaining
ones could be trapped and emulated in software) in addition to being fully PowerPC com-
pliant, so that it can be used in RS/6000 machines as well. In 1994, the first Macintosh
with a PowerPC processor is released, the Power Macintosh 6100, introducing Motorola’s
PowerPC 601.
Aside from Macintosh systems, the PowerPC has since become very important in the
embedded market, especially in automotive and multimedia applications. The Nintendo
GameCube gaming console is driven by a PowerPC 750 (G3) by IBM, and the successor
to Microsoft’s Xbox will include three 64 bit IBM PowerPC 970 (G5) cores.
Since the formal definition of the PowerPC architecture, it has hardly changed. The def-
inition already included 32 and 64 bit implementations, but a 64 bit PowerPC did not
become available until 2003 with the IBM PowerPC 970 (G5). The only addition was the AltiVec
(called ”VelocityEngine” by Apple) SIMD unit, which was introduced with Motorola’s G4
CPUs in 1999.
In the following chapters, only the 32 bit implementation of the PowerPC without the
floating point and AltiVec units will be of importance.
19 ”/bin/bash” of OpenDarwin 7: the PowerPC code is 598048 bytes, the i386 part 552891 bytes (8.2%);
”/usr/bin/unzip”: the PowerPC code (Mandrake Linux 9.1) is 111500 bytes, the i386 code (SuSE Linux 8.2)
is 109692 bytes (1.6%)
2.4.3 Registers
The PowerPC has 32 general purpose registers, r0 to r31, each of them 32 bits wide. All
of these registers can equally be used for logical and arithmetic operations. There is one
exception to this: Some instructions cannot use r0 as a source, and use the encoding for an
operand of the value ”0” instead. Furthermore, there are eight 4 bit condition registers cr0
to cr7, which are typically written to by compare instructions and evaluated by conditional
jumps.
The PowerPC has two more special purpose registers: In case of a function call, the link
register ”lr” gets set to the address after the call instruction, and the return instruction just
jumps to the address stored in lr. This eliminates two stack accesses for the leaf function in
a call tree. The count register ctr is mostly used in conditional jumps: the conditional jump
instruction can optionally decrement the count register and make the branch additionally
dependent on whether ctr has reached zero. ctr is also used for calculated jumps, by copying
the target address into ctr and using the ”bctr” instruction.
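A counted loop built around ctr could look like this (an illustrative sketch; the register choice and iteration count are arbitrary):

```asm
        li      r5,10        # load the iteration count
        mtctr   r5           # move it into the count register
loop:
        # ... loop body ...
        bdnz    loop         # decrement ctr, branch back while ctr != 0
```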
Unlike on the i386, user and supervisor registers are strictly separated, which makes the
PowerPC easy to virtualize, and also simplifies recompilation of user mode code.
grep "ˆ[0-9A-F][0-9A-F] .*" 386intel.txt | wc -l (returns 585) is a rough estimate for the number of pos-
sible opcodes. grep "ˆ[0-9A-F][0-9A-F] .*/r" 386intel.txt | wc -l (returns 91) gives an estimate
of instructions that support ”r/m” addressing, and are thus available in 7 different addressing modes.
Arithmetic and logic instructions can optionally write information about the result (positive, negative,
zero, overflow) into the first condition register (cr0). This is indicated by the suffix ”.” of
the mnemonic. The fact that this is optional is handy for both the hardware implementation
of the CPU (less dependencies in the pipeline) and for the implementation of an emulator
(less unnecessary work).
The compare instructions can write the result in any of the eight condition registers. There
is a compare instruction for signed and for unsigned data, so that the condition register
will contain the precise information about the result instead of flags like carry, sign and
zero, i.e. the condition register does not have to be interpreted later, and the conditional
jump instruction need not be aware of whether the compared values have been signed or
unsigned. These conditional jump instructions can make the jump depend on any of the
32 (eight times four) condition register bits and/or on whether the count register ctr, which
can optionally be decremented by the conditional jump instruction, has reached zero.
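The division of labor described above can be modeled in a few lines (a Python model for illustration only, not part of any emulator): the PowerPC compare resolves signedness immediately and stores a verdict, while the i386 compare records raw flags that only the conditional jump interprets.

```python
MASK = 0xFFFFFFFF

# PowerPC: cmpw/cmplw decide signedness and store LT/GT/EQ bits in a
# condition register field; the branch just tests one bit.
def ppc_cmp(a, b, signed=True):
    if not signed:
        a &= MASK
        b &= MASK
    return {"lt": a < b, "gt": a > b, "eq": a == b}

# i386: "cmp" records raw flags of a - b; whether the values were
# signed is only decided by the jump ("jl" vs. "jb").
def i386_cmp(a, b):
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "sf": result >> 31,                      # sign flag
        "zf": int(result == 0),                  # zero flag
        "cf": int(a < b),                        # carry flag: unsigned borrow
        "of": ((a ^ b) & (a ^ result)) >> 31,    # signed overflow
    }

def jl(f):  # signed "less than" on the i386: SF != OF
    return f["sf"] != f["of"]

def jb(f):  # unsigned "below" on the i386: CF set
    return f["cf"] == 1
```

For -1 compared against 1, the PowerPC verdict is "lt" for the signed compare and "gt" for the unsigned one, while on the i386 the very same "cmp" result satisfies jl but not jb.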
2.4.6 Endianness
The PowerPC CPUs can work both in big endian and in little endian mode, big endian
being the native mode though. This means that data storage works in big endian mode by
default, and a little endian mode can be simulated. The CPU does not reverse the byte order
of all data when accessing memory, but leaves word accesses unchanged and only changes
the effective memory address in case of non-word memory accesses. This behavior is
transparent to the user, but it is not fully compatible with a real little endian mode when
accessing memory mapped hardware.
All major PowerPC operating systems (Mac OS/Mac OS X, Linux, Nintendo GameCube
OS, Xbox 2 OS) operate the PowerPC in big endian mode.
...
12|PARAMETERS |
...
11|PARAMETERS |
10|LR |
9|previous SP |
|---------------|
8|saved registers|
...
7|saved registers|
6|LOCAL DATA |
...
5|LOCAL DATA |
4|PARAMETERS |
...
3|PARAMETERS |
2|space for LR |
1|previous SP |<- SP
+---------------+
Stack frames and calling conventions on the PowerPC are quite complex. Apple’s descrip-
tion [34, pp 31ff] occupies six pages, for example.
As the general purpose register r0 is somewhat special in that it cannot be used as an in-
dex, the next register, r1, is used as the stack pointer by practically all operating systems,
although any register other than r0 could be used. The stack grows from bigger
addresses to lower addresses. The PowerPC ”EABI” (”Embedded Application Binary In-
terface”) standard [33] also defines the layout of the stack and calling conventions, that is,
how to pass parameters and return values.
Inside a function, the stack looks like figure 2.14 (bigger addresses on top). The stack
pointer points to an address (1) that contains the previous stack pointer, in this case 9. The
address above (2) is empty and can be used by a subfunction called by this function to save
the link register. So this function saved the link register at address 10. At the top of the
stack frame (7 to 8), the non-volatile registers that the function modifies are saved. Below, local
variables are stored (5 to 6) and just as much space is reserved to fit all parameters accepted
by the subfunction with most parameters (3 to 4). The parameters for this function are
stored in the caller’s stack frame (11 to 12).
Although there is space for all parameters on the stack, most parameters are passed in
registers: If the parameter data is less than 8 times 32 bits, the parameters are stored in r3
to r1022. If there are more parameters, the caller stores them in its stack frame, leaving the
first eight addresses (for r3 to r10) empty.
Results up to 32 bits are returned in r3, 64 bit results in r3 and r4, and larger results are
written to memory pointed to by the implicit parameter r3. The stack pointer (r1), r13-r31
and cr2-cr4 must be preserved during a function call, all other GPRs, condition registers as
well as the count register (ctr) can be overwritten. Due to the nature of function calls on
the PowerPC, the link register will always be overwritten by the function call instruction.
Typical code to create and destroy a stack frame looks like this:
22 The rules are somewhat more complicated when floating point values are passed.
23 The IBM PowerPC 970 (G5) lacks the feature to switch the endianness, which made VirtualPC 6 incom-
patible with the new Apple Macintosh line.
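The listing itself is not contained in this excerpt; a representative prologue/epilogue matching the stack layout described above might look like this (a sketch; the frame size of 32 bytes and the exact register set saved are illustrative assumptions):

```asm
func:
        mflr    r0             # copy the link register into r0
        stw     r0,4(r1)       # save LR in the caller's LR slot
        stwu    r1,-32(r1)     # create the frame and store the back chain
        # ... function body ...
        lwz     r0,36(r1)      # reload the saved LR (old SP + 4)
        mtlr    r0             # restore the link register
        addi    r1,r1,32       # destroy the frame
        blr                    # return through LR
```

Note how, unlike on the i386, a leaf function that modifies no non-volatile registers could skip all of this and simply execute "blr".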
Chapter 3
Design
This chapter describes the ideas, concepts and the concrete architecture of the PowerPC to
i386 recompiler that has been developed in the context of this thesis.
ning, RISC CPUs have been a lot less complex than CISC CPUs: The first ARM CPU
(RISC, 1986) only had 30,000 transistors, while the Motorola 68000 (CISC, 1979) had
68,000 - the RISC CPU had less than half the transistors, although it was seven years
younger and a lot more powerful. Today, the differences have become smaller due to the
increased importance of other units than the instruction decoder, as table 3.1 demonstrates
(data taken from [64]).
The i386 CPUs need to support many legacy features, and their instruction set is certainly
less than optimal. The most important deficiency is the very low number of registers (only
eight), but since 2004, the new lines of AMD and Intel i386 CPUs (AMD64/EM64T) have
had 16 general purpose registers - more than the ARM RISC CPU, which has one
register fewer1.
After all, the i386 is not so bad nowadays, as it is a lot closer to RISC CPUs like the
PowerPC than it used to be. In practice, modern i386 CPUs achieve about the same perfor-
mance as their PowerPC counterparts. In certain situations, CISC CPUs can even be faster.
Hand optimized i386 code can sometimes encode an operation a lot more efficiently, either
because a microcoded operation is still faster than a function in assembly in some cases
(e.g. i386 memory copy ”movs”) or because more tricks are possible, as the instruction set
is more complex (for example the multiplication by five shown earlier).
So modern RISC and CISC CPUs are very much alike, and both have their unique advan-
tages. Sure, RISC is the cleaner design, but both designs have their right to exist. This is
another motivation for targeting CISC with a recompiler.
• Too few registers: The PowerPC has 31/32 general purpose registers (r0 might not be
regarded as fully general purpose), while the i386 only has 8, some of which are the
only ones that can be used in certain combinations (multiplication, stack). Compiled
PowerPC code typically makes full use of the registers, often using more registers
than would be necessary, in order to keep dependencies between instructions low.
1 r15 is the program counter, which is a non-GPR on the ARM.
Mapping only some PowerPC registers to i386 registers and mapping the remaining
ones to memory is generally slower than should be necessary.
• Incompatible condition registers: The PowerPC has eight condition registers, com-
pared to the single one of the i386. In addition, they are incompatible: PowerPC
condition codes store a concrete result (greater, lower, equal), while the i386 flags
store data that has to be interpreted later (sign, overflow, carry, zero). So on the
PowerPC, only the compare instruction has to be aware of whether the values are
signed or unsigned, while on the i386, only the conditional jump has to be aware
of it. Explicitly calculating flags can be fatal for an emulator, as every source com-
pare/branch combination (which are expensive by definition) may lead to several
additional compare/branch instructions on the i386, in order to manually test the re-
sult. Furthermore, the PowerPC only updates the condition codes after an arithmetic
instruction if a bit in the instruction code is set, while the i386 always updates the
flags, so the i386 flags can hardly be used to store PowerPC condition codes, as they
would be frequently overwritten.
• Missing link register: Unlike the PowerPC, the i386 has no link register. Any i386
register can take the place of a link register, though, because all it has to support is
load, store, load immediate, and a jump to the address pointed to by the value of the
register. Unfortunately this has drawbacks: i386 registers are few, and this would
occupy yet another register, and jumps to locations pointed to by registers are slower
on the i386 than a simple ”ret” instruction, as the latter case is very optimized: ”ret”
instructions are detected very early so that the target of the jump will also be known
very early to fill the pipeline in time, but jumps to values in registers are typically
only handled by the branch prediction logic, so the pipeline might not be filled in
time.
• Different stack conventions: The stack is very different on the PowerPC compared
to the i386. First, there is no instruction that could be mapped to ”push”: While
values are ”pushed” on the stack (store/sub/store/sub...) on the i386, they are stored
(SP indexed with displacement) on the stack on the PowerPC, decreasing the stack
pointer once afterwards (store/store/store/sub). The i386 can emulate this behavior,
but it bloats the code. Second, the complete conventions are different: for example,
passing values to functions in registers is not possible on the i386; instead,
they would be stored on the stack. As before, the i386 can emulate this behavior,
but it is not optimal, as its ”native” stack structure would be a lot faster for many
reasons, one of them being that indexing with EBP is faster than with ESP because
of the shorter instruction encoding.
in calculated jumps, it cannot. This is worse when translating from the PowerPC to
the i386: Address calculation is complicated and cannot be performed in a single in-
struction, so the complete sequence would have to be detected and replaced with i386
code. Translating from the i386 to any architecture would have been simpler: One
single instruction (something like jmp *table_address(,%ebx,4)) does
the complete address calculation and can therefore be translated more easily.
• Endianness: In most applications, the PowerPC is big endian, while the i386 is
always little endian. If all memory accesses always were aligned 32 bit accesses, this
would be no problem, but the PowerPC supports byte and halfword accesses as well
as unaligned accesses. Every (unaligned or non-word) access by the PowerPC would
have to go along with endianness conversion on the i386.
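The cost of the last point can be illustrated with a small model (Python, for illustration only; the byte values are arbitrary): every word load from big endian guest memory that ends up in a little endian host register needs a byte swap, which a recompiler would typically emit as an i386 ”bswap”.

```python
import struct

guest_mem = bytes.fromhex("11223344")   # big endian PowerPC memory image

word_guest = struct.unpack(">I", guest_mem)[0]  # what the PowerPC sees
word_host = struct.unpack("<I", guest_mem)[0]   # a naive i386 load

def bswap32(x):
    # what an emitted i386 "bswap" instruction does
    return ((x & 0xFF) << 24 | (x & 0xFF00) << 8 |
            (x >> 8) & 0xFF00 | (x >> 24) & 0xFF)

assert bswap32(word_host) == word_guest
```

Byte and halfword accesses need analogous (and differently sized) conversions, which is why every non-word or unaligned guest access carries extra host instructions.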
• high compatibility
• low latency
• high performance
At first sight, these three objectives seem incompatible. Low latency is mutually exclusive
with high performance, because significant improvements in the quality and speed of the
generated code can only be achieved by making the recompiler slower, thus increasing
the latency of execution. Similarly, high speed and high compatibility are incompatible:
The higher the accuracy the slower the emulation, as shown earlier.
emulation that is a lot faster possible. For high compatibility to be achieved,
a dynamic recompiler must be used instead of a static one, and a fallback interpreter
must be included for calculated jumps and jump tables.
• Low latency: A recompiler that produces good code increases latency. Therefore
the hotspot principle must be applied: An emulation method with low latency will be
used for most of the code, and hot spots will be optimized.
• High speed: High speed can be gained by a recompiler that produces very good code.
Of course this can only be done for hotspots, in order not to increase latency. This
”make the common case fast” principle can also be applied in a second way: Today’s
code is typically compiled code, not hand optimized assembly. Furthermore, as only
user mode emulation has to be done, operating system code like mode switches do
not have to be coped with. Therefore the system is supposed to be optimized for
machine code that has been produced by a compiler. Since compilers produce some
typical sequences, additional optimizations can be done.
Another idea to improve the performance of the generated code is to make maximum use
of the characteristics of the target CPU. Emitting good target code need not be slower than
emitting bad code. A rule of thumb for good target code is: If the produced code looks
like code that would be emitted by a standard compiler, then the code is fast, as CPUs are
optimized for compiled output as well.
[Figure: recompiler architecture - the basic block cache and block linker feed the pass 1
recompiler; pass 2 analysis, register allocation and the pass 2 recompiler form the optimizing path]
The pass 1 recompiler passes every single instruction to the instruction recompiler and,
optionally, to the disassembler for debugging purposes. The instruction recompiler trans-
lates a single PowerPC instruction into i386 code; PowerPC registers get mapped to i386
registers or memory locations by the register mapper.
Pass 2 logic is the core of pass 2 recompilation. It finds whole functions, passes them to
the register allocator in order to get the optimal register allocation for every function, and
then passes them to the pass 2 recompiler. With the help of pass 2 analysis, register usage
statistics are gathered, then liveness analysis is done and finally an optimal allocation is found.
The pass 2 recompiler passes every single instruction to the instruction recompiler, just as
in pass 1, but this time the instruction recompiler is in a different mode so that it will use
the dynamic register allocation and optimize stack accesses.
Some of the following sections do not describe the complete design at once. Instead,
only the aspects needed for a certain functionality are described first, and additions that
are needed for other functionality are described later. This approach has been chosen to
make the design easier to understand, although it makes it harder to make practical use of
the information without reading to the end.
average cost per run = cost of recompilation / number of times + cost of execution
This results in high costs if the code executes only a few times, given that recompilation is slow. The
hotspot method eliminates the peak at the beginning by interpreting at the beginning and
recompiling later. For few runs this is faster than recompilation, and for many runs this is
just a little slower than recompilation.
But the two assumptions that have been made are not entirely accurate. Recompilation is
not necessarily much slower than interpretation. A recompiler optimized for compilation
speed might be only slightly slower than an interpreter that has been optimized for speed.
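This trade-off can be made concrete with a small cost model (Python; the interpretation and translation cycle counts of 28 and 34 match the examples discussed in this section, while the native execution cost of 7 cycles is an illustrative assumption):

```python
INTERPRET = 28   # cycles to interpret one instruction once
TRANSLATE = 34   # one-time cycles to recompile one instruction
NATIVE = 7       # assumed cycles to execute the recompiled instruction once

def interpreter(n):
    # interpretation pays the full price on every execution
    return INTERPRET

def recompiler(n):
    # the translation cost is amortized over n executions
    return (TRANSLATE + n * NATIVE) / n

def hotspot(n, threshold=3):
    # interpret the first `threshold` runs, recompile afterwards
    if n <= threshold:
        return INTERPRET
    cycles = threshold * INTERPRET + TRANSLATE + (n - threshold) * NATIVE
    return cycles / n
```

With these numbers, the recompiler already overtakes the interpreter at the second execution (24 versus a constant 28 cycles per run), and the hotspot strategy avoids the recompiler's expensive first run while converging toward the same per-run cost for large n.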
[Figure: average emulation time per emulation over the number of times code is emulated,
comparing an interpreter, a dynamic recompiler, a hot spot system and native execution]
Also, while it is true that a lot of code only gets executed once per run of an application,
most applications are run multiple times, and many of them, such as UNIX tools, very
often. If we do not count a single machine, but a complete network of machines, this is
even more true.
As it has been shown earlier, the part that makes an interpreter so slow is the dispatcher
jump or jumps and decoding. The example needed 28 cycles for the interpretation of one
instruction, 15 of which are caused by the unpredictable jump. The following code is based
on the interpreter, but it has been changed so that the instruction is not executed; instead,
i386 code is emitted that matches the PowerPC instruction:
While 3 instructions were necessary for the execution, the recompiling part in this example
consists of 10 instructions for the translation (instructions 19 to 28). One iteration of the
interpreter took 28 cycles - this recompiler code takes 34 (see Appendix A). While the chart
above assumed that recompilation was slower by a factor of 4, in this case, recompilation
is only 26% slower than interpretation.
The recompiler can be this fast because it does not do any optimizations at all and maps
all PowerPC registers to memory locations. The ”mapping”, in this case, is a 32-entry table
that contains the addresses in memory used for register emulation. Of course, the code
produced by this recompiler is not optimal:
a1 00 00 01 00 mov 0x00010000,%eax
0b 05 04 00 01 00 or 0x00010004,%eax
a3 08 00 01 00 mov %eax,0x00010008
This code might be slower by a factor of 3 or 4 than the output of a more optimized recom-
piler that uses i386 registers. It is certainly possible to improve the quality of the output
code, but this will slow down the recompiler.
A compromise has to be found between recompilation and execution speed. A sensible goal
is a recompiler that consumes less than double the time of the interpreter. The produced
code should be slightly better than the example code above.
Eliminating the interpreter (except for certain jump constructs) has two important advan-
tages: There is less code in memory during execution, and there is a lot less source code to
maintain. A system that includes a full interpreter and a full recompiler has roughly double
the lines of the recompiler alone.
Of course, a recompiler that has double the cost of an interpreter emits better code than
shown above, but the code is probably still far from optimal. Therefore it makes sense to
apply the hotspot method, combining a fast and an optimizing recompiler.

[Figure 3.3: average emulation time per emulation vs. number of times the code is emulated,
comparing the traditional hotspot method with the recompiler/recompiler hotspot method]

Instead of the interpreter, a fast recompiler already translates the
PowerPC code when a block is first encountered. Subsequent runs are done by executing
the recompiled code. After a certain number of times a block of code has been run, it will
be recompiled by an optimizing recompiler, which is of course a lot slower than the fast
recompiler, but it also produces a lot better code.
The following graph compares the traditional hotspot method with the recompiler/recompiler
hotspot method:
These are the formulas that have been used:
Hotspot method: if n > t then t·ci + cr + (n − t)·ce, else n·ci
Rec/Rec hotspot method: if n > trr then cr1 + trr·ce1 + cr2 + (n − trr)·ce2, else cr1 + n·ce1
cr1 and cr2 are the costs of pass 1 and pass 2 recompilation. ce1 and ce2 are the costs of pass
1 and pass 2 execution. trr is the threshold for pass 2 recompilation.
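These formulas can be written directly as Python functions (the rec/rec formula charges trr runs at pass 1 execution cost before the optimizing pass takes over; all cost values are abstract cycle counts chosen for illustration):

```python
def hotspot(n, t, ci, cr, ce):
    # traditional: interpret t times, then recompile once and run natively
    if n > t:
        return t * ci + cr + (n - t) * ce
    return n * ci

def rec_rec_hotspot(n, t_rr, cr1, ce1, cr2, ce2):
    # rec/rec: fast pass 1 recompile up front; after t_rr runs, an
    # optimizing pass 2 recompile pays off for the remaining runs
    if n > t_rr:
        return cr1 + t_rr * ce1 + cr2 + (n - t_rr) * ce2
    return cr1 + n * ce1
```

Dividing either result by n gives the average emulation time per emulation that the graphs plot.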
In the example shown in figure 3.3, the traditional hotspot method is of course faster by
a factor of 2 when executing an instruction once, because it interprets the instruction. Up
to the threshold of the recompiler/recompiler hotspot method, the recompiler/recompiler
method is faster, because the slower recompilation of the traditional method has not amor-
tized yet. The second pass of the recompiler/recompiler method makes this method slower,
but only slightly. After 100 runs, they are practically equally fast.
There are a lot of numbers to play around with: If a lot of code runs up to four times, it
makes sense to set the threshold of the traditional hotspot method to something higher, like
five. In this case, the recompiler/recompiler method is faster in all cases other than 1 (figure
3.4).
[Figure 3.4: average emulation time per emulation vs. number of times the code is emulated;
the same comparison with the traditional hotspot threshold set to five]

In order to achieve a translation time that is less than twice the interpretation time, there is
little choice in methods and algorithms. As the complexity of the interpreter is O(1), the
complexity of the recompiler must be O(1) as well, so all instructions must be translated
separately.
Memory Mapping The simplest method is of course the ”do nothing” method. Instead
of mapping some source registers to target registers, all source registers are mapped
to memory. An array in memory contains all registers, and every register access in the
source language will lead to a memory access in the recompiled code.
”Do nothing” is the fastest method in terms of recompilation speed. All that has to be done
is read the register-to-memory mapping from a table and emit opcodes that have memory
locations as operands. The sequence that does the actual translation does not contain a
single conditional jump:
Of course, the ”do nothing” method produces very inefficient code, because every register
access in the source language will be translated into a memory access in the recompiled
code. The example code that is supposed to translate ”or r1, r2, r3” into i386 assembly
would produce the following code:
Modern CPUs can do one memory access every three clock cycles without being delayed [37],
so in practice there may be about one memory access every six instructions. This solution
has a memory access in nearly every instruction, so it might be about five times slower than
the same code using registers instead:
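A sketch in Python of such a table-driven ”do nothing” emitter for the PowerPC ”or” instruction; the register-array base address 0x00010000 matches the hex dump shown earlier, while the function names and the regular base + 4·r layout are illustrative assumptions:

```python
import struct

REG_BASE = 0x00010000  # assumed base address of the in-memory register array

def reg_addr(r):
    # memory location emulating PowerPC register r (assumed layout)
    return REG_BASE + 4 * r

def emit_or(rd, ra, rb):
    # translate PowerPC "or rd, ra, rb" using only memory-mapped registers:
    #   mov mem[ra], %eax ; or mem[rb], %eax ; mov %eax, mem[rd]
    # no conditional jump is needed anywhere in this emitter
    code = b""
    code += b"\xa1" + struct.pack("<I", reg_addr(ra))      # mov moffs32, %eax
    code += b"\x0b\x05" + struct.pack("<I", reg_addr(rb))  # or m32, %eax
    code += b"\xa3" + struct.pack("<I", reg_addr(rd))      # mov %eax, moffs32
    return code
```

With registers 0, 1 and 2, this reproduces the byte sequence of the hex dump shown earlier.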
Static Allocation Static register allocation means that a hardcoded set of source registers
will be mapped to target registers, and the remaining registers will be mapped to memory.
The recompiler will be slower, of course. In a naive implementation, it would check for
every operand register whether it is mapped to a register or to memory, leading to one
conditional jump per operand:
case_rxx:
mov mapping(%ebx), %ebx
cmp $8, %ebx
jb case_rrx
[...]
-----------------------------
case_rrx:
mov mapping(%ecx), %ecx
cmp $8, %ecx
jb case_rrr
[...]
-----------------------------
case_rrr:
[...]
Three register operands, each of which is either register or memory mapped, lead to 8
combinations: The code has to test one operand register after another in order to reach the
correct one of the eight cases. Since not all instructions have three operands, the average
number of non-predictable conditional jumps per instruction is probably around 2.
Using a trick, an implementation in assembly can reduce this to one single non-predictable
jump:
jmp *jumptable(,%edx,4)
jumptable:
.long case_mmm, case_mmr, case_mrm, case_mrr
.long case_rmm, case_rmr, case_rrm, case_rrr
The information whether a source register is register or memory mapped is converted into
one bit (”cmp”), and the three bits are combined in one register (”rcl”), so that the resulting
number from 0 to 7 represents the combination. Using an eight-entry jump table, the correct
case can be reached with just one jump.
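The bit-combining step can be modeled in Python (a sketch; the convention that mapping values below 8 denote i386 registers mirrors the ”cmp $8” test above):

```python
def dispatch_index(operands, mapping):
    # build a 3-bit index from one "is register mapped" bit per operand;
    # the index 0..7 selects one of case_mmm ... case_rrr in the jump table
    idx = 0
    for reg in operands:
        idx = (idx << 1) | (1 if mapping[reg] < 8 else 0)
    return idx

CASES = ["case_mmm", "case_mmr", "case_mrm", "case_mrr",
         "case_rmm", "case_rmr", "case_rrm", "case_rrr"]
```

With mapping = {1: 3, 2: 32, 3: 5} (r1 and r3 register mapped, r2 memory mapped), dispatch_index([1, 2, 3], mapping) yields 5, selecting case_rmr with a single indirect jump.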
LRU The LRU (”least recently used”) algorithm is an adaptive method. At the beginning
of every basic block, all registers are stored in an array in memory, and all target registers
are empty, i.e. no source register is mapped to a target register. If an instruction that has to
be translated needs to write a value into a source register which is not mapped to a target
register, the recompiler has to do the following:
• If a target register is free, map the source register to it.
• If no target register is free, take the target register that has been least recently used.
Emit code to write its old contents to the representation of the register in memory,
and map the new source register to this target register.
If an instruction needs to read a register, the same will be done, and code will be emitted
that reads the source register’s representation in memory into the target register. At the
end of a basic block, code must be emitted that writes all mapped registers that have been
modified back into memory.
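A minimal Python model of this allocator (a sketch, not the thesis's implementation; the ”load rN”/”store rN” strings stand for the emitted memory-access code):

```python
class LRUAllocator:
    def __init__(self, n_target):
        self.n_target = n_target  # number of available target registers
        self.map = {}             # source register -> target register
        self.lru = []             # source registers, least recently used first
        self.dirty = set()        # source registers modified since last store
        self.code = []            # emitted (pseudo-)instructions

    def get(self, src, write=False):
        if src not in self.map:
            if len(self.map) < self.n_target:
                tgt = len(self.map)        # a free target register
            else:
                victim = self.lru.pop(0)   # least recently used source reg
                tgt = self.map.pop(victim)
                if victim in self.dirty:   # write its old contents back
                    self.code.append(f"store r{victim}")
                    self.dirty.discard(victim)
            self.map[src] = tgt
            if not write:                  # reads must fetch the old value
                self.code.append(f"load r{src}")
        if src in self.lru:
            self.lru.remove(src)
        self.lru.append(src)               # mark as most recently used
        if write:
            self.dirty.add(src)
        return self.map[src]

    def end_block(self):
        # at the end of a basic block, write all modified registers back
        for src in sorted(self.dirty):
            self.code.append(f"store r{src}")
        self.map.clear(); self.lru.clear(); self.dirty.clear()
```

With two target registers, writing r1 and r2 and then reading r3 evicts r1 (the least recently used), emitting a store for r1 and a load for r3.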
The LRU idea is certainly a good one, as it makes sure that registers that are extensively
used within a sequence of code are mapped to target registers. The main problem of this
method is its dependency on the size of basic blocks. Basic blocks in the typical definition
are quite small:
--------------------------------------
00001eb0 cmpwi r3,0x1
00001eb4 or r0,r3,r3
00001eb8 ble 0x1ee4
--------------------------------------
00001ebc addic. r3,r3,0xffff
00001ec0 li r9,0x0
00001ec4 li r0,0x1
00001ec8 beq 0x1ee0
--------------------------------------
00001ecc mtspr ctr,r3
--------------------------------------
00001ed0 add r2,r9,r0
00001ed4 or r9,r0,r0
00001ed8 or r0,r2,r2
00001edc bdnz 0x1ed0
--------------------------------------
00001ee0 or r0,r2,r2
--------------------------------------
00001ee4 or r3,r0,r0
00001ee8 blr
--------------------------------------
This iterative Fibonacci implementation, taken from 2.1.4.2, consists of six basic blocks.
The average basic block size in this example is 2.5 instructions. This may not be represen-
tative, but basic blocks are rarely longer than 10 instructions, especially within loops, when
it would be most important [50]. This would mean that this method is barely faster than
the ”do nothing” method, as in practice, most registers would be read, updated and written
again:
Table 3.2: Speed comparison of the three register allocation methods (estimated average
clock cycles)

             memory   static   LRU
recompiler   10       30       50+
code         10       2        1-5
As a solution, blocks would have to be bigger. In this example, the whole algorithm could
be regarded as a single block. The benefit would be register mapped registers within all
loops. Unfortunately, if blocks are bigger than basic blocks, the mapping from source to
target code addresses has to be managed at this level already. Also, it is more likely that
an instruction within the block turns out, only after translation, to be the target of a jump
from outside, so the block or at least a part of it has to be translated again. And
finally, while it is known that basic blocks will always be executed completely, translating
bigger blocks may lead to translated code that will never be executed.
Estimating the time needed per instruction for this algorithm is more complicated. For
every operand register a non-predictable check has to be made whether it is already mapped.
These three checks can be combined into a single check, as before. If the source register
is not mapped, the first free target register has to be found. This loop can be unrolled, but
will always consist of one non-predictable jump. If no target register is free, the oldest time
stamp has to be found in another loop. Again, this loop can be unrolled, and by using the
i386 conditional move instruction (”cmov”), no conditional jump is necessary.
So the recompilation of one instruction will need at least one branch to find out the mapping
situation. One more branch is needed for every first use of a source register. In case many
registers are used, the amount of code that has to be executed for the algorithm might also
be significant.
As said before, the quality of the emitted code depends a lot on the size of the blocks.
In case of small blocks, code quality is barely better than when doing memory mapping,
but it can be very good when using very big blocks. With bigger blocks, code quality is
especially good for loops. No matter what registers are used within the loop, they can be
practically always register mapped. But the bigger the loops are, the more registers are
used, the slower the algorithm will be.
Conclusion Table 3.2 summarizes the estimated average number of cycles spent recom-
piling an instruction using one of the three methods described above.
Table 3.3: Speed comparison of the three register allocation methods, dispatching and
decoding included (estimated average clock cycles)

             memory   static   LRU
recompiler   35       55       75+
code         10       2        1-5
The ”do nothing” (memory mapping) method consists of few instructions, but both the
recompiler and the produced code include three memory accesses, so they will take about
10 cycles each on a modern CPU. The static method consists of a little more code and has
three memory accesses, but includes one branch instruction, which leads to about 30
cycles for the recompiler. Given that 50% of all source register accesses can be mapped to
target registers and an instruction has two register operands on average, 50% of the
produced sequences should execute in about one cycle, and the other 50% in three cycles,
as they require a memory access, leading to an average of 2 cycles.
It is a lot more complicated to give numbers for the LRU method. The recompiler has three
memory accesses and one branch as well. If every instruction has two register operands on
average and for every register access, the probability is 50% that the register is not mapped,
then one register has to be mapped per instruction, adding the (optionally unrolled) search
loop and another branch. 50 clock cycles is an optimistic estimate which will only be true
if few registers are used, blocks are small and the emitted code is not optimal and might
need more than 5 cycles to execute. In case of bigger blocks and more translation time, the
code quality can be excellent, and one instruction might be executed in one cycle.
In order to be able to compare these values, 25 cycles have to be added. They represent
the dispatcher and the decoder that are necessary in every iteration of the recompiler (table
3.3).
In this comparison, static register allocation is only slightly slower than memory mapping
when code is executed once (127% of the costs), about just as fast when executing code
twice (107%) and already faster when executing code three times (94%). After 1000 runs,
the limit, static allocation being five times as fast, is practically reached. Compared to an
interpreter (28 cycles per instruction), static allocation needs about twice as many cycles
(196%), which matches the original objective.
On small blocks, LRU is slower and produces slow code. On big blocks, it produces code
that is faster than when using the static method, but translation alone should take about
twice as long in this case. LRU has a lot of potential, but it is inappropriate as a very fast
algorithm.
As the static method is only slightly slower than memory mapping but produces clearly
better code, this method has been chosen.
• How many target registers can source registers be mapped to? How many target
registers are needed for other purposes?
[Figure: estimated amortized cycles per instruction vs. number of runs (1 to 100), for
memory mapping, static allocation, LRU (best and worst case), and the interpreter]
• What source registers are supposed to be mapped? What are the most important
source registers?
• Some PowerPC and i386 registers have a special purpose. Are there optimal target
registers for certain source registers?
• How can the stack pointer and the link register be emulated - can they be treated like
any other register?
Number of Mapped Registers The i386 has eight general purpose registers. EAX, EDX
and ESP are not-so-general purpose: all multiplication and division instructions (except for
”imul”) use EAX and EDX as implicit parameters, and the stack pointer cannot easily be
used as a general register. At first sight, it would make sense to map source registers to all
i386 registers other than EAX and EDX, which should be free for temporary operations,
and to map the PowerPC stack pointer to ESP.
It has to be found out how many scratch i386 registers are needed to translate PowerPC in-
structions that use memory mapped registers. The ”or” instruction with its three operands,
for example, needs one i386 scratch register; there is no way to do it without one:
If the PowerPC instruction has three operands, and all of them are memory mapped, only
one scratch register is necessary. But indexed memory accesses need up to two scratch
registers. ”stw r10, 8(r10)” for example would be translated into this:
The last instruction needs two i386 registers, so two scratch registers are needed2 . So it
makes sense to use EAX and EDX as scratch registers.
The i386 stack pointer ESP cannot be easily used as something other than a stack pointer.
In case of an interrupt/exception, or, in user mode, in case of a signal, ESP is always used
as the stack pointer and data is written at the address ESP points to. So ESP could either not
be used and always point to some writable location, or it could be used as the stack pointer.
The PowerPC stack (typically) grows from higher to lower addresses, just like on the i386,
so the stacks are basically compatible. Being a general purpose register, all instructions
and addressing modes (except for the rather esoteric scaling of an index) are also possible
with ESP. There is just one disadvantage: the encoding of ESP-indexed addressing modes
is slightly longer than that of indexed addressing modes with other registers, but this should
only be a cosmetic problem.
Mapped Source Registers Now that we know that we can map 5 source registers and the
stack pointer to i386 registers, we need to know what source registers should be mapped.
The performance of the recompiled code certainly depends on the choice of the registers.
Unfortunately, compilers tend to use more registers than necessary. Compiled code that
uses 20 variables stores them in 20 registers, as there are enough registers, and this min-
imizes dependencies in the pipeline of the CPU. So in this case, only a maximum of 5
registers would be register mapped, and 15 registers would be memory mapped.
However, some registers are still used a lot more often than others. r3, for example, is used
both for the first parameter and for the return value. r1 is the stack pointer, which is also
used often. Quite precise statistics are very easy to gather in a UNIX environment like Mac
OS X. The following shell script dumps register usage statistics for all GUI executables:
for i in *; do
i=$(echo $i | sed -e s/.app//g);
echo "/Applications/$i.app/Contents/MacOS/$i";
otool -tv "/Applications/$i.app/Contents/MacOS/$i" >> disass.txt;
done
for i in r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15 r16 \
r17 r18 r19 r20 r21 r22 r23 r24 r25 r26 r27 r28 r29 r30 r31 \
ctr lr bl.0 bcl \\. cr1 cr2 cr3 cr4 cr5 cr6 cr7; \
do (t=$(grep "$i[^0-9]" disass.txt | wc -l); echo $t - $i); done
t=$(grep cmp disass.txt | grep -v cr[1-7] | wc -l); echo $t - cr0
2 Theoretically, it is possible to use only one scratch register, by ”push”ing register+20*4 and popping it
[Figure 3.6: number of instructions accessing each register (r0-r31, lr, ctr, cr0-cr7) in the
analyzed executables]
The condition code registers cr1 to cr7 can easily be counted, as they are explicitly men-
tioned in the disassembly. The use of cr0 is found out by counting all ”cmp” instructions
that do not use any other condition register and all arithmetic and logic instructions that
modify cr0 (instructions with the ”.” suffix). The real number of cr0 accesses should be
about twice as high, as compare/dotted instructions and conditional jumps should always
come in pairs.
Since the link register can be used implicitly (”bl” and ”bcl”), it is being searched for using
”lr”, ”bl” and ”bcl”; the results have to be added afterwards. The representation of ctr is not
completely fair, because conditional jumps that modify ctr are not counted, but ctr is used
rarely anyway. It should also be noted that this script only counts the number of instructions
that access a certain register, not the total register accesses - a single instruction can read
and write the same register for example. Thus, these statistics are not infinitely precise, but
good enough.
Analyzing over 10 million instructions from about 80 Mac OS X GUI applications results
in figures 3.6 and 3.7. r0 and r2 are popular, because they are typically used as scratch
registers by compilers. r1 is the stack pointer and it is of course heavily used to access the
stack. As mentioned earlier, r3 contains both the first parameter of a function and the return
value, so it is no wonder that it is the most frequently accessed register. As the registers
starting with r3 are used as parameters, these are used more often. The same is true for
r31 and the following registers (counting downwards), because these must be preserved by
a called function and thus often contain values that the caller will need after the call.

[Figure 3.7: number of instructions accessing each register, sorted by frequency; r3, r1,
r0, r4 and r2 lead, followed by r7, r5, r9, r6, r8 and lr]

The
most frequent condition registers are cr0 and cr7; the other condition registers are hardly
ever used.
The six most frequently used registers are r0, r1, r2, r3, r4 and lr. About 51% of all register
accesses use these registers. This is a very good quota for the static allocation algorithm:
although only 18% of the PowerPC registers can be mapped to i386 registers, about 51%
of all accesses will be register mapped. Fortunately, the PowerPC stack pointer r1 is among
these six registers, so the i386 ESP register can be used to map r1 without wasting a register.
Interestingly, cr0 is only in position 10, but cr0 and cr7 combined would be in position
six.
Mapping EAX and EDX are scratch registers, and r1 will be mapped to ESP. What about
the other registers? Basically, it does not matter. Two of the remaining five i386 registers
(EBX and ECX) can be used as 8 bit registers as well, so they are more valuable than ESI,
EDI and EBP. For example, if r3 is mapped to EBX, the instruction ”stb r3, -16(r1)” can be
translated into:
But if r3 is mapped to ESI, it would have to be translated, using a scratch register, into:
[Figure 3.8: number of byte accesses per register (r0 to r4)]
The script shown earlier can be used to find out what registers are most often used for byte
accesses (figure 3.8).
It is not surprising that the stack pointer is virtually never used to load bytes from memory.
The scratch registers r0 and r2 are used most often for this purpose; they do about 54% of
all byte accesses. So r0 and r2 should be mapped to EBX and ECX.
The remaining three i386 registers are equivalent; none has an advantage or a disadvantage,
so all possible mappings would be basically the same. However, the produced code is more
readable if the PowerPC link register is mapped to EBP: in typical i386 code, the EBP
register is practically never used for calculations, only for addresses, and is therefore
similar to the PowerPC link register.
So the resulting mapping looks like this:
EAX scratch
EBX r0
ECX r2
EDX scratch
ESI r3
EDI r4
EBP lr
ESP r1
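Written down as a small table such as a recompiler might consult (a sketch; the names are illustrative, and unlisted PowerPC registers remain memory mapped):

```python
# static register mapping derived above: PowerPC source -> i386 target
STATIC_MAP = {
    "r0": "ebx", "r1": "esp", "r2": "ecx",
    "r3": "esi", "r4": "edi", "lr": "ebp",
}
SCRATCH = ("eax", "edx")  # reserved for temporary values

def target_of(src_reg):
    # returns the mapped i386 register, or None for memory mapped registers
    return STATIC_MAP.get(src_reg)
```

All other PowerPC registers fall through to the memory-mapped case, exactly as in the ”do nothing” method.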
Link Register Emulation An interpreter would emulate the link register just like every
other register. It is just another register that can contain a 32 bit value and can be read
and written. In practice, it always contains the return address to the caller in the PowerPC
address space, although the interpreter need not be aware of this fact.
So recompiled code cannot just do a ”return” sequence as it would be done on the PowerPC
(”jmp *lr”), because the address in the emulated link register points to the address of the
original PowerPC code that called this function instead of to the translated code. The
address would have to be translated first. This could be done either by jumping back to the
recompiler, which can look up the address in a table, or by inserting recompiled code that
can read the value from the recompiler’s tables independently.
Both methods are slow, because an address has to be looked up every time a function
returns. If the translated code has to be independent of external data structures and fast, the
link register has to contain a pointer to the i386 code that resumes execution. Given EBP is
the i386 link register, this would lead to the following code that represents a function call
(”bl”):
On a PowerPC, ”bl” writes the address of the next instruction into the link register and
jumps to the given address. There is no i386 instruction that does the same, so two instruc-
tions are necessary. The code above also writes the address of the next instruction, this time
an i386 instruction, into the link register and jumps to the given address. One disadvantage
of this method is that the i386 code is no longer relocatable, as it contains explicit absolute
references to code addresses within the code3 . Furthermore, the produced code is quite big
(11 bytes), because two 32 bit addresses have to be encoded instead of just one.
A return instruction (”blr”) can be encoded a lot more easily:
jmp *%ebp
This i386 instruction jumps to the address contained in the EBP register. This is analogous
to the original ”blr” behavior, which jumps to the address in the link register. This
method is not identical and fully compatible to the original behavior, though: the
emulated link register contains i386 addresses instead of the original PowerPC addresses.
An application could easily copy the link register into a general purpose register and find
out that it might not even point into PowerPC code. In user mode, though, it is very unlikely
that an application uses the link register for anything other than saving return addresses
without ever looking at its contents.
It could be argued that this introduction of a link register to the i386 architecture can actu-
ally speed up execution and that using a link register is faster than storing the return address
on the stack. Unfortunately, this is not true. i386 CPUs are optimized for call/ret subrou-
tines: They manage return addresses using an internal return stack, so they know the return
3 If relocatability is required, the sequence ”call nextline” ”nextline:” ”pop %ebp” ”add $11, %ebp” ”jmp
function” could be used, with obvious large size and execution time costs.
address well before the ”ret” instruction reads the address from the conventional stack [37,
p. 96]. Calculated jumps do not enjoy these optimizations. If the return address is not
known early enough, this will lead to a pipeline stall and thus to a significant delay. The
emulation of a link register is not optimal, but required, in order to keep the stack contents
compatible.
A compare instruction on an i386 sets the EFLAGS register according to table 3.6. ”S”
is the sign flag, ”C” the carry flag and ”Z” the zero flag. The compare instruction does
not care whether the operands have been signed or unsigned, so it sets all three flags. The
”jcc” conditional jumps evaluate the EFLAGS register as shown in table 3.7. Depending
on whether the conditional jump treats the operands of the compare instruction as signed
or unsigned values, the sign or the carry flag is evaluated.
Because the pass 1 of the recompiler translates all instructions independently, the complete
information that is returned by the PowerPC compare instruction must be saved so that
the conditional jump instruction can access it. On an i386, the information returned by a
compare is the result if the values are regarded as unsigned integers, as well as the result
if the values are regarded as signed integers. On the PowerPC, this information consists
of the actual result, and the information whether the comparison regarded the operands
as unsigned or as signed values. When translating from PowerPC to i386, the missing
information, that is, whether the compare has been signed or unsigned, must be saved
together with the result.
There are several solutions for the representation of the PowerPC condition codes on an
i386.
jmp label2
signed:
jle label1
label2:
nop
”lahf” transfers the i386 flags into the ah register. The parity flag, which is otherwise
unused, is used to signal that the compare was unsigned. This value is then stored in a
memory location that represents the first PowerPC condition register (cr0). Instructions
that use other condition registers than cr0 would be translated accordingly.
When the flags need to be evaluated, the value will be read from memory again and copied
into the i386 flags register (”sahf”). If the parity bit is set, it has been a signed compare,
and the signed jcc instructions need to be used (label ”signed”).
This solution needs 3 instructions for the compare sequence (not counting the ”cmp”) and
4-5 instructions for the conditional jump. This makes a total of 7.5 instructions - but the
cascaded conditional jumps are disastrous for any CPU pipeline.
i386 flags stored in memory should always be in a format that can be correctly evaluated
with the signed jcc instructions. So in case of an unsigned compare, the resulting flags
have to be converted, either by copying the C flag (bit #0) into the S flag (bit #7) or by
using a 256 byte table (flags signed to unsigned). In case of a signed compare (assuming
signed compares are more frequent than unsigned ones, which is typically the case4 ), no
such conversion has to be done. To evaluate the flags, the flags value has to be copied back
into the flags register, and a signed jcc is enough for the branch.
This solution needs 2 (signed compare) or 3 (unsigned compare) instructions for the com-
parison and 3 instructions for the conditional jumps. This makes a total of 5.5 instructions.
The memory access we need for all unsigned compares (≤50% of all compares) might be
a performance problem, though.
4 This can be proven easily by counting cmpw and cmpwi instructions in a disassembly: 80% of all
compares are signed - and instructions with a ”.” suffix are not even counted yet.
This solution converts the i386 flags into PowerPC format, using one of two 256 byte
tables, one for signed and one for unsigned flag conversion. The result is then stored in
memory. To evaluate it, there is no need to load it back into the i386 flags register; a ”test”
instruction is enough. This instruction tests whether one or more bits are set or cleared.
The combination test/jcc is directly analogous to the PowerPC branch, which tests a single
bit and branches in one instruction.
This solution always needs 5 instructions for the compare, and 2 instructions for the condi-
tional jump. This makes a total of 7 instructions, one of which is a memory read.
Table 3.8 contains all five cases and the code needed for them. Just like in method 2, the
flags conversion is only necessary for the less frequent unsigned compares. This solution
needs 2 or 3 instructions for the compare, and two for the conditional jump. This makes a
total of 4 or 5 instructions. A memory read is needed in at most 50% of all cases.
the same effect. The previous solutions used a 256 byte conversion table in memory and
are therefore slow. A faster sequence of instructions is needed that copies the carry flag (bit
#0) into the sign flag (bit #7):
The first method tests whether bit 0 is set, and if it is, sets bit 7. But conditional jumps are
very expensive.
The second method shifts a copy of the value left by 7 and ORs the results together. This is
faster, as it does not need a conditional jump, and consists of just as many instructions.
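The shift-and-OR variant can be modeled in Python, operating on an 8-bit flags value with the i386 bit positions (a sketch of the effect, not the emitted i386 code):

```python
CF, SF = 0x01, 0x80  # i386 carry flag (bit 0) and sign flag (bit 7)

def carry_to_sign(flags):
    # copy the carry flag into the sign-flag position without a conditional
    # jump: a copy of the value shifted left by 7 is ORed back in; the mask
    # keeps only the bit that lands in the sign position
    return (flags | ((flags << 7) & SF)) & 0xff
```

An unsigned ”below” result (carry set) thus becomes evaluable with the signed jcc instructions, while values without the carry flag pass through unchanged.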
Melissa Mears [66] developed the following idea:
The lea shifts EAX left by one without modifying the carry bit — which had been set by
the cmp — and the rcr shifts AH right again, placing the carry flag in bit 7 and restoring
the position of the zero flag. The other 24 bits of EAX are unimportant here, allowing us
to destroy them with the ”lea”. This is most probably the optimal sequence of instructions
to do the job.
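The effect of this lea/rcr pair can be simulated in portable C. The following is a sketch; the function name is made up, and it assumes the flags byte is held in AH with the standard i386 layout (CF in bit 0, ZF in bit 6, SF in bit 7, as after ”lahf”):

```c
#include <assert.h>
#include <stdint.h>

/* Simulates "lea (%eax,%eax), %eax; rcr $1, %ah" on a flags byte held
 * in AH. Assumed layout (as after lahf): CF = bit 0, ZF = bit 6,
 * SF = bit 7. "carry" is the CF value left behind by the cmp. */
static uint8_t carry_into_sign(uint32_t eax, int carry)
{
    eax <<= 1;                        /* lea: doubles EAX, flags untouched */
    uint8_t ah = (uint8_t)(eax >> 8); /* the AH part of the shifted EAX */
    /* rcr $1, %ah: rotate right through carry, CF enters bit 7 */
    return (uint8_t)((carry << 7) | (ah >> 1));
}
```

With CF set (AH = 0x01, carry = 1) the carry ends up in bit 7; with ZF set (AH = 0x40, carry = 0) the zero flag ends up back in bit 6, as described above.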
This method could even map two condition registers to an i386 register if one of the first four i386 registers is chosen, as these can be divided (BL and BH in this example; the upper 16 bits cannot be accessed separately). However, the statistics show that compilers seem to use cr0 and cr7 about equally often, so neither of them is used frequently enough to be more important than the most frequent GPRs. As a consequence, it is more important to map a general purpose register instead of parts of CR to an i386 register.
with this assumption. But there are other instructions that make direct access to the bits of the condition registers possible. Using these instructions, it is possible to set more than one bit, or even clear all bits. If all three bits are set, this would mean that the result of the last comparison shows that the first operand was above, below and equal to the second operand at the same time. This makes little sense, but it is possible. One possible use might be to store an arbitrary 32 bit value in the 8 condition registers.
In practice, this is most probably extremely rare. Nevertheless, it would be no problem
to implement the perfectly compatible method (3.3.3.3) in parallel and allow switching
between the two methods either at compile time or at runtime.
3.3.4 Endianness
A PowerPC is typically big endian, and i386 CPUs can only do little endian, so the endianness problem has to be addressed somehow. In some cases, this problem is easy to solve: If the interpreter or the instruction decoder reads from memory, the endianness can easily be converted by a function that abstracts memory accesses. Accesses done by the recompiled code can be handled the same way, but a function call for every memory access is certainly a performance problem. There are three basic ways to handle the endianness problem efficiently:
3.3.4.1 Do Nothing
The simplest solution is certainly to do nothing. This is possible if there is no endianness problem, for example if both CPUs have the same byte order (or the target CPU can be switched to the byte order of the source CPU), if the source CPU does not have a byte order at all, or if the code in the source language is written in a way that no endianness problems can ever happen (only aligned machine word accesses). Otherwise, this solution is not compatible and code execution will likely be faulty.
But even if the source language is written to avoid endianness problems, the code segment
and the (predefined) data segment must be byte swapped when loading. This method is
mainly useful during development, to keep the translated code easy to read and to debug.
When reading a 32 bit value from memory, it must be byte swapped like this:

before 12 34 56 78
after  78 56 34 12
When writing a register to memory, it has to be byte swapped before, and, because the
”bswap” instruction overwrites the original value, it has to be byte swapped again after-
wards to restore the original value:
bswap %eax
mov %eax, 0x12345678
bswap %eax
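The effect of ”bswap” can be written down in portable C (the helper name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* C equivalent of the i386 "bswap" instruction: reverses the byte
 * order of a 32-bit value, e.g. 0x12345678 becomes 0x78563412. */
static uint32_t swap32(uint32_t x)
{
    return (x << 24) | ((x & 0x0000ff00u) << 8) |
           ((x & 0x00ff0000u) >> 8) | (x >> 24);
}
```

Applying it twice restores the original value, which is exactly why the second ”bswap” in the store sequence works.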
Alternatively, the register can be copied into a temporary register and swapped, so it does
not have to be restored afterwards:
16 bit values can easily be byte swapped using the ”xchg” instruction, which exchanges the
two bytes in the lower 16 bits of a register. When reading from memory, it looks like this:
Since only the lower four i386 registers (EAX, EBX, ECX, EDX) support splitting into 8 bit registers, only the second option is possible when storing one of the other registers into memory.
Some Read-Modify-Write operations (like memory increment), as they are supported by the i386, are not possible any more if memory is in a different endianness than the CPU. Instead of writing a single
incl 0x12345678
the value would have to be loaded, byte swapped, incremented, swapped back and stored. Fortunately, this is never necessary. Being a RISC architecture, the PowerPC has no Read-
Modify-Write instructions that operate on memory. PowerPC registers can be mapped to
i386 memory locations, though, so the i386 might still have to do Read-Modify-Write on
memory. But in this case, endianness is no problem: Neither register mapped nor memory
mapped registers have to be converted, as registers are internal to the CPU and are not even
endianness-aware.
The ”bswap” method to cope with the endianness problem might not produce very beautiful
code. Every memory access is accompanied by additional ”bswap” instructions, which also
slow down the code to some extent. The ”dynarec” group [25] measured a performance
penalty of about 5% in a dynamic recompiler that translates M68K to i386 code [65]. But
loads and stores are a lot rarer on RISC CPUs than they are on CISC systems, as they
are typically only necessary when modifying data structures in memory, so most of the
translated code is no different than code that ignores the endianness problem.
32 Bit Accesses As long as words in memory are only read and written word-wise, it does not matter in what format they are stored, so endianness can simply be ignored when doing 32 bit accesses:
This code stores the value 23 in memory and reads it again. Both the PowerPC and the i386 version read back the same value, although it is stored differently in memory.
8 Bit Accesses But this does not work when mixing 32 and 8 bit accesses:
This time, a 32 bit value is stored in memory, and an 8 bit value is read from the same
location. In big endian, this is the uppermost 8 bits, and in little endian, this is the lowermost
8 bits. So the PowerPC would read ”0” and the i386 would read ”23”. So all 8 and 16 bit
accesses and unaligned accesses must be converted. The following code corrects this:
The uppermost 8 bits of the 32 bit value are stored at the address + 3, so the i386 has
to read from a different address. Table 3.9 summarizes the differences of the byte offsets
when addressing bytes within an int.
If the i386 stores all 32 bit values in its native format, byte accesses have to be adjusted so that the same byte is read that would have been read by a PowerPC. The last two bits of every address have to be taken into account: a value of 0 becomes 3, 1 becomes 2 and so on, as illustrated in tables 3.10 and 3.11.
So the conversion is basically an XOR with 0b11. When doing byte accesses, all addresses
have to be XORed with 3. This can be done at compile time if the address is well-known
(modify the constant address) or at runtime (modify the address register, do the access, then
modify it back, or copy address register into a scratch register and modify it). The example
above would be translated into the following i386 code:
The recompiler cannot just XOR the displacement with 3, as done in the manual translation before, because in general it cannot find out whether r3/ESI is aligned in this case - adding 3 to a value is only the same as XORing it with 3 if the value is divisible by 4.
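The XOR-with-3 rule can be sketched in C (the names and the four-byte example are made up): guest words are kept in host byte order, so a byte load only has to flip the low address bits:

```c
#include <assert.h>
#include <stdint.h>

/* Guest words are stored in host (little-endian) order, so the word
 * value 23 (0x17) occupies the bytes 17 00 00 00 in host memory. */
static const uint8_t guest_mem[4] = { 23, 0, 0, 0 };

/* Emulated big-endian byte load: XOR the address with 3 to reach the
 * byte a PowerPC would see at this address. */
static uint8_t load_byte_be(const uint8_t *mem, uint32_t addr)
{
    return mem[addr ^ 3];
}
```

A PowerPC reading byte 0 of the word 23 sees 0, and byte 3 is 23 - which is exactly what the XOR reproduces on the i386 layout.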
16 Bit Accesses 16 bit accesses look the same, if they are aligned:
If a 16 bit value is read from a 32 bit boundary, the resulting 16 bits have to be read from the memory location + 2, and the two bytes have to be swapped afterwards. Again, this code is only possible because it is known that r3/ESI is aligned.
In general, it is more complicated, as table 3.12 illustrates. Again, if the address is a constant, the recompiler can emit code that adjusts the address and swaps the data. But if the address is a variable, the test has to be done at runtime.
Unaligned 32 Bit Accesses Unfortunately, 32 bit accesses can only be ignored if they are
aligned. The following code demonstrates this:
This code writes the values 23 and 17 into two adjacent 32 bit fields in memory and reads from the address of the first value + 1. This is the order of the bytes in memory on a PowerPC:
| 0| 0| 0|23| 0| 0| 0|17|
Reading from the start address + 1 reads the bytes 0, 0, 23 and 0, resulting in a value of
23*256. On the i386, memory looks like this:
|23| 0| 0| 0|17| 0| 0| 0|
.align 4
.globl main
main:
pushfl
label1:
popl %eax
or $0x00040000, %eax
pushl %eax
popfl
movl (label1), %eax // label1 is unaligned
ret
This code sets a flag in the i386 EFLAGS register to enable misalignment exceptions.
While Windows does not hand down these exceptions to the application, both Linux and
FreeBSD do, as SIGBUS signals. The example terminates with a ”bus error” if the ”or”
is enabled, and terminates cleanly if the ”or” is disabled. So it should be possible to trap
misaligned accesses, handle them, and return to the original code.
3.3.4.4 Conclusion
During development of the recompiler, as long as the possibility of endianness problems
can be excluded, ”do nothing” is possible - but the code segment still has to be byte swapped
either already by the loader, or by the instruction decoder. This method produces the most
readable code that is well-suited to debug the rest of the system.
While ”bswap” is easier to implement and more portable between host operating systems,
the ”swap memory” method most probably achieves a higher performance. The perfor-
mance penalty of the ”swap memory” method compared to the ”do nothing” method is
hard to estimate, but probably not zero. The difference in overall performance of an appli-
cation, if the ”swap memory” and the ”bswap” method are compared, will probably not be
worth the effort to implement the more complicated method. As the endianness strategy
is no fundamental design decision that has effects on the overall design, ”bswap” can be
implemented now, and ”swap memory” can be added as a compile time option later.
1. The dispatcher detects the type of instruction and jumps to the appropriate part of the
decoder.
2. The decoder extracts the operands from the instruction and translates the instruction
into one or more operations.
3. The i386 converter maps source registers to target registers and converts one opera-
tion into a sequence of i386 operations.
4. The instruction encoder emits the optimal i386 machine code encoding of an i386
operation.
The conversion of PowerPC code to i386 code in two steps between 2 and 4 is similar to a
conversion with intermediate code. In this case, there is no explicit intermediate code, but
all intermediate information is represented by the direct data flow between the steps.
3.3.5.1 Dispatcher
Every instruction of PowerPC binary code has a 6 bit ”primary opcode” field, and depend-
ing on the value of this field, it can optionally have another 9 or 10 bit ”extended opcode”
field. The function of the dispatcher is to decode the opcode field or fields and execute the
piece of code of the decoder that is responsible for the instruction that has been detected.
All information that has to be passed to the decoder is the 32 bit instruction code and the
current program counter.
For example, the PowerPC instruction ”or” has a primary opcode of 31 and an extended op-
code of 444. The dispatcher extracts the opcode fields, detects that it is the ”or” instruction
and executes the ”or” handler of the decoder.
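The field extraction the dispatcher performs can be sketched in C (the field positions follow the PowerPC instruction format; the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Extracts the 6 bit primary opcode (the topmost bits of the 32 bit
 * instruction word) and the 10 bit extended opcode. */
static unsigned primary_opcode(uint32_t insn)  { return insn >> 26; }
static unsigned extended_opcode(uint32_t insn) { return (insn >> 1) & 0x3ff; }

/* "or rA, rS, rB" is encoded as primary opcode 31, extended opcode 444.
 * Field layout: opcode(6) | rS(5) | rA(5) | rB(5) | XO(10) | Rc(1) */
static uint32_t encode_or(unsigned s, unsigned a, unsigned b, int rc)
{
    return (31u << 26) | (s << 21) | (a << 16) | (b << 11) | (444u << 1) | (unsigned)rc;
}
```

For any encoded ”or”, the dispatcher sees primary opcode 31 and extended opcode 444 and jumps to the ”or” handler of the decoder.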
3.3.5.2 Decoder
The function of the decoder is to convert the instruction into a more general form and hand
the information down to the i386 converter. It first extracts all operands, that is, registers, constants and flags, from the instruction code. Then it checks whether the instruction has a special combination of operands that has a special (simplified) meaning. Depending on
the result of this check, the instruction is decomposed into several operations, if applicable,
and these operations are then passed to the i386 converter. If the recompiler has to be very
fast, for example in pass 1, the check for simplified combinations can be omitted, which
will produce code of lower quality.
The instruction code of ”or” for example has three register operand fields, rA, rS and rB, and one flag field ”Rc”, which states whether condition register 0 is supposed to be
altered by the result. After these four fields have been extracted, two special cases are tested for: If all three register numbers are the same, this corresponds to the instruction ”or rX, rX, rX”, which is a no-operation instruction. But in case the Rc flag is set (”or. rX, rX, rX”), the instruction effectively tests whether the value of register rX is positive, negative
or zero and sets the condition register 0 accordingly. So the decoder has to check whether
Rc is set and either pass nothing (Rc=0) or the command ”set cr0 according to value of rX”
to the i386 converter.
If the two source registers fields are the same (”or rX, rY, rY”), the instruction is effectively
a ”register move” (”mr”) instruction. The information ”move register rY to rX” is passed
to the i386 converter. If Rc is set, ”set cr0 according to rY” is passed in addition.
If all three register numbers are different, it is a conventional or instruction, which will be
passed to the i386 converter. Again, if Rc is set, another command, ”set cr0 according to
result” will be sent.
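The three cases described above can be sketched like this (the names of the operations handed to the i386 converter are made up):

```c
#include <assert.h>

/* Simplified commands handed to the i386 converter (names are made up) */
enum op { OP_NOP, OP_SET_CR0, OP_MOVE, OP_OR };

/* Decides how "or rA, rS, rB" (with record flag rc) is decomposed.
 * When rc is set, OP_MOVE and OP_OR are followed by an OP_SET_CR0. */
static enum op decode_or(unsigned a, unsigned s, unsigned b, int rc)
{
    if (a == s && s == b)           /* or rX, rX, rX */
        return rc ? OP_SET_CR0 : OP_NOP;
    if (s == b)                     /* or rX, rY, rY  ==  mr rX, rY */
        return OP_MOVE;
    return OP_OR;                   /* conventional or */
}
```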
a3 78 56 34 12 mov %eax,0x12345678
instead of
89 05 78 56 34 12 mov %eax,0x12345678
The byte sequence 0xa3, 0x78, 0x56, 0x34, 0x12 will be written into the i386 code buffer.
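Such an encoder decision can be sketched in C (buffer handling and names are made up): the one-byte opcode 0xa3 exists only for EAX, while other registers need the generic 0x89 encoding with a ModR/M byte selecting a 32 bit displacement:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

enum { EAX = 0, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* Emits "mov %reg, addr" and returns the number of bytes written.
 * For EAX the short one-byte opcode 0xa3 is used; otherwise 0x89 plus
 * a ModR/M byte (mod=00, r/m=101 selects a plain disp32). */
static size_t emit_store(uint8_t *buf, int reg, uint32_t addr)
{
    size_t n = 0;
    if (reg == EAX)
        buf[n++] = 0xa3;
    else {
        buf[n++] = 0x89;
        buf[n++] = (uint8_t)(0x05 | (reg << 3));
    }
    buf[n++] = (uint8_t)addr;          /* little-endian displacement */
    buf[n++] = (uint8_t)(addr >> 8);
    buf[n++] = (uint8_t)(addr >> 16);
    buf[n++] = (uint8_t)(addr >> 24);
    return n;
}
```

Storing EAX emits the 5-byte form shown above; storing EBX emits the 6-byte form 89 1d 78 56 34 12.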
a time. All code following the code flow through jumps and calls, up to the next conditional jump, will definitely be executed. But it only makes sense to translate consecutive instructions in the code, because otherwise it would be difficult to reuse blocks.
It is also possible to translate more consecutive instructions than are guaranteed to be exe-
cuted. This way, the recompiled code will have to jump back to the recompiler less often,
but some code will be translated unnecessarily. And the larger the blocks are, the higher the probability that a new jump target will be found in the block, so that the block has to be split and retranslated.
In this project, basic blocks are very similar to the classic basic block. They start at any
point where concrete execution has to begin. They are ended by the first control flow
instruction. There is no analysis before that finds out possible jump targets within the basic
block, so a basic block is not necessarily an atomic unit of execution.
This sequence loads the address of the next PowerPC instruction into the first scratch reg-
ister and returns to the recompiler. The recompiler’s main program will then hand the fol-
lowing instruction (at address 0x1eec in this case) to the interpreter and continue execution
at the effective target address of the control flow instruction.
There is one exception to this behavior: As an optimization, blocks that end with a return
(”blr”) can include the translation of this instruction. In i386 code, this will look like this:
jmp *%ebp
This i386 instruction jumps to the address that EBP, which is the link register, is pointing
to. There is one problem with this, though: It must be made sure that the EBP register always
points to i386 code. When a function is called, the correct address of the i386 instruction
following the function call must be placed into the EBP register. This address can only be
available if the block following the function call instruction is already translated.
One solution would be to allow function call instructions in basic blocks, that is, to end
basic blocks with any control flow instruction other than ”bl”, but this has the problem that
at the time the i386 equivalent of the function call instruction has to be emitted, the first
block of the function that is supposed to be called might not be translated yet. This would
make the recompiler more complicated.
A simpler solution is to end basic blocks if a ”bl” is encountered and just always translate
the following block immediately, as a separate block. In the first run, the function call
instruction will be interpreted, but the second time, the three blocks will be linked.
An unconditional jump will be translated into a direct jump to the translated target block:
jmp block_next
A conditional jump will be translated into the corresponding sequence, as discussed earlier, plus a jump to the non-taken block:
In case of a ”bl”, the code has to load the address of the block that succeeds the call into
EBP (the link register) and jump to the first block of the function:
Every time a block is supposed to be executed or just has been executed (the block might already have been jumped to by another block, so it might never be started separately), the basic block logic must test whether the block can be linked to its successor or successors.
So if a certain sequence of code is executed a number of times, all code around it should be translated, and all of it should be linked. Unfortunately, there is a special case. In the
following example, the basic block on the bottom will not be linked to the following block
during all iterations of the loop:
li r0, 100
mtctr r0
li r1, 0
loop:
addi r1, r1, 1
bdnz loop
[...]
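For reference, the PowerPC snippet corresponds to this C loop, with ctr acting as the loop counter and r1 as the incremented variable:

```c
#include <assert.h>

/* C equivalent of:
 *   li r0, 100 / mtctr r0 / li r1, 0
 *   loop: addi r1, r1, 1 / bdnz loop */
static int count_loop(void)
{
    int r1 = 0;
    for (int ctr = 100; ctr != 0; ctr--)  /* bdnz: decrement ctr, branch if nonzero */
        r1 += 1;                          /* addi r1, r1, 1 */
    return r1;
}
```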
If this code is executed for the first time, the block after the loop has never been executed,
so it is not yet translated. So one of the targets of the ”bdnz” instruction is still not available
in i386 code even after many iterations of the loop. Therefore ”bdnz” cannot be translated
into native code, and it has to be interpreted in every iteration, accompanied by two mode
switches per iteration. One solution would be to do at least half of the linking already,
as one target is already known. But there is a simpler solution: Every time a block has
been translated that ends with a branch to a lower address, the succeeding block must
be translated as well. The next time the first block is executed, it will be linked with its
succeeding blocks. Branches to lower addresses are almost certainly loops.
The possible disadvantage is that there is a chance that the block after the loop may never
be reached, because some statement within the loop branches to a different location to exit
the loop, so the block might have been translated unnecessarily. But as so often, simplicity should be favored over tiny optimizations.
3.3.7 Environment
The recompiler engine needs to be supported by some additional components that provide
the emulation environment, such as the loader, the memory management, the disassembler
and the interface to the recompiled code (execution/context switches).
3.3.7.1 Loader
The recompiler requires a loader in order to get the executable file into memory. For the
recompiler, it does not matter what format the executable file has, so the loader can be
replaced with another one that supports a different executable format at any time. All it
has to do is move code (and data) into memory and pass the information about where it has been loaded to the actual recompiler.
The loader does not need to put the code from the file at the same position in memory
where it would be located if the application was run natively. Not all host environments
might make it possible to use arbitrary memory locations to store data, so the system is
more independent of the host environment if the code segment is loaded to memory that
has been conventionally allocated. If a host operating system supports reserving memory
at defined locations, this behavior can still be optimized.
Memory for the target code will also just be allocated; in this case, the address does not
matter at all. As the system will try to execute code in memory that has been allocated
for data, a problem might arise with newer CPUs and operating systems. Intel, AMD and
Transmeta either announced or already introduced i386 CPUs that support the ”NX” (”not
executable”) feature, which can disable the possibility to execute code in protected regions,
like the stack and the heap, if the operating system supports it. As soon as these CPUs are
common, additional code might have to be introduced to fix this.
Although some PowerPC registers are kept in i386 registers, there must still be an array in
memory that can contain all PowerPC registers. Memory mapped registers will always be
stored in this array, but there also needs to be space for register mapped registers: Between
basic blocks, when the system finds the next basic block to execute, all i386 registers have
to be written back to memory and registers to support the recompiler environment must be
loaded (context switch), so all PowerPC registers that have been mapped to i386 registers
will be saved in this array.
The link register also has a representation in memory for this purpose. As shown earlier, it
will always point to i386 code. Additionally, the PowerPC program counter will always be
stored in memory, although it is only updated every time unknown code is encountered.
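Such a register array could look like this in C. The exact layout is an implementation detail, not prescribed by the design; this struct is only a sketch:

```c
#include <assert.h>
#include <stdint.h>

/* All PowerPC registers the recompiler has to keep between basic
 * blocks. Register mapped registers are written back here on every
 * context switch; memory mapped registers live here permanently. */
struct ppc_context {
    uint32_t gpr[32];   /* r0-r31 */
    uint32_t cr;        /* cr0-cr7, 4 bits each */
    uint32_t ctr;       /* count register */
    void    *lr;        /* link register: always points to i386 code */
    uint32_t pc;        /* only updated when unknown code is reached */
};
```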
Stack space will just be allocated on the heap and the stack pointer will be set pointing
to the top of that memory area. User mode applications should not care about the actual
location of the stack. Operating systems usually also do not guarantee a certain location of
the stack, but most environments place it at some typical location.
3.3.7.3 Disassembler
A PowerPC disassembler is very useful for debugging purposes. If the recompiler is in verbose mode, it can print every instruction on the screen as it is working on it, and print additional information about the translation, to make it easier to find out during debugging which source instruction has produced which target instruction(s). The disassembler should print only one instruction at a time, so that it can be called from the recompiler loop.
No i386 disassembler is necessary for this project. The GCC toolchain comes with ”ob-
jdump” that can disassemble even raw files into i386 AT&T syntax. The easiest way to
disassemble the output is certainly to make the recompiler write the output code into a file
and call ”objdump”.
3.3.7.4 Execution
As described earlier, in order to execute a basic block, all register mapped registers have
to be loaded from the array in memory. But the i386 registers cannot just be overwritten; they must be saved first.
So executing a basic block looks like this:
Recompiled basic blocks end with a jump back into the recompiler. At this point, the
following has to be done:
While the recompiled code is executing, the recompiler has no influence on execution, as it is completely inactive. The recompiled code must actively return to the recompiler. So if the recompiled code is in an infinite loop, the recompiler has no way to interrupt it, because the recompiled code has effectively taken over the personality of the process. Fortunately, this is no problem: If the program were executed natively, it would be in the same infinite loop - unless the loop is caused by flawed recompilation, in which case the system is still being debugged anyway.
• 50% of all source register accesses will be translated into memory accesses
These problems are now to be addressed and solutions are to be provided. As some of
these optimizations assume that the code complies with certain conventions (like the stack
convention), they will most probably only work with compiled code.
• If the register usage statistics of a function are very different from global statistics,
another set of registers could be register mapped for this function.
• If a function uses n registers, it does not necessarily mean that it really needs n
registers. Compilers typically use more registers than necessary in order to keep
register dependencies in the CPU low.
This example is supposed to branch if r4 (ESI) has reached the value of zero. The PowerPC
behaves like this, but in the i386 code, the flags are overwritten between the ”dec” and the
”jz” by the ”xor” instruction. Modifying the condition codes is optional for every PowerPC
instruction, but this is not the case on the i386. In the example, the ”xor %ebx, %ebx”
could be replaced by ”mov $0, %ebx”, which does not alter the flags, but there is no general solution to this problem - there is no way to carry out an ”or” instruction without changing the flags5.
So the i386 flags have to be saved somewhere - but they can be saved in a register instead of memory. This will certainly only be reasonable if none of the six most frequent GPRs is used more frequently than the condition register, so that the condition code does not push out an important GPR. Dynamic register allocation can treat the condition register as an ordinary register and assign an i386 register to it if it is used frequently.
If the i386 uses call and ret, these sequences can be completely omitted. So in pass 2,
bl/blr is supposed to be translated into call/ret, and the impact on the stack is supposed to
be handled. In addition, the obsolete sequences have to be detected during translation so
that no code will be produced.
3. Retargetability: If intermediate code exists, the front end and the back end of the
compiler are less dependent on each other and can therefore be replaced more easily.
4. Metadata: Intermediate code can extend the source code with metadata, such as
liveness information.
These general advantages are now to be inspected in the context of this recompiler, in order
to evaluate if they also conform with the aims of this project.
Intermediate code is a representation whose properties are between those of the source and the target language. The more the source and the target language are alike, the less sense it makes to have an intermediate language. The ARM to i386 recompiler ARMphetamine [21] for example makes use of intermediate code, because despite being a RISC CPU, the ARM has very complex instructions; for example, any instruction can include a shift and conditional execution. ARMphetamine decomposes ARM instructions and encodes i386 instructions by matching patterns of the intermediate code [67].
The PowerPC is unlike the ARM CPU though, and more similar to the i386 than the ARM
is. PowerPC instructions can rarely be decomposed into anything other than the actual operation and the update of the condition register. Most PowerPC instructions can be directly
translated into one i386 instruction, or, if the PowerPC instruction has three different reg-
isters as operands, into two i386 instructions. Also, there are few i386 instructions that
could combine PowerPC instructions. The following example is an implementation of the
iterative calculation of the Fibonacci number sequence. On the left, there is the original
implementation in PowerPC assembly, and on the right, there is the optimal translation of
the PowerPC code into i386 code:
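The computation implemented by both listings is the iterative Fibonacci calculation; as a C sketch (the function name is made up) it looks roughly like this:

```c
#include <assert.h>
#include <stdint.h>

/* Iterative Fibonacci: two running values are added n times inside a
 * counted loop (bdnz on the PowerPC, a plain loop on the i386). */
static uint32_t fib(unsigned n)
{
    uint32_t a = 0, b = 1;
    while (n--) {
        uint32_t t = a + b;
        a = b;
        b = t;
    }
    return a;
}
```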
Both programs have the same number of instructions, every PowerPC instruction corre-
sponds to one i386 instruction. One thing is special about this example though: The bdnz
instruction (11) on the left is actually a composed instruction which decrements the count
register and does a conditional jump afterwards. The i386 has a very similar instruction,
”loop”. An instruction based recompiler would have to translate ”bdnz” into a decrement
and a conditional jump instruction, because ”loop” can only decrement the ECX register,
and the decision what register (or memory location) ctr gets mapped to has already been
made. This is the single line in this example that might have benefitted from intermediate
code, although it would have taken a lot of additional work to be able to map ctr to ECX.
There is one typical kind of code though that could be optimized using intermediate code:
The PowerPC cannot load a 32 bit constant into a register, because the whole instruction is
only 32 bits long, so it has to be loaded in two steps, like this:
lis r27,0x98ba
ori r27,r27,0xdcfe
lis r30,0x98ba
ori r27,r30,0xdcfe
Given that, in the second example, r30 is dead after the second instruction, these two instructions can be combined and translated into a single i386 instruction that loads the complete 32 bit constant.
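The arithmetic behind the combination: lis places its 16 bit immediate in the upper half of the register, and ori fills the lower half, so together they load one 32 bit constant (a sketch; the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* lis rD, imm   ==  rD = imm << 16
 * ori rD, rS, imm  ==  rD = rS | imm */
static uint32_t lis(uint16_t imm)              { return (uint32_t)imm << 16; }
static uint32_t ori(uint32_t rs, uint16_t imm) { return rs | imm; }
```

So ”lis r30,0x98ba; ori r27,r30,0xdcfe” can be folded into a single move of the constant 0x98badcfe into the register r27 is mapped to.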
Unfortunately, a lot more logic is necessary to do this in practice, as the two instructions
that belong together might not be subsequent:
lis r30,0x98ba
lis r10,0x1032
lis r2,0x6745
lis r4,0xefcd
ori r27,r30,0xdcfe
ori r23,r10,0x5476
ori r29,r2,0x2301
ori r28,r4,0xab89
A full data flow analysis would be necessary to optimize this code. All in all, code like this
is probably not frequent enough so that an optimization would have a noticeable impact on
the overall performance.
The second general advantage of intermediate code is that it makes it possible to work on code and optimize it in several passes. But this is mainly an advantage when translating high level languages, because a lot can typically be simplified and optimized in that case. Compiled code already has all redundant code eliminated.
Although it might be an advantage for many designs to be able to easily replace the front
ends and back ends with other CPU architectures, this is not necessary for this project, as
the whole idea is to design an interface optimized for RISC to CISC translation.
The fourth advantage mentioned above stated that intermediate code can be enriched with
information that is not available in the source code but has been reconstructed by an analyz-
ing pass. But this is just one more thing that can be done if you already have intermediate
code; it is no argument why intermediate code should be used. This information can just
as well be stored in additional data structures.
Intermediate code between the PowerPC and the i386 is not completely useless, but it
only makes sense if maximum performance has to be achieved, no matter how much more
complex the recompiler gets. The central idea of this project however is a fast recom-
piler combined with an optimizing recompiler, so intermediate code is not suited for this
application.
• liveness analysis and dynamic register allocation, including condition code optimiza-
tion
Unfortunately, the detection of atomic functions might not be possible if parts of the program are written in assembly instead of being compiled code, because external code might jump into the middle of functions, or functions might not be contiguous. Therefore pass 2 is only suited for compiled code.
As for all dynamic register allocation strategies, the size of the blocks is important for this method. If basic blocks are analyzed, this method makes no sense, because, as with the LRU method, all registers used have to be read from memory at the beginning of the block, and all registers written have to be written back to memory at the end. Compilers for high level languages use functions as blocks. It makes sense to apply the register allocation algorithm to functions in this project as well, because function calls are supposed to be converted anyway, so pass 2 has to do some analysis of the structure of the program with regard to functions.
So every function has its own mapping of source registers to destination registers. These different mappings have to be synchronized between functions, which means that all important source registers have to be passed in the register array in memory rather than in target registers.
The registers in question are r0, r2 to r31, ctr and cr0 to cr7. Compares and instructions
with the ”.” suffix write into condition registers, and conditional branch instructions read
them.
As the i386 ESP register cannot practically be used for anything other than pointing to a stack, the PowerPC register r1 will be mapped to ESP even when doing dynamic register
allocation, so no information will be gathered on r1. The eight (four bit) condition registers
cr0 to cr7 are treated like all other 32 bit registers.
This analysis can be done during the run over the code that finds all instructions that belong to the function, so no separate run is necessary. In order to gather this information, all instructions have to be dispatched and decoded.
The only parameter of this function is r3. If r3 is 0 or 1, the same value will be returned,
else r3 + (r3 - 1) will be returned. In this example, there is no way to find out the original
meaning of the return value. r3 or r4 or both could contain return values. If instruction
4 is omitted, r3 does not get written any more - but r3 can still be a return value, as the
following example shows:
1 blr
This function does nothing - but it does preserve all registers; all registers could be
parameters, and all registers could be used for return values. Or the function has no parameters
or return values at all. The following C function, for example, will be translated into a single ”blr”:
int test(int a) {
return a;
}
There are two ideas about what to write back to the register array in memory: Either write
back all registers that are parameters of or modified within the function, or filter this set using
the PowerPC calling conventions, in which only r3 and r4 can contain return values. The latter
might not be compatible with hand optimized assembly code that uses more registers than
r3 and r4 to return data to the caller. Whether all or only some registers are written back is
not a basic design decision but an implementation detail, which can be turned on and off at
compile or run time.
So all register mapped source registers that are parameters have to be read into target reg-
isters, and all register mapped source registers that are used for return values have to be
written back into the register array in memory.
3.3.9.6 Retranslation
With the dynamic register allocation and the signature of the function known, the function
can be translated. At the very beginning, code has to be emitted that reads all register
mapped inputs into i386 registers. If r3 is mapped to ECX, for example, this amounts to a
single ”mov” instruction that loads r3 from the register array in memory into ECX.
The pass 1 instruction encoder can be used for this. For the translation of the function
body, the complete pass 1 recompiler can be used, as most of the work that has to be done
is identical. There are just two exceptions: The i386 converter has to use the dynamic reg-
ister mapping that has been gathered for this function, instead of the static pass 1 mapping.
And there are some differences for certain instructions. Every time a ”bl” instruction is
encountered, the i386 ”call” instruction can now be used, but all register mapped registers
have to be written back to memory before the call, and read back from memory afterwards.
Actually, not all registers have to be saved, but only those that are actually used by the
called function. But as most functions will use all of the few i386 registers, and a recursive
algorithm would be needed to find out which registers are used by a function and its
subfunctions, it is probably not worth the effort.
• 2 pass translation: The code gets translated twice. In the first pass, no code gets emit-
ted, but the mapping table is created. In the second pass, the mapping is completely
known and all code can be emitted.
• backpatching: The code gets translated once. All jump target addresses that are not
yet known will be left blank and for all of these, the address of the instruction as well
as the source address that could not be mapped will be noted. When translation is
complete, with the help of the notes, the blank addresses will be filled.
The backpatching method is considerably faster, and its implementation is only marginally
more complex, therefore it makes sense to favor this method.
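The backpatching method described above can be sketched as follows. This is a schematic illustration with hypothetical names, not the project's actual data structures: each note remembers where a blank 32 bit target field was emitted and which source address it stands for; once translation is complete, a source-to-target address mapping resolves the blanks.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* hypothetical note taken for every jump whose target was still unknown */
struct patch {
    uint8_t  *fixup;   /* address of the 4 blank bytes in the emitted code */
    uint32_t  src;     /* PowerPC address the jump refers to */
};

/* hypothetical entry of the source->target address mapping built during
   translation */
struct mapping {
    uint32_t src;      /* PowerPC address */
    uint32_t target;   /* address of the translated code */
};

/* when translation is complete, fill in all blank target fields */
void backpatch(struct patch *notes, size_t n,
               const struct mapping *map, size_t m) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            if (map[j].src == notes[i].src) {
                memcpy(notes[i].fixup, &map[j].target, 4);
                break;
            }
}
```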
In the case of unsigned values, the register move can be integrated into the ”lea” instruction:
cmp $5, %esi
lahf
lea (%eax,%eax), %ecx
rcr $1, %cx
[...]
test $0xc000, %ecx
jnz label1
nop
With the condition codes mapped this way, branches should not be significantly slower on
the i386 than in native code.
• store the return address in the ”old stack pointer” field on stack
The first solution would detect all instructions in the function that access values in the
caller’s stack frame and add 4 to the displacement. It is not trivial to find out which stack
accesses target the caller’s stack frame: These are accesses whose displacement is larger
than the size of the current stack frame, so the size of the current stack frame has to be
found out.
The second solution does not need to adjust stack accesses. Instead, the return address is
not placed on the stack in addition to the existing data, but overwrites the topmost value, by adding
4 to the stack pointer before the call, and subtracting 4 afterwards. The value that gets
overwritten is the saved old stack pointer, as shown in figure 3.9.
The previous stack pointer is rarely used: Most code does not destroy the stack pointer by
reading the old value from the stack, as shown here:
lwz r1,0(r1)
(Figure 3.9: PowerPC stack frame layout)
+---------------+
8|saved registers|
...
7|saved registers|
6|LOCAL DATA |
...
5|LOCAL DATA |
4|PARAMETERS |
...
3|PARAMETERS |
2|space for LR |
1|previous SP |<- SP
+---------------+
but by adding the size of the stack frame to the stack pointer, which saves a memory access:
addi r1, n
Some applications still use the first method, in which case this instruction has to be replaced
with an ”add” during translation. For this, the size of the stack frame has to be known, just
like above.
All in all, the complexity of both methods is about the same. The first method looks cleaner
and is faster, the second one looks more compatible. For the sake of code speed, the first
method will be implemented.
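The adjustment performed by the first method is simple once the frame size is known; a minimal sketch, with a hypothetical helper name not taken from the implementation:

```c
#include <stdint.h>

/* Accesses whose displacement reaches at or beyond the current frame
   target the caller's frame, which sits 4 bytes higher on the i386
   because of the pushed return address; local accesses stay unchanged. */
int32_t adjust_stack_disp(int32_t disp, int32_t frame_size) {
    return (disp >= frame_size) ? disp + 4 : disp;
}
```

With the 0x50 byte frame seen later in the fib() example, an access at displacement 0x58 (the caller's link register slot) is moved to 0x5c, while an access at 8 stays untouched.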
When translating this code into i386 code, the instructions 1, 3, 6 and 7 need not be trans-
lated, as there is no link register, so nothing has to be saved or restored. It is easy to omit
the ”mfspr” and ”mtspr” instructions as they are easy to detect. It is not as easy to detect
the instructions 3 and 6 in this example. The link register will be placed at the position
”stack pointer + 8” - but only if the stack frame has not been created yet. Otherwise - which
is often the case - the address will be ”stack pointer + stack frame size + 8”. The same
problem exists when restoring the link register: The stack frame could either still exist or
already be destroyed. If the stack frame still exists, for example, the address ”stack pointer
+ 8” does not point to the saved link register.
So when translating a function, it must be known at all times whether the stack frame
exists or not. In this example, instruction 3 will be detected as the lr save instruction,
because there has been no ”stwu r1, -n(r1)” before. And because the stack frame has
already been destroyed in instruction 5 (”lwz r1,0(r1)” - could also have been ”addi r1, n”),
instruction 6 will be detected as the instruction that loads the link register. The tracking
of the existence of the stack must be done during the actual translation, i.e. the instruction
recompiler must provide this information.
There is one problem with this: A function may have multiple exits and therefore multiple
sequences that destroy the stack frame. After the first ”blr”, the stack frame is considered
nonexistent, even though it probably still exists. In this case, later restore sequences will not be
properly recognized, but this is no major problem, as it merely leads to unnecessary code.
Because the instructions 1, 3, 6 and 7 will never be translated, no use/def information
should be gathered from these instructions. Otherwise some registers would be unneces-
sarily marked alive and the dynamic register allocation would be less optimal.
Chapter 4

Some Implementation Details
In the context of this project, a recompiler, called ”ppc to i386”, has been implemented
that is based on the design described in the previous chapter. The implementation has been
targeted for inclusion as the recompiler engine in the SoftPear [28] project. It runs on
GNU/Linux and FreeBSD. The current status is that not all instructions have been imple-
mented, so only specific test programs will run instead of real world applications. But the
infrastructure is basically complete and well-tested, so completing the recompiler should
only be a matter of implementing the remaining opcodes.
114 CHAPTER 4. SOME IMPLEMENTATION DETAILS
speed of the produced code) is certainly lower than it theoretically could be, as described
earlier. As recompilation is a very low-level topic, the design must take the
implementation into account and stay very close to it. Many concepts and implementation details
have therefore already been described along with the design. Other details are self-evident
and need no detailed description - for example, nothing can be said about the pass 1 recom-
piler loop other than what has already been said. This chapter will therefore only describe
the interesting aspects of implementation as well as those parts that slightly differ from the
design.
4.2 Loader
Because of the SoftPear project [28], which is supposed to do user level emulation of
PowerPC Darwin/Mac OS X on i386 Darwin/Linux/BSD, it was chosen to implement a
loader for Darwin Mach-O executables. This loader has been completely implemented by
Axel Auweter and Tobias Bratfisch, who also work on the SoftPear project. While the
implementation of a loader is not trivial, as it has to handle the quite complex Mach-O
executable file format, its purpose and its interface to the recompiler is simple: All sections
get loaded into memory, the entry point address gets read, and all this data is passed to the
main function. The exact implementation of the loader is irrelevant in the scope of this
project—and not very exciting either.
4.3 Disassembler
The first thing that was done as soon as the loader was complete was to write a
disassembler for PowerPC binary code. Its purpose is to be used by the translation function
to print the disassembly of the current instruction on the screen. Before emulator code ex-
isted, the code base could be used as a disassembler; later the interface of the disassembler
was changed to always decode and print one instruction at a time. Although the PowerPC
instruction encoding is very orthogonal, the disassembler does not use tables of strings, but
uses actual code to do the work. It consists of a dispatcher for the opcode and a lot of small
functions that decode and print one specific instruction. This is necessary because it is the
easiest and most flexible way to handle simplified instructions: In case of certain operand
combinations, an instruction can have a special meaning, and a different mnemonic should
be printed. So the handler for each instruction decodes the parameters and prints the cor-
responding disassembly. The dispatching and decoding system is identical with that of the
instruction recompiler, so it will be described there.
4.4 Interpreter
The specified system has no interpretive pass, instead it always recompiles all code. Yet
there exists an interpreter in the implementation, for several purposes:
4.5. BASIC BLOCK CACHE 115
• Before the implementation of the actual recompiler was started, an interpreter for the
most important instructions has been developed. It was used to test the environment
(loader, memory etc.) and to study the behavior of PowerPC code in practice, and was
invaluable for understanding certain PowerPC characteristics, such as stack handling.
• The interpreter can be easily used for speed comparisons. Although it is implemented
in C, it is quite efficient and should not be significantly slower than an optimized im-
plementation in assembly language. Nevertheless, its use is restricted to very simple
programs.
• The system still needs an interpreter for all control flow instructions other than ”blr”.
These are initially not translated and must therefore be interpreted. Calculated jumps
(”bctr”) will currently never be recompiled, so all programs that include jump tables
depend on the interpreter.
The implementation is simple. Like the disassembler, it shares the dispatcher and the de-
coder with the recompiler. The specific function for one instruction implements the behav-
ior of the instruction without simplification, working on the array of registers in memory.
Even though the interpreter is mostly only used for a few instructions, the complete in-
terpreter exists in the recompiler code, as interpretation, that is, dispatching, will not be
slowed down by the existence of additional instruction implementations.
• The block has at least one successor block (field of the node).
• The first target has been recompiled (the node has to be found in the linked list).
• If there is a second target: The second target has been recompiled (as above).
If all these conditions are met, a small linker function gets called, in which a simple C
switch construct does the dispatching, decoding and encoding. The i386 instructions will
be written at the address of the end of the basic block as found in the basic block node,
overwriting the old jump back to the recompiler loop.
4.7.1 Dispatcher
The dispatcher (recompile()) extracts the upper 6 bits of the instruction code and jumps to
the specific handlers using a jump table. Some of these handlers just print error messages
for undefined opcodes, some are functions of the decoder, and the handlers for the opcodes
19 and 31 are again functions that jump to more handlers using a jump table, as the primary
opcodes 19 and 31 have an extended opcode field. So there are three jump tables, one for
the primary opcode, one for the extended opcode if the primary opcode was 19, and one
for the extended opcode if the primary opcode was 31. These tables occupy roughly 2 KB,
and should not be a problem for the CPU cache. For optimization purposes, no explicit
parameters are passed to the decoder functions, as all data that is relevant to the decoder is
in global variables.
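The dispatch step can be sketched as follows. This is a simplified, self-contained model with hypothetical names; the real tables also route the primary opcodes 19 and 31 to their second-level tables, and the handlers are the rec_ decoder functions rather than a counter:

```c
#include <stdint.h>

static uint32_t c;           /* the global instruction word, as in the text */
static int last_opcode = -1; /* demo only: records which handler ran */

static void rec_default(void) { last_opcode = (int)(c >> 26); }

typedef void (*handler_t)(void);
static handler_t primary[64];

static void init_dispatch(void) {
    for (int i = 0; i < 64; i++)
        primary[i] = rec_default;  /* real code: rec_addi, rec_lwz, ... */
}

void recompile(uint32_t insn) {
    c = insn;
    primary[c >> 26]();  /* the upper 6 bits select the handler */
}
```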
4.7.2 Decoder
The decoder consists of one function for every PowerPC instruction, whose name is ”rec_”,
followed by the name of the PowerPC instruction. Decoding has been implemented
in a readable and efficient way by using preprocessor macros. The PowerPC architecture
defines names for certain bit ranges in the instruction code, such as rA for the first register
operand and UIMM for an unsigned immediate value. The macro to extract a bit range
from a 32 bit value looks like this:
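The bits() macro itself is not reproduced in this excerpt. Given the IBM bit numbering used by the field macros below (bit 0 is the most significant bit), it presumably looks like this - a reconstruction, not the original source:

```c
/* Reconstructed (hypothetical): extract bits a..b from the 32 bit
   instruction word c, where bit 0 is the most significant bit, as in
   the PowerPC architecture manuals. Valid for ranges narrower than
   the full word. */
#define bits(c, a, b) (((c) >> (31 - (b))) & ((1u << ((b) - (a) + 1)) - 1))
```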
There is one macro for every named bit range in the instruction code. These look like this:
4.7. INSTRUCTION RECOMPILER 117
#define rA bits(c,11,15)
#define rB bits(c,16,20)
#define UIMM bits(c,16,31)
#define SIMM ((signed short)bits(c,16,31))
They always pass ”c”, a global variable that contains the instruction code, to the bits()
macro. It is now easy to access any field in the instruction code just by using its
name, like this (taken from the implementation of ”or”):
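The fragment itself is not reproduced in this excerpt. The following self-contained sketch shows the check it performs, assuming the field macros above plus an rS macro for bits 6 to 10; classify_or() is a hypothetical stand-in that returns a string, where the real handler would call the i386 converter instead:

```c
#include <stdint.h>

#define bits(c, a, b) (((c) >> (31 - (b))) & ((1u << ((b) - (a) + 1)) - 1))
#define rS bits(c, 6, 10)
#define rA bits(c, 11, 15)
#define rB bits(c, 16, 20)
#define Rc bits(c, 31, 31)

/* hypothetical stand-in: the real rec_or() calls the i386 converter */
const char *classify_or(uint32_t c) {
    if (rS == rA && rS == rB)            /* "or rX,rX,rX" */
        return Rc ? "register test" : "no-op";
    return Rc ? "or." : "or";
}
```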
The compiler will replace rA, rS, rB and Rc with the specific macro bodies and make sure
that only those fields are extracted that are actually used. Apart from extracting, the decoder has to
detect special operand combinations and, depending on the result, to call functions of the
i386 converter. Detecting special combinations of operands is easy, as can be seen in the
example above. This sequence checks whether all three operands of the or operation are
the same. If they are, the Rc bit is tested. If it is cleared, the instruction is effectively a
no-operation, and no code will be emitted. Otherwise, the i386 converter will be called to
produce code for a register test of the given register operand.
• a vreg (virtual register) is an integer pointer (int*). It can either have a value from 0 to
7, representing an i386 register, or be a pointer into memory, representing a memory
location that contains the value of a memory mapped register.
• a treg (target register) is an integer that has a value from 0 to 7, representing an i386
register. The numbering is the native numbering inside the CPU.
All register numbers extracted from the instruction code are sregs. They are converted into
vregs when the decoder calls the i386 converter. vregs in turn are either converted into
integer pointers or tregs between the i386 converter and the instruction encoder. So the
register conversion actually takes place between the layers, so that the decoder can always
work with sregs, the i386 converter with vregs, and the instruction encoder with tregs. This
makes code easier to read, as the conversion is done quite transparently by making macros
convert the registers on the function call:
v_put_test_register(VREG(rA));
This code taken from a decoder calls a function of the i386 converter. The register rA is con-
verted from an sreg into a vreg by the macro VREG(), which effectively converts the Pow-
erPC register number into an i386 register number or a memory location, according to the
static or dynamic mapping table, and returns it as a vreg. The function v put test register()
takes a vreg as an argument, as the signature shows:
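The signature itself is not shown in this excerpt; following the pattern of the v put move register() function below, it presumably reads:

```c
inline void v_put_test_register(vreg r);
```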
Conversion from vregs to tregs or memory locations is easier. The vreg already contains
all information. If a vreg contains an i386 register number and is therefore in the range of
0 to 7, it can just be cast into a treg. If it is not, it can be cast into an integer pointer
(int*). The test whether a vreg is actually a treg, that is, whether the register is register
mapped, is done by the macro VREG IS TREG(), which checks whether the value is below 8
and returns either true or false.
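A plausible reconstruction of that macro (hypothetical - the original source is not shown here): i386 registers are numbered 0 to 7, so any vreg value below 8 denotes a register, while anything else is taken to be a pointer into the register array.

```c
#include <stdint.h>

typedef int *vreg;  /* register number 0..7, or pointer into the array */
typedef int  treg;  /* native i386 register number 0..7 */

/* reconstructed: a vreg is a treg if its value is below 8 */
#define VREG_IS_TREG(v) ((uintptr_t)(v) < 8)
```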
inline void v_put_move_register(vreg r1, vreg r2, treg temp, int flags) {
if (VREG_IS_TREG(r1) && VREG_IS_TREG(r2)) {
put_move_register((treg)r1, (treg)r2);
if (flags) put_do_flags((treg)r1);
} else if (VREG_IS_TREG(r1) && !VREG_IS_TREG(r2)) {
put_store((treg)r1, r2);
if (flags) put_do_flags((treg)r1);
} else if (!VREG_IS_TREG(r1) && VREG_IS_TREG(r2)) {
put_load(r1, (treg)r2);
if (flags) put_do_flags((treg)r2);
} else {
put_load(r1, temp);
put_store(temp, r2);
if (flags) put_do_flags(temp);
}
}
This example has two register parameters and must therefore distinguish between four cases.
In case both registers are register mapped, they are cast into tregs and the function
put move register() of the instruction encoder is called, which will emit a single i386 register
move instruction. In case the first register, which is the source register of the move, is
register mapped, but the second one is not, this corresponds to a memory store, so put store()
will be called. If the first operand is memory mapped and the second one is register mapped,
it is a memory load. In case both registers are memory mapped, the first one will be read into
a temporary register, and the temporary register will then be stored into the second register
in memory. This temporary register is a parameter passed by the decoder. If the decoder
translates a complex sequence which uses both temporary registers, it can make sure that
the parts of the i386 converter it calls do not overwrite each other’s temporary registers.
The boolean parameter ”flags” has also been passed by the decoder. If it is true, the i386
converter has to make sure the emulated condition codes are updated, and therefore calls
the instruction encoder function put do flags(). This whole function is very similar to many
other functions in the i386 converter, but unfortunately, it cannot be generalized, because
the i386 is very complex and there are many exceptions. This could be ignored in many
cases, but the code will be a lot better if it is not, and translation will not be slower.
Nevertheless, some functions help reduce duplicated code: It is often necessary to have the
value of a PowerPC register in an i386 register, no matter if it is memory or register mapped,
for example, in order to use it as an index in an indexed addressing mode. The function
v get register() takes a vreg and an i386 scratch register as arguments, and always
returns a treg: If the PowerPC register is register mapped, it will return the vreg from the
parameter, casted to a treg, and if it is not, the value will be read into the scratch register
and the scratch register will be returned. Functions that write into a register can use the
function v find register() which again returns either the given vreg or the scratch register.
Afterwards, the function v do writeback() writes the value into memory, if the PowerPC
register is memory mapped, or just does nothing, as the code already has written the value
into the correct i386 register.
The following code, taken from the implementation of the 3 register addition, illustrates
the usage of v find register() and v do writeback():
vreg r3_temp = v_find_register(r3, TREG_TEMP1);
v_put_move_register(r1, r3_temp, TREG_NONE, NO_FLAGS);
v_put_add_register(r2, r3_temp, TREG_NONE);
v_do_writeback(r3, r3_temp);
r1 and r2 are supposed to be added, and the result is supposed to be written into r3. After
the first instruction, r3 temp will either be assigned to the index of the i386 register that r3
is mapped to, or to the index of the first scratch register (EAX), if r3 is memory mapped.
Then r1 gets copied into r3 temp and r2 gets added to r3 temp. Both calls start with v put ,
so they produce valid code no matter whether r1 and r2 are register or memory mapped.
The fourth instruction will then write EAX back into memory, if r3 is memory mapped, or
just do nothing if r3 is register mapped and the last two operations already wrote into the
correct i386 register.
Because only a small subset of the i386 addressing mode capabilities is used when translating
from PowerPC code, the instruction encoder has a manageable number of functions. The
instruction encoder detects special cases of an i386 instruction that can be encoded more
efficiently: If an addition of a register with an immediate value of 1 is supposed to be
encoded, the instruction encoder’s function to increment a register is called instead. If
the encoding is not to be routed to another function, the optimal encoding of the concrete
i386 instruction will be emitted. For example, there are some instructions that have shorter
(redundant) encodings if EAX/AX/AL is used as an operand.
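This special-casing can be illustrated with a small, self-contained sketch. The function and buffer names are hypothetical; the opcode bytes are real i386 encodings (one-byte inc/dec, the sign-extended 8 bit immediate form 0x83, and the 32 bit immediate form 0x81):

```c
#include <stdint.h>
#include <string.h>

static uint8_t code[32];
static int emitted;

static void emit(uint8_t b) { code[emitted++] = b; }

/* add $imm, %reg -- reg is a native i386 register number 0..7 */
void put_add_immediate(int reg, int32_t imm) {
    if (imm == 1) {
        emit(0x40 + reg);                 /* inc %reg: a single byte */
    } else if (imm == -1) {
        emit(0x48 + reg);                 /* dec %reg */
    } else if (imm >= -128 && imm <= 127) {
        emit(0x83); emit(0xC0 + reg);     /* add with sign-extended imm8 */
        emit((uint8_t)imm);
    } else {
        emit(0x81); emit(0xC0 + reg);     /* add with full imm32 */
        memcpy(code + emitted, &imm, 4);
        emitted += 4;
    }
}
```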
All functions of the instruction recompiler are marked as ”inline”, to make sure that no
function will ever be called; instead, every function of the decoder will contain all the code
necessary to completely translate a PowerPC instruction. The C compiler can do many
optimizations: As there are no function calls, there will be no parameter passing; and
some calculations like treg tests can be combined into one step. Although every step of
the recompiler can have many ”if” instructions, some of these can be omitted by the C
compiler. The following (admittedly theoretical) example illustrates this: If the decoder calls
the i386 converter, which is supposed to translate an addition of the constant value 1 to
the register r3, which is constant as well, there will be no check whether the PowerPC register is register
or memory mapped: The C compiler has already omitted this check, as the register r3 is
constant. The instruction encoder would then test whether the immediate is 1, but this is
known at compile time as well. So the C compiler would effectively only produce code to
emit ”inc %eax”.
4.7.7 Execution
The function that is supposed to pass control to recompiled code and that provides a reentry
point for the recompiled code into the recompiler loop is a bit tricky. If it is not completely
correct, it will just not work at all. This function has to save all i386 registers and load the
mapped ones. On reentry, it has to do the reverse. The i386 registers are saved using the
pusha instruction, then ESP is stored in a variable in memory. After the mapped registers
have been loaded from the register array in memory, this function jumps into the recompiled
code, at the address that was passed as a parameter. When the recompiled code wants to
jump back, it jumps to the instruction in this function that is just after the jump into the
recompiled code. Now the EAX register, which holds the address where the recompiled
code stopped execution, will be written into a variable, all mapped registers will be stored
into the register array, the stack pointer will be restored, and the remaining i386 registers
will be read from the stack. The function then returns to the caller.
As it has been said earlier, the main purpose of the implementation is the verification of the
design. The implementation shows how well the design works in practice and how good the
produced code is. It is not the purpose of the implementation to provide a platform for mea-
surements of the translation speed, because the two objectives ”verification” and ”speed”
are somewhat mutually exclusive, unless there is a lot of time for the implementation.
It might be argued that it makes little sense to write a recompiler whose new ideas target
high speed without optimizing the implementation for speed. But this argument ignores the
fact that there are two kinds of speed: the speed of translation, and the speed of the translated code.
The maximum speed of the pass 1 translation has been theoretically calculated during the
design phase; an optimized implementation in assembly language should basically reach
this speed. The speed of the pass 2 recompiler does not really matter: The idea is to run
it only on code that is heavily used, so that even complex pass 2 algorithms will amortize
quickly.
The performance of the recompiled code is more interesting. This speed can either be
measured, or it can be evaluated by inspecting the quality of the code. This chapter will
therefore concentrate on the quality of the code produced by the pass 1 and 2 algorithms,
as well as on speed measurements of the translated code in pass 1.
5.1 Pass 1
On the PowerPC, all programs in this chapter have been compiled with ”gcc (GCC) 3.3
20030304 (Apple Computer, Inc. build 1495)” on Mac OS X 10.3. On the i386, the test
programs have been compiled with ”gcc (GCC) 3.3 20030226 (prerelease) (SuSE Linux)”
on SuSE Linux 8.2.
All programs in this chapter have been compiled with -O3 on both platforms. The PowerPC
disassemblies in this chapter have been done with Mac OS X ”otool -tv” . The recompiled
machine code gets written into a headerless file by the recompiler, which can be disassem-
bled in Linux using ”objdump -m i386 -b binary -D”.
The test program has been written in C and only contains a small, artificial function that
takes many minutes to execute. This way, the recompiler does not have to be able to
translate many PowerPC instructions, and the measurement will not be influenced by the
124 CHAPTER 5. EFFECTIVE CODE QUALITY AND CODE SPEED
recompilation speed too much. The test program is a recursive implementation of the
Fibonacci function. The C code looks like this:
int main() {
return fib(50);
}
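The fib() function itself is not reproduced in this excerpt; judging from the PowerPC disassembly below (cmplwi r3,1 followed by ble, then two recursive bl calls), it presumably reads like the following reconstruction:

```c
/* reconstructed from the disassembly, not the original source */
__attribute__((noinline))
int fib(int n)
{
    if (n <= 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}
```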
The ”noinline” attribute makes sure that the first recursion of fib() does not get embedded
into main(), which leads to more readable code. The complete PowerPC version looks like
this:
_fib:
00001eb0 cmplwi r3,0x1
00001eb4 mfspr r2,lr
00001eb8 stmw r29,0xfff4(r1)
00001ebc stw r2,0x8(r1)
00001ec0 or r30,r3,r3
00001ec4 stwu r1,0xffb0(r1)
00001ec8 ble+ 0x1ee4
00001ecc addi r3,r3,0xffff
00001ed0 bl 0x1eb0
00001ed4 or r29,r3,r3
00001ed8 addi r3,r30,0xfffe
00001edc bl 0x1eb0
00001ee0 add r3,r29,r3
00001ee4 lwz r4,0x58(r1)
00001ee8 addi r1,r1,0x50
00001eec lmw r29,0xfff4(r1)
00001ef0 mtspr lr,r4
00001ef4 blr
_main:
00001ef8 li r3,0x32
00001efc b 0x1eb0
is visible in all absolute addresses, in this case all addresses loaded into the emulated link
register.
00001ef8 li r3,0x32
1b: be 32 00 00 00 mov $0x32,%esi
00001efc b 0x1eb0
20: b8 fc 1e 00 00 mov $0x1efc,%eax
25: ff 25 ec d5 05 08 jmp *0x805d5ec
This is the main() function. As it only gets executed once, it never gets linked to fib().
Instead, the code loads the address of the instruction that has to be interpreted into EAX
and returns to the interpreter.
The first five instructions of the i386 version of this basic block implement the comparison.
As it is an unsigned comparison, the i386 flags have to be modified so that signed condi-
tional jump instructions will work. The instruction at address 0x44 stores the flags in the
memory address that corresponds to cr0. Instruction 0x4a copies the link register (EBP)
into r2 (ECX). It will be stored into the stack in instruction 0x6a. The code from 0x4d to
0x66 implements the stmw instruction, which stores r29 to r31 onto the stack and must be
unrolled on the i386. But this should only be longer, not slower. Unfortunately, r29 to r31
are memory mapped, so every register has to be read from memory and then stored onto
the stack.
Instruction 0x6e corresponds to a register move. As r30 is memory mapped, EAX, which is
r3, will be stored in memory. The ”stwu” instruction must also be done in two steps on the
i386. The conditional branch is expanded into a ”testb” and two branches. It can be seen
that this block has been linked to the two succeeding blocks, which have been translated to
0x96 and 0xcf.
The following PowerPC basic block only consists of two instructions. The first one, the
addition of -1, is translated into a very effective ”dec” instruction. The ”bl” (function call)
is translated into a sequence that moves the address of the subsequent block into EBP and
jumps to the function. The successor block is located at 0xeb.
00001ed4 or r29,r3,r3
eb: 89 35 b4 d5 05 08 mov %esi,0x805d5b4
00001ed8 addi r3,r30,0xfffe
f1: 8b 35 b8 d5 05 08 mov 0x805d5b8,%esi
f7: 83 c6 fe add $0xfffffffe,%esi
00001edc bl 0x1eb0
fa: bd 15 01 16 40 mov $0x40160115,%ebp
ff: e9 37 ff ff ff jmp 0x3b
The first line copies r3 (ESI) into the memory mapped register r29. 0x1ed8 subtracts 2
from r30 and stores the result in r3. The i386 code loads the memory mapped register into
ESI and subtracts 2 from it. Note that an immediate ”add” instruction has been emitted
instead of a ”sub $2”, because the latter would be neither smaller nor faster, and would just
make the recompiler slower. The value of ”-2” is stored as an 8 bit operand that will be sign extended
to 32 bit by the CPU at runtime. Instructions 0xfa and 0xff encode a ”bl” instruction again.
The rest of the function is another basic block. The addition of a memory mapped register
to the register mapped r3 can be done by an ”add” instruction that has a memory address
as an operand. 0x1ee4 reads from the stack into a register, and is translated into the corre-
sponding instruction at 0x11b, just like the addition of 0x50 to the stack pointer (0x1ee8,
0x11f). The ”lmw” instruction reads three registers from the stack, and must again be done
in six instructions on the i386, because all registers are memory mapped. 0x1ef0 copies
the link register back, and 0x1ef4 returns to the caller. On the i386, this is a jump to the
address stored in the emulated link register, EBP.
Because the branch instruction at 0x1ec8 has 0x1ee4 as a target, which is the same as the
last basic block without the first instruction, the basic block has to be translated again, from
the second instruction on:
5.2 Pass 2
Pass 2 dramatically improves the code quality. Unfortunately, the implementation does not
produce correct code in all cases yet. Therefore another example has been chosen - the
iterative implementation of the Fibonacci function:
int fib(int f) {
    int a, b, temp;
    if (f < 2) return f;
    a = 0;
    b = 1;
    while (--f) {
        temp = a + b;
        a = b;
        b = temp;
    }
    return temp;
}
As before, the translated i386 code will be presented inline with the original PowerPC
instructions.
The register pushes are done to save all registers that are modified by this function. There is
no equivalent to this in the PowerPC code. Because of the different stack layout, accesses
to the caller’s stack frame would have to be adjusted, as described earlier, but this function
does not access the caller’s stack.
The function accepts one parameter (r3), which will be passed in the register array in mem-
ory. This i386 instruction reads the parameter into a register.
This is a compare sequence again. This time it is a signed compare, so the flags do not have
to be converted. As mapping of PowerPC condition registers to i386 registers has not been
implemented yet, the pass 2 recompiler produced code to store the flags in memory.
130 CHAPTER 5. EFFECTIVE CODE QUALITY AND CODE SPEED
00001eb4 or r0,r3,r3
230: 89 fe mov %edi,%esi
00001eb8 ble 0x1ee4
232: f6 05 70 ea 05 08 c0 testb $0xc0,0x805ea70
239: 0f 85 2f 00 00 00 jne 0x26e
As can be seen, code for conditional jumps still works as in pass 1. But this time, there is
no jump to the next block in case the conditional jump is not taken. Instead, the following
instruction (at address 0x23f) will be executed.
The "addi." instruction has been translated into a "dec", followed by a two-instruction
sequence to save the flags.
00001ec0 li r9,0x0
247: 31 db xor %ebx,%ebx
00001ec4 li r0,0x1
249: be 01 00 00 00 mov $0x1,%esi
00001ec8 beq 0x1ee0
24e: f6 05 70 ea 05 08 40 testb $0x40,0x805ea70
255: 0f 84 11 00 00 00 je 0x26c
00001ecc mtspr ctr,r3
25b: 89 f9 mov %edi,%ecx
Loading the value of zero is done using the more efficient "xor" instruction, which clears
a register on the i386. The count register has been mapped to an i386 register, as the
instruction at address 0x25b shows.
This is the loop inside the algorithm. The addition and the register moves were translated
quite efficiently, as all these instructions work on registers. The addition could also have
been done using "lea", which would have been faster, but this is not implemented yet. The
"bdnz" instruction, which decrements the count register and branches if it has not reached
zero, has been translated into two instructions; no condition codes are involved when using
"bdnz".
00001ee0 or r0,r2,r2
26c: 89 ee mov %ebp,%esi
00001ee4 or r3,r0,r0
26e: 89 f7 mov %esi,%edi
00001ee8 blr
270: 89 3d ac 61 05 08 mov %edi,0x80561ac
The last instruction writes the result back into the register array in memory, so that the
caller can read it into whichever i386 register it has mapped to this PowerPC register.
Finally, the caller's registers must be restored, and the function can return. Unfortunately,
pass 2 is not yet advanced enough to link a complete program together so that it can be
run, so inspection of the generated code must suffice. All that can be done now is to
compare the translated code with the code GCC produces when compiling the C program.
The important part is the loop, which looks like this:
.L9:
leal (%edx,%ebx), %ecx
decl %eax
movl %edx, %ebx
movl %ecx, %edx
jne .L9
There are two differences here. The first is that GCC makes use of "lea", which the
recompiler could do as well. What is impossible for the recompiler in its current design
is the other optimization GCC makes: the decrement and the conditional jump instruction
are separated, which gives the CPU more time to predict the conditional jump and fill
the pipeline. The code generated by the recompiler and the code generated by GCC are
nevertheless very similar and should have roughly the same performance. This means:
the speed of the recompiled code can be very close to the speed of native code, using
the relatively simple pass 2 optimizations. But this example might not be typical at all.
Real measurements would have to be made, but for that, pass 2 would have to be fully
implemented.
Chapter 6
Future
This thesis has addressed many problems of RISC to CISC recompilation and has presented
many concepts and ideas to solve them. But there are still some aspects of RISC to CISC
recompilation that have not been covered at all, and the implementation of the recompiler
led to some new ideas which have not been thoroughly tested or compared to other
methods, and for which there was not enough time for a complete implementation.
Therefore this chapter describes missing aspects, further ideas and other applications of
the technology that has been developed.
This section summarizes the most important ideas which can be a base for future research
on this field.
• Do certain side effects like the summary overflow flag have to be emulated at all1?
• What registers are used most? In this environment, libraries will be included in the
statistics, and code that is run more often will be counted more often.
It should be possible to gather statistics only on code that runs in user mode. The data
gathered can be used to design simplifications and to optimize the recompiler for the most
frequent cases.
6.2.3 Dispatcher
A PowerPC instruction dispatcher first has to look at the primary opcode. If it is not 19
or 31, this opcode corresponds to a certain instruction. If it is, the extended 9/10 bit op-
code field has to be evaluated. An implementation of a dispatcher will therefore need two
non-predictable branches until the handler for a 19/31 instruction is reached—and these
instructions form the bulk of the instruction set, which makes PowerPC interpretation and
recompilation slower than necessary.
However, a hand-optimized sequence in assembly language can reduce the number of non-
predictable branches to one, by using the ”cmov” instruction, which is available on all i686
compliant CPUs (i.e. since Intel Pentium Pro (1995) and AMD K7 (1999)).
// fetch
        mov     (%esi), %eax
        add     $4, %esi
// extract opcodes
        mov     %eax, %ebx
        mov     %eax, %edi
        and     $0xfc000000, %eax
        and     $0x000007fe, %ebx
// add values
        add     %ecx, %edx
At first, the primary and the extended opcode get extracted—although only the primary
opcode might be necessary. In the default case, if there is no extended opcode, the primary
opcode and the value of 0 will be copied into the two destination registers. If the primary
opcode is 19, the value of 64 and the extended opcode will be written into the destination
registers. In case of opcode 31, the value will be 64+2048. At the end, the two values in the
destination registers are added, so the result will be a value from 0 to 64+2*2048-1 (4159).
Primary opcodes get mapped to their own values (0 to 63); in case of opcode 19, the
extended opcode will be mapped to values from 64 to 2111, and in case of opcode 31, the
extended opcode will be mapped to values from 2112 to 4159.
The jump table will be a little over 16 KB in size; it can be halved if all handler functions
fit into 64 KB, so that 16 bit offsets can be stored in the table instead of full 32 bit pointers.
The costs of this dispatcher are an estimated 22 cycles on a modern i386 CPU, which
includes the pipeline stall caused by the jump.
As this idea came up pretty much at the last minute, it has neither been thoroughly tested
nor compared with alternatives. The implementation still uses the conventional method
with two sequential jumps.
recompilation can mark basic blocks that belong together and route the information about
the stack frame size to the block that includes the destruction code.
This is also an idea that should not slow down recompilation, but should dramatically
improve code quality.
Standard signature reconstruction of the function has to be done first. The output registers
(for now, these are all registers that have been written to, as well as all parameter registers)
and the input registers form the "def" and "use" sets, respectively, of the function call
instruction in the caller. After liveness analysis has been done for all callers, the real return
registers can be found by merging the sets of registers that are live just after each function
call instruction.
The disadvantage of this optimization is that if new code is detected that calls an existing
function, it might be the first caller to evaluate a certain return register, so the function
will have to be retranslated to write this return value back into memory. Additionally, this
method can only be done with global knowledge about the program. Again, the project log
[29] contains more details.
The register allocation algorithm in this project has been applied to functions. But it can
also be applied to a complete program. Liveness of a register can then range across a
function call. This effectively completely changes the interface and even softens the borders
between functions. In order for this method to be efficient, it is important to keep the live
ranges of a register apart, as described earlier, because every register certainly has many
live ranges within a program.
This idea is probably not very realistic: Global register allocation can only be applied if all
code is known, which cannot be guaranteed. Otherwise, there would be no possibility to
merge new code other than doing register allocation and translation again.
When mapping PowerPC condition codes to an i386 register, two condition codes can be
mapped to a single register if the i386 register is one of EAX, EBX, ECX or EDX, as these
can be accessed as two 8 bit registers each. As EAX and EDX are used as scratch registers,
the possible registers can only be EBX and ECX. The register usage statistics showed that
94% of all condition code accesses target cr0 or cr7. So the register allocation algorithm
could always regard cr0 and cr7 as an entity and merge them in the interference graph,
so that they will be mapped to the same register. It must be made sure though that the
condition registers are mapped to EBX or ECX only, but this is easy: After the allocation
has been done, the register that was mapped to EBX can be swapped with what cr0 and cr7
were mapped to.
Interestingly, combining cr0 and cr7 in the register usage statistics would lead to position
six in the list of most frequently used registers, pushing the link register to position seven.
The static register allocation might benefit from this method as well.
Condition codes can even be optimized on a higher level: with the help of liveness
information, it is possible to find out which condition register write instruction (compare
or "."-instruction) belongs to which condition register read instruction (conditional jump).
The conversion of the i386 flags into the signed representation can then be omitted if the
conditional jump is changed to do an unsigned evaluation.
Interrupting execution at any point and saving the registers is impossible for a recompiler
with basic block accuracy, because basic blocks are atomic: during the execution of a
translated basic block, there is no consistent state of the source CPU. In particular, there is
no way to reconstruct the contents of the source registers inside a basic block, unless register
mapped registers are written to memory and the function that tests whether an interrupt has
occurred is called after every instruction, which pretty much lowers the execution speed to
the level of a threaded code recompiler.
Instead, registers could be written back and the interrupt check function called only after
every block of execution, but this delays interrupts. Therefore, basic block recompilers are
typically not cycle exact3. In most cases, this level of accuracy is sufficient.
Memory Emulation When doing user mode emulation, a lot of memory addresses are
variable. The stack, for example, can be located at an arbitrary address, and the application
won't complain. A system emulator cannot just put the stack anywhere, but must always
emulate the exact content that the emulated register defines for the stack pointer. As this
address can point anywhere, it is generally not possible to put the emulated stack at the
same position.
The same is true for most other addresses: the emulated operating system may make
full use of the complete address space. So it is not possible, for example, to translate an
instruction that accesses memory at the address which is stored in a register into the equiv-
alent in the target language, because this address might not be available to the emulator as
it might be occupied by the operating system. Instead, addresses have to be adjusted every
time before they are used.
Furthermore, operating environments that use virtual memory are even more complicated
to emulate, as the complete memory management unit of the system has to be emulated
as well. For every memory access, the virtual address used by the application has to be
converted into a physical address. If this is done in software, it typically takes a minimum
of 50 cycles [55]. Doing it in hardware by configuring the host MMU to do the address
translation can be very complex or even impossible for certain combinations.
Simplifications Link Register emulation may not be simplified by placing target ad-
dresses into the emulated link register as it has been done in this project, because the
operating system will read and write to the link register and might even perform tests or
arithmetic on it. The same is true for the condition register. Emulation needs to be a lot
more exact, and fewer simplifications are possible.
native instruction set, is cycle exact although it translates blocks of code. This is achieved by a hardware
based rollback functionality that can undo all operations of the last basic block. The block is then interpreted
up to the instruction, after which the interrupt must be processed.
on whether the parameters are register mapped or memory mapped, and finally emits the
block of code with the corresponding registers and memory addresses inserted.
Most of this design can be quite easily implemented in hardware as well. The input of the
unit would be the instruction code. After extracting the opcode field(s), a table can be used
to look up the parameter fields of this instruction. Another lookup in the register mapping
table would decide whether the register operands are register or memory mapped. Yet
another table, indexed by the opcode and the mapping type of the registers, can contain the
blocks of i386 code and the information where to fill the registers and memory locations.
Detection of special meanings could be added as well, but this is more complicated. The
basic block logic and cache however should be easier to implement.
6.3 SoftPear
"Most people have two reasons for doing anything — a good reason, and
the real reason." - fortune-mod
As the main purpose of this project for me was to develop the basis of a recompiler for
the Open Source SoftPear project, which aims to make the Mac OS X operating system
available on i386 hardware, this will not be the end of this recompiler project. Although
"ppc to i386" is not very usable yet, it is a solid base for future development.
Appendix A
Chapter 2 refers to the maximum speed of an interpreter loop or a recompiler loop, and the
costs of the dispatcher. The following assembly program has been used to conduct these
measurements:
.global main
main:
        pusha
        mov     $constant, %esi
        mov     $1200000000, %ebp
interpreter_loop:
        mov     (%esi), %eax            // fetch
        xor     $4, %esi
        mov     %eax, %ebx              // dispatch
        and     $0x0000ffff, %ebx
        cmp     $1, %ebx
        ja      opcode_1_is_not_31
        mov     %eax, %ebx
        and     $0x0000ffff, %ebx
        jmp     *dispatch_table(,%ebx,4)
target1:
        nop
target2:
        mov     %eax, %ebx              // decode
        shr     $21, %ebx
        and     $31, %ebx
        mov     %eax, %ecx
        shr     $16, %ecx
        and     $31, %ecx
        mov     %eax, %edx
        shr     $11, %edx
        and     $31, %edx
        mov     registers(,%ebx,4), %eax // execute
144 APPENDIX A. DISPATCHER SPEED TEST
        or      registers(,%edx,4), %eax
        mov     %eax, registers(,%ecx,4)
        dec     %ebp
        jne     interpreter_loop
opcode_1_is_not_31:
        popa
        ret

.data
registers:
        .long 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .long 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
constant:
        .long 0
        .long 1
dispatch_table:
        .long target1
        .long target2
This loop basically does the same as the interpreter loop and therefore has the same speed,
but it is simpler. Instead of adding 4 to the program counter every time, it just jumps
between two fake instructions, which will be dispatched to target1 and target2. The jump
will therefore not be predictable, just like in a real interpreter loop. The number of iterations
has been set to the clock frequency of the CPU in Hz (1200 MHz in this case), so the time
in seconds that the program needs to execute equals the number of cycles per iteration. As is, the
program takes 28 cycles per iteration on a Duron 1200 MHz.
The program can be changed to measure the costs of the jump by simply removing it.
The measured costs of the program without the jump are 13 cycles, so the jump costs 15
cycles. When changing the "xor" instruction to modify a register other than ESI, the same
instruction gets interpreted every time, which has the same costs as if the jump were not
there, because the CPU can do branch prediction.
The costs of the dispatcher alone, or the costs of a recompiler loop, can also be measured
by removing the three execution instructions, or by replacing them with emitter code,
respectively.
Appendix B
This appendix contains the statistical data that has been used for the register usage fre-
quency graphs in chapter 3.
146 APPENDIX B. STATISTICAL DATA ON REGISTER USAGE
[1] Apple Computer, Inc.: Technical Note PT39: The DR Emulator, http://
developer.apple.com/technotes/pt/pt_39.html
[6] James C. Dehnert et al.: The Transmeta Code Morphing Software: Using Spec-
ulation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges,
http://citeseer.ist.psu.edu/dehnert03transmeta.html
[7] Anton Chernoff, Ray Hookway: DIGITAL FX!32 — Running 32-Bit x86 Applications
on Alpha NT, http://citeseer.ist.psu.edu/462062.html
[9] Leonid Baraz et al.: IA-32 Execution Layer: a two-phase dynamic translator designed
to support IA-32 applications on Itanium-based systems, http://citeseer.ist.
psu.edu/645329.html
[10] Sun Microsystems: The Java HotSpot Virtual Machine, v1.4.1, d2 — A Techni-
cal Whitepaper, http://java.sun.com/products/hotspot/docs/whitepaper/
Java_Hotspot_v1.4.1/JHS_141_WP_d2a.pdf
148 BIBLIOGRAPHY
[24] Cristina Cifuentes et al.: The University of Queensland Binary Translator (UQBT)
Framework, http://experimentalstuff.sunlabs.com/Technologies/uqbt/
uqbt.pdf
[27] Graham Toal: An Emulator Writer’s HOWTO for Static Binary Translation, http:
//www.gtoal.com/sbt/
[31] John L. Hennessy, David A. Patterson: Computer Organization and Design, 2nd ed.
Morgan Kaufmann, 1997
[34] Apple Inc.: Mach-O Runtime Architecture for Mac OS X 10.3, http:
//developer.apple.com/documentation/DeveloperTools/Conceptual/
MachORuntime/MachORuntime.pdf
[37] Advanced Micro Devices, Inc.: AMD Athlon TM Processor x86 Code Op-
timization Guide, http://www.amd.com/us-en/assets/content_type/white_
papers_and_tech_docs/22007.pdf
[38] Intel Corporation: IA-32 Intel Architecture — Optimization Reference Manual, ftp:
//download.intel.com/design/Pentium4/manuals/24896611.pdf
[39] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman: Compilers — Principles, Techniques,
and Tools, Addison-Wesley, 1986
[42] Jochen Liedtke et al.: An Unconventional Proposal: Using the x86 Architecture
As The Ubiquitous Virtual Standard Architecture, http://i30www.ira.uka.de/
research/documents/l4ka/ubiquitous-vs-arch.pdf
[43] Cristina Cifuentes, Vishv Malhotra: Binary Translation: Static, Dynamic, Retar-
getable?, http://citeseer.ist.psu.edu/cifuentes96binary.html
[45] Georg Acher: JIFFY - Ein FPGA-basierter Java Just-in-Time Compiler für
eingebettete Anwendungen, http://tumb1.biblio.tu-muenchen.de/publ/diss/in/
2003/acher.pdf
[50] Stealth, Halvar Flake, Scut: Spass mit Codeflow Analyse - neuer Schwung für
Malware, http://www.ccc.de/congress/2002/fahrplan/event/392.de.html
[52] Jon Stokes: RISC vs. CISC: the Post-RISC Era — A historical approach to the debate,
http://arstechnica.com/cpu/4q99/risc-cisc/rvc-6.html
[54] Tom Thompson: Building the Better Virtual CPU — Two different designs achieved
the same goal: a faster 680x0 emulator for the Mac, http://www.byte.com/art/
9508/sec13/art1.htm
[59] Raymond Chen: Why does the x86 have so few registers?, http://blogs.msdn.
com/oldnewthing/archive/2004/01/05/47685.aspx
[60] Crystal Chen, Greg Novick, Kirk Shimano: RISC Architecture, http://cse.
stanford.edu/class/sophomore-college/projects-00/risc/
[63] John Mashey: Re: RISC vs CISC (very long), comp.arch newsgroup posting, mes-
sage ID <2419@spim.mips.COM>, http://groups.google.com/groups?selm=
2419%40spim.mips.COM&output=gplain
[65] Gwenole Beauchesne: Re: [dynarec] softpear, Yahoo Groups posting, http://
groups.yahoo.com/group/dynarec/message/518