Mambo - A Full System Simulator for the PowerPC Architecture
Patrick Bohrer Mootaz Elnozahy Ahmed Gheith Charles Lefurgy Tarun Nakra
James Peterson Ram Rajamony Ron Rockhold Hazim Shafi Rick Simpson
Evan Speight Kartik Sudeep Eric Van Hensbergen
Lixin Zhang
Abstract

Mambo is a full-system simulator for modeling PowerPC-based systems. It provides building blocks for creating simulators that range from purely functional to timing-accurate. Functional versions support fast emulation of individual PowerPC instructions and the devices necessary for executing operating systems. Timing-accurate versions add the ability to account for device timing delays, and support the modeling of the PowerPC processor microarchitecture. We describe our experience in implementing the simulator and its uses within IBM to model future systems, support early software development, and design new system software.

1 Introduction

Full system simulators have emerged during the past decade as viable tools for low-level system software development and performance evaluation. Earlier, our team adapted the SimOS simulator platform [8] to support the PowerPC architecture [5]. While our experience was successful, it also showed the need for an industry-strength implementation that is more configurable and amenable to the rigors of the software engineering life cycle. Therefore we started Mambo, a modular full system simulator that is designed from the ground up to simulate the PowerPC line of processors [6]. The implementation supports different simulation modes, ranging from functional simulation of the PowerPC instructions, to cycle-accurate simulation of an entire system. Mambo also includes trace collection and debugging interfaces to allow detailed analysis of the simulated hardware and software. Seven processors of the PowerPC line are supported, including the 32-bit embedded 405GP [7] and the 64-bit 970 PowerPC used in Apple’s new G5 system [1]. The processor support includes interrupts, debugging controls, caches, busses, and a large number of architectural features. In addition, Mambo models memory-mapped I/O devices, consoles, disks, and networks that allow the simulated operating systems to boot and run programs.

To fill our needs, our design stresses modularity and configurability. Modularity is achieved by an internal structure that features a modern, multithreaded simulation core. This in turn is enhanced with various programming constructs that support a modular and highly maintainable design. The constructs implement higher-level abstractions to express the usual characteristics of simulated systems, such as pipelined execution units, and programmers use these abstractions to quickly model different system behaviors. Due to this modularity, our team is able to experiment with simulator enhancements and performance improvement, and quickly introduce them into the simulator with minimal perturbation to the production mode operation.

The second feature stressed in the Mambo design is configurability. Mambo is designed as a collection of configuration features that can be selected to easily define a variety of processors and devices. Compile time and runtime parameters allow users to configure nearly every feature of the system being simulated. Compile time options define major features (such as 32-bit or 64-bit support), while runtime options set fine-grained parameters such as amount of memory, number of processors, cache geometry, etc. A partial list of selectable features includes:

32-bit or 64-bit processor design.
Floating point registers and instructions.
Vector Multimedia Extension (VMX) registers and instructions.
Hardware Multi-Threading (SMT) [12].
PCI bus.
IDE disks.
Network.
Caches (L1, L2, L3, and victim).
Bus.
Memory.
UART and console support.
Hypervisor support.
Address translation (ERAT, SLB, TLB) [6].
Uniprocessor or multiprocessor.

The simulator runs on the x86 and PowerPC platforms running a range of operating systems including Linux, AIX, OS/X, and Windows. It uses Tcl/Tk to provide a command language and graphical user interface and DiskSim [3] to provide timing-accurate disk models.

We have used Mambo successfully for a variety of purposes, including support of operating system development, system bringup, characterization of application performance and power consumption, performance tuning, and pre-hardware application development. In Section 2 we describe our experience with Mambo in more detail. We then describe in Section 3 the implementation of the simulation and conclude the paper in Section 4.

2 Experience with Mambo

Like other full system simulators, Mambo has proved useful in software development and application characterization. In some cases, the simulator served as a platform to enable software development before the hardware is available. As an example, a team of researchers at IBM was able to develop the software for Blue Gene/L [11] [4] [2] so that when the hardware became available, programs were running on the first day, and the system was usable within a week. Similar uses are also underway for several architectures and systems under development.

It is noteworthy that Mambo is useful for software development even if the hardware is available. For example, developing low-level system software such as operating systems on the bare hardware is time consuming. Mambo includes an interface to gdb, allowing source-level debugging from the very first instruction of the operating system. gdb attaches to Mambo so that developers can use the normal gdb interface to debug the simulated operating system. The simulator can single step through code that cannot normally be traced in this way, such as an operating system’s first level interrupt handler. A team of researchers at IBM has used the simulator to support the development of the K-42 operating system [10]. In their experience, the simulator has advanced their development schedule by about a year.

Mambo also can enhance the software-hardware co-design process. For example, new hardware features such as SMT or hypervisor support can be modeled and low-level system software can be developed to examine the use of such features before they are finalized into hardware. Our experience shows that this approach has several benefits that straddle software and hardware. For example, our experience shows that using Mambo early in the hardware design process to model the new feature forces the designers to define the feature well enough to be programmed. The feedback from the model implementation and the software experience with the feature may uncover errors, missing functionality or areas that were not well understood. Traditionally, such problems are not uncovered until a detailed VHDL model of the hardware is built, or even after system software has been implemented on the finalized hardware platform. For instance, in the early design of a PowerPC processor, Mambo revealed a race condition that required changing the semantics of several bits in a control register. Also, the hardware features of a hypervisor design had to be updated based on the implementation of the operating system on the modeled hypervisor.

The second category of using Mambo is in application characterization. Mambo produces a variety of statistics, both in summary and detailed form, allowing the performance and operation of a program to be understood and evaluated for a new hardware architecture. By associating performance-affecting hardware events (e.g., cache misses, TLB shoot downs, and memory references) with the program instruction stream, it is possible to identify under-performing portions of a program and correlate the performance problems with resource usage. This may allow significant performance improvement by changing a data structure or the position of an inner loop to reflect the cache architecture. These features provide an infrastructure for characterizing application and system behavior and performance.

We have extended the characterization to the emerging field of power-aware computing [9]. With the help of power estimates for the various tasks associated with execution of instructions, an analysis of the total power consumed in the core and memory subsystem can be carried out. Then, one can use this information to identify opportunities for reducing processor speed (e.g., during memory-intensive instructions) or modifying the application structure to reduce power consumption.

3 Implementation

3.1 Operating System Adaptation

While Mambo is capable of booting unmodified operating systems such as Linux, detailed simulation of peripherals is time intensive to implement and slows down simulation. To improve run-time when detailed device simulation is not necessary, several changes are made to the simulated operating system to allow more direct interaction with Mambo. A direct block driver interface allows disk images on the simulation host to be used by the simulated operating system, and a virtual Ethernet interface is added that can either communicate to other simulated hosts or to real networks. Other changes include process tracking hooks, which interact with Mambo statistics gathering infrastructure.

Figure 1 shows a screenshot of Mambo booting Linux on a PowerPC 750 system. The UART0 window shows the simulated console and the xterm window shows the Mambo command line. Other windows show the GUI interface and a statistic gathering tool. The GUI ensures ease of use and
quick identification of performance bottlenecks.

Figure 1: Mambo Graphical User Interface during a Linux boot.

3.2 Timing Models

Mambo provides a variety of timing models for software development and for hardware and software performance evaluation. The simplest timing model assumes each instruction requires one cycle to execute. Memory accesses are synchronous and instantaneous. This is a purely functional model, and is useful for software development and debugging when a precise measure of execution time is not important. Even in this mode, some system features require timing support. For example, I/O interrupts and timer interrupts are scheduled to provide at least a crude sense of the passing of time. These inaccuracies are tolerated given the intended use of the functional model. This use trades accuracy for increased processing speed. For instance, a functional model of the 405GP processor executing on a 3.2GHz, x86 system can simulate an average of 4 million PowerPC instructions per second.

For accurate performance evaluation, Mambo provides a cycle-accurate timing model. A cycle-accurate timing model requires a complete modeling of the operation of the processor including its pipeline and functional units. Each operation takes a number of cycles to complete and must consider both processing time (the time to search a cache, for example) and resource constraints (e.g., an instruction cannot be issued to an add unit if that add unit is already in use). This mode of operation provides good accuracy at the expense of longer simulation time. A cycle accurate model of the 405GP processor was validated to be within 0.6% of real hardware, but ran four times longer than the functional model, which was off by 26% against the real hardware [9]. For more complex processors, the slowdown of the cycle accurate model compared to the functional model can be 10 times or more.

A compromise between the fast, but inaccurate, functional model and the slower, but accurate, cycle-accurate model is the cycle-approximate model. This model uses probabilistic measures to improve timing estimates. For example, a memory reference may (or may not) hit in the cache. A cache hit takes a different amount of time than a cache miss. In the cycle-accurate model, it is necessary to model the cache, allowing Mambo to determine exactly if a particular reference is in the cache. The cycle-accurate model knows if there is a cache hit or miss. The cycle-approximate model does not model the cache (hence providing a faster simulation), but probabilistically determines the time for the access from user-supplied cache hit ratios as well as a predetermined time for a cache hit and cache miss. We are currently adding this model to the infrastructure.

3.3 Multithreaded Simulator Structure

To simplify the development effort while still accurately modeling hardware events, we structured Mambo as an internal thread programming model, allowing instruction execution code to simply pause in place (delay) as necessary. For instance, the main function of a cache refill request simply looks as follows:

[...] thread and returns asynchronously to the caller. Counters can be used to synchronize different aspects of the interaction between the caller and the worker.
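
As an illustration of the cycle-approximate timing model described in Section 3.2, the following Python sketch draws the latency of each memory reference from a user-supplied hit ratio and fixed hit and miss times, without modeling the cache contents. It is not Mambo code: the function names, the 0.95 hit ratio, and the 2-cycle and 100-cycle latencies are all assumptions chosen only for illustration.

```python
import random


def approximate_access_cycles(hit_ratio, hit_cycles, miss_cycles, rng):
    """Latency of one memory reference, decided probabilistically.

    No cache state is kept; the outcome depends only on the
    user-supplied hit ratio (hypothetical parameter names).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Treat the reference as a hit with probability hit_ratio.
    return hit_cycles if rng.random() < hit_ratio else miss_cycles


def expected_access_cycles(hit_ratio, hit_cycles, miss_cycles):
    """Long-run average latency implied by the same parameters."""
    return hit_ratio * hit_cycles + (1.0 - hit_ratio) * miss_cycles


# Illustrative values only: 95% hit ratio, 2-cycle hit, 100-cycle miss.
rng = random.Random(42)  # fixed seed so runs are repeatable
samples = [approximate_access_cycles(0.95, 2, 100, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)

print("sampled mean cycles per access:", mean)
print("analytic mean cycles per access:", expected_access_cycles(0.95, 2, 100))
```

Over many references the sampled mean approaches the analytic average, which is why such a model can approximate aggregate timing while skipping the cost of tracking which lines are actually cached. It cannot reproduce the behavior of any individual reference, since hits and misses are not tied to addresses.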
References
[1] Apple Computer Inc. Apple Power Mac G5, 2004.
[2] L. R. Bachega, J. R. Brunheroto, L. DeRose,
P. Mindlin, and J. E. Moreira. The BlueGene/L Pseudo Cycle-
accurate Simulator. In Proceedings of the IEEE International
Symposium on Performance Analysis of Systems and Software
(ISPASS), March 2004.
[3] J. S. Bucy and G. R. Ganger. The disksim simulation
environment version 3.0 reference manual. Technical Report
CMU-CS-03-102, Carnegie Mellon University, 2003.