Measuring Experimental Error in Microprocessor Simulation
[Figure: pipeline diagram. Stages: Fetch, Slot, Map, Issue, Regread, Execute, W-back, Retire. Associated structures and events: 2-way I-cache, line predictor, way predictor, branch predictors (L, G, C), load-use predictor, store-wait predictor, rename table, scoreboard, subcluster assignment, cross-cluster (X-cluster) bypass, D-cache access, jump squash, load-use rollback, line predictor update, branch predictor update, store commit, architectural register commit.]
instruction                           latency
integer ALU                           1
integer multiply                      7
integer load (cache hit)              3
FP add, multiply                      4
FP divide/sqrt (single precision)     12/18
FP divide/sqrt (double precision)     15/33
FP load (cache hit)                   4
unconditional jump                    3
3 Microbenchmark suite
had been run on simple tests but not validated. We call that early version sim-initial. The fifth and sixth columns contain the IPC values and errors for our most current, validated version of sim-alpha. The right-most two columns contain the IPC values generated by the SimpleScalar simulator, which we discuss later. All errors are computed as a percentage difference in CPI.
The microbenchmarks running on sim-initial
show a mean error of 74.7% compared to the reference
machine. The mean errors are computed as the arithmetic
mean of the absolute errors. While some of the errors are
negligible, in particular simple timing cases like E-F and
E-D1, most of the microbenchmarks show errors of 20%
or greater. The C-Ca, C-Cb, and C-R microbenchmarks all
underestimate performance by over 100%. While most of
the microbenchmarks underestimate performance, E-DM1
and C-S1 overestimate performance by 85.7% and 31.2%,
respectively.
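To make the metric concrete, the sketch below computes a per-benchmark error and the mean of the absolute errors from measured IPC values. It is our reading of the text, not the authors' code; in particular, the sign convention (negative when the simulator underestimates performance) and the helper names are assumptions inferred from the table.

    #include <math.h>
    #include <stdio.h>

    /* Convert IPC to CPI and report the relative CPI difference; negative
     * means the simulator underestimates performance (simulated CPI is
     * higher than the hardware's).  Sign convention is our assumption. */
    static double cpi_error_pct(double ipc_hw, double ipc_sim)
    {
        double cpi_hw  = 1.0 / ipc_hw;
        double cpi_sim = 1.0 / ipc_sim;
        return 100.0 * (cpi_hw - cpi_sim) / cpi_hw;
    }

    int main(void)
    {
        /* C-Cb from Table 2: hardware IPC 1.87, sim-initial IPC 0.52 */
        printf("C-Cb error: %.1f%%\n", cpi_error_pct(1.87, 0.52)); /* about -260% */

        /* The reported mean is the arithmetic mean of the absolute errors. */
        double err[] = { -498.1, -260.4, -198.4, 31.2 };  /* illustrative subset */
        double sum = 0.0;
        for (int i = 0; i < 4; i++)
            sum += fabs(err[i]);
        printf("mean |error| over subset: %.1f%%\n", sum / 4);
        return 0;
    }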
We used a variety of strategies to discover and eliminate the sources of error in sim-initial. We first made all resources in the pipeline perfect, and then searched for performance bottlenecks. In addition to measuring total execution time, we also monitored event counts, such as mispredictions requiring rollback in various predictors. Below, we discuss some of the errors we discovered and fixed in sim-initial to improve its accuracy with respect to the reference machine.
Instruction Fetch: Since the front end of the 21264 is
the most complicated component of the processing core,
with many interacting state machines, it is unsurprising
that most of the errors occurred there. We addressed the C-C and C-R errors first, since they were the largest in magnitude. The most significant contributor to the C-C and C-R errors was an excessive branch misprediction penalty.
sim-initial waited until after the execute stage to discover a line misprediction and initiate a full rollback. We
determined that there is an undocumented adder that
resides between the fetch and slot stages. That adder computes the target for PC-relative branches early, and is used
by the branch predictor to override the line predictor in
some cases.
We experimentally determined the rules for the interaction between the line predictor and the branch predictors.
The branch predictor will overrule the line predictor on
conditional or unconditional branches (not jumps) if it predicts taken, can compute the target early, and the target
computation disagrees with the line prediction. Furthermore, we chose the initialization bits for the line predictor
(01) that minimized error, and adjusted the line predictor
state machine to minimize error as well. When testing the
macrobenchmark eon, we noticed that performance was
extremely poor compared to our reference. That benchmark exhibits an unusually high number of way mispredictions.
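The override condition can be summarized as a simple predicate. The sketch below is our encoding of the experimentally determined rule stated above; the struct fields and function names are invented for illustration and are not taken from sim-alpha.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical encoding of the line-predictor override rule. */
    typedef struct {
        bool     is_branch;          /* conditional or unconditional branch, not a jump */
        bool     predicted_taken;    /* branch predictor says taken */
        bool     target_known_early; /* PC-relative target computed by the early adder */
        uint64_t branch_target;      /* target from the adder between fetch and slot */
        uint64_t line_pred_target;   /* next fetch address chosen by the line predictor */
    } fetch_pred_t;

    /* True when the branch predictor overrides the line predictor. */
    static bool branch_overrides_line(const fetch_pred_t *p)
    {
        return p->is_branch &&
               p->predicted_taken &&
               p->target_known_early &&
               p->branch_target != p->line_pred_target;
    }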
The first of the execution core benchmarks, execute-independent (E-I), simply adds the index variable to eight independent, register-allocated integers twenty times each within a loop. The absence of memory operations, control hazards, or data dependences should allow close to ideal throughput on this microbenchmark. The second execute microbenchmark, execute-float-independent (E-F), performs the same computation as E-I, except on floating-point variables. The third execute microbenchmark, execute-dependent-n (E-Dn), implements n dependent chains of register-allocated integer additions within a loop. Each arithmetic instruction in the loop is dependent on the instruction n positions earlier. E-DM1 is simply E-D1 using multiply instructions instead of adds.
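To make the kernel structure concrete, the following is a hypothetical C rendering of E-I and E-D1. The actual suite is not reproduced in the text, so the variable names, loop bounds, and exact loop structure here are illustrative only.

    /* E-I: the loop index is added to eight independent, register-allocated
     * integers twenty times each, so the loop body has no memory, control,
     * or data hazards.  E-D1: one dependent chain, where every add depends
     * on the add immediately before it. */
    void execute_independent(int iters)
    {
        register int r0 = 0, r1 = 0, r2 = 0, r3 = 0,
                     r4 = 0, r5 = 0, r6 = 0, r7 = 0;
        for (int i = 0; i < iters; i++)
            for (int k = 0; k < 20; k++) {   /* unrolled in a real kernel */
                r0 += i; r1 += i; r2 += i; r3 += i;
                r4 += i; r5 += i; r6 += i; r7 += i;
            }
        volatile int sink = r0 + r1 + r2 + r3 + r4 + r5 + r6 + r7; /* keep results live */
        (void)sink;
    }

    void execute_dependent_1(int iters)
    {
        register int acc = 0;
        for (int i = 0; i < iters; i++)
            acc += i;                        /* E-DM1 uses a multiply instead */
        volatile int sink = acc;
        (void)sink;
    }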
Table 2: Microbenchmark IPC and error for the Alpha 21264, the initial simulator (sim-initial), the validated simulator (sim-alpha), and SimpleScalar 3.0b (sim-outorder).

             Alpha 21264   sim-initial             sim-alpha              sim-outorder
benchmark    IPC           IPC      % error        IPC      % error       IPC      % difference
C-Ca         1.80          0.38     -498.1%        1.87       4.3%        3.17      28.2%
C-Cb         1.87          0.52     -260.4%        1.87       0.6%        3.00      37.8%
C-R          2.65          0.89     -198.4%        2.66       0.3%        3.54      25.2%
C-S1         0.56          0.81       31.2%        0.60       6.4%        0.88      36.1%
C-S2         0.85          0.82       -3.6%        0.86       2.1%        1.33      36.5%
C-S3         0.95          0.87       -8.5%        0.95       0.5%        1.64      42.2%
C-CO         1.75          0.53     -273.6%        1.74      -0.6%        2.05       3.0%
E-I          4.00          3.31      -20.9%        3.99      -0.4%        3.99      -0.4%
E-F          1.01          1.01       -0.1%        1.01       0.2%        1.01       0.2%
E-D1         1.03          1.04        0.3%        1.04       0.4%        1.04       0.4%
E-D2         2.16          2.15       -0.0%        2.15       0.0%        2.21       2.6%
E-D3         2.72          2.99        9.3%        3.07      11.5%        3.19      14.8%
E-D4         2.79          2.89        3.6%        2.80       0.3%        4.00      30.2%
E-D5         3.30          3.23       -2.1%        3.50       5.8%        4.00      17.6%
E-D6         3.11          3.31        6.1%        3.15       1.3%        4.00      22.2%
E-DM1        0.15          1.04       85.7%        0.15      -0.3%        0.15      -0.3%
M-I          2.98          2.39      -24.2%        2.99       0.6%        3.00       0.7%
M-D          1.66          1.25      -32.9%        1.66       0.4%        1.26     -31.1%
M-L2         0.36          0.34       -4.0%        0.35      -0.9%        0.55      35.6%
M-M          0.07          0.07       -8.2%        0.08       4.2%        0.07      -0.3%
M-IP         1.75          0.89      -97.9%        1.76       0.5%        1.22     -43.1%
Mean                                  74.7%                   2.0%                  19.5%
The 11.5% error in E-D3 is due to a minor approximation in the way we implement bypassing; when an instruction is bypassed, we subtract the latency that is saved as a
result of the bypass from the latency of the execution. That
simplification results in different issue orders than might occur in the actual 21264 core. Future versions of sim-alpha will contain a scheduler that accurately represents
the 21264 instruction issue policies.
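A minimal sketch of that approximation, as we read it (not the sim-alpha scheduler itself; the function and parameter names are ours):

    /* When the consumer is bypassed, the cycles the bypass saves are simply
     * subtracted from its execution latency, instead of modeling the 21264's
     * actual bypass-network timing. */
    static int effective_exec_latency(int exec_latency, int bypassed, int cycles_saved)
    {
        return bypassed ? exec_latency - cycles_saved : exec_latency;
    }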
Memory: We originally noticed an unusually high
number of load traps in sim-alpha resulting from multiple loads to the same address executing out of order, and
thus violating the coherence requirements built into the
21264 memory system. We hypothesized that the simulator was too conservative because it masked out the lower
three bits of the addresses before comparing them in the
load-trap identification logic. The error dropped dramatically in M-D when the entire memory address was used to detect these conflicts. We also found that the L2 latency shown in M-L2 was a cycle longer than that specified in the Compiler Writer's Guide. That anomaly was found to
be a modeling error in which the simulator charged too
many cycles for the register read stage on loads that
missed in the cache. We were also charging one cycle too
few for recovery upon load-use mis-speculation, which we
discovered with the M-D benchmark. Finally, we did not
initially implement the store-wait table, expecting it to
make only a small difference in performance. However,
when we observed the large number of store replay traps
in the C-R benchmark, we implemented that table and
noticed a precipitous drop in error. The results in Table 2
for sim-initial include the store-wait table.
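The granularity issue is easy to see in code. The sketch below is our illustration of the two comparison policies described above (the function names are ours, not from the simulator): masking the low three bits makes any two accesses within the same 8-byte block look like a conflict, while comparing full addresses does not.

    #include <stdbool.h>
    #include <stdint.h>

    /* Original, overly conservative check: compare at 8-byte granularity. */
    static bool load_trap_conflict_masked(uint64_t addr_a, uint64_t addr_b)
    {
        return (addr_a & ~7ULL) == (addr_b & ~7ULL);
    }

    /* Corrected check: compare the entire memory address. */
    static bool load_trap_conflict_full(uint64_t addr_a, uint64_t addr_b)
    {
        return addr_a == addr_b;
    }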
A substantial challenge for any microprocessor simulator is to replicate the behavior of the DRAM and virtual
memory systems. Access latency in modern DRAMs, such
as synchronous and Rambus DRAM, is highly dependent
on the stream of physical addresses presented to them,
which in turn depends on the virtual to physical page mappings. Like many other microprocessor simulators, such as
SimpleScalar [4] and RSim [17], sim-alpha does not simulate past the system call boundary, so replicating the page mappings of the native system is difficult if not
impossible. Complete system simulators, such as SimOS
[19] or SimICS [14], suffer from the same problem, as the
page mappings in the native machine depend on the set of
allocated pages prior to starting and measuring the
selected program. These mismatches between simulated
and native page mappings can cause non-cache-resident
benchmarks to experience error due to the variable DRAM
access time. Our challenge, then, is to match the specifications and observable behavior of our memory system as
closely as possible with that of the native system, and
5 Macrobenchmark validation
In this section, we attempt to quantify two important
properties of any experimental simulation framework:
accuracy and stability. To quantify accuracy, we execute
ten of the SPEC2000 benchmarks on the DS-10L workstation, and compare the resultant performance against the
performance of a number of simulators configured like the
DS-10L. We measure how the addition and removal of ten
Table 3: Macrobenchmark IPC and error for the Alpha 21264, sim-alpha, sim-stripped, and sim-outorder.

            Alpha 21264   sim-alpha             sim-stripped           sim-outorder
benchmark   IPC           IPC      % error      IPC      % difference  IPC      % difference
gzip        1.53          1.28     -22.01       1.07     -51.52        2.28      28.56
vpr         1.02          0.99      -4.63       0.74     -44.12        1.62      34.04
gcc         1.04          0.90     -18.07       0.84     -42.33        1.89      37.20
parser      1.18          0.97     -23.09       0.89     -42.01        2.00      37.05
eon         1.21          1.21      -0.92       0.96     -34.10        2.08      38.29
twolf       1.10          1.07      -6.07       0.84     -42.09        1.76      32.25
mesa        1.57          1.17     -38.37       1.04     -62.10        2.59      36.80
art         0.48          0.82      43.04       0.82      39.75        2.14      76.89
equake      1.02          0.94     -10.94       0.83     -32.71        1.69      34.60
lucas       1.57          1.37     -14.74       1.44      -9.96        1.79      11.54
mean        1.05          1.05      18.19       0.92      40.07        1.95      36.72
which flush the pipeline on MSHR conflicts and concurrent references to two blocks that map to the same place in the cache.
5.1 Accuracy
In Table 3, we compare the performance of the DS-10L
against three simulators. We used ten of the SPEC2000
benchmarks, which were compiled with the native Alpha
compilers, using -arch ev6 -non_shared -O4 for each
benchmark. We ran all of the benchmarks to completion
with the standard test input sets from the SPEC distribution. The three simulators that we compare are the following:
Table 4: Effect on sim-alpha of removing each of the ten features individually (the ref column includes all features).

                 ref    addr    eret    luse    pref    spec    stwt    vbuf    maps    slot    trap
sim-alpha IPC    1.05   0.98    1.10    0.99    1.05    0.99    1.00    1.05    1.07    1.05    1.05
% change               -7.78   -0.67   -5.79   -0.29   -5.92   -4.25   -0.37    2.11    0.36    0.31
std. deviation          5.81    1.09    2.52    1.27    5.07    5.60    1.07    2.85    1.64    0.99
In Table 4, we show the effect that each of the ten individual features from the previous subsection has on overall performance. The ref column corresponds to sim-alpha with all of the features, while the rest of the columns represent sim-alpha minus only the feature listed in the column heading. In the first row, we list the harmonic mean of the macrobenchmark IPC values for each configuration. In the second row, we show the mean percent change in performance compared to sim-alpha, which results from removing each feature. The third row displays the standard deviation of the changes in performance across the benchmarks for each configuration.
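For clarity, the sketch below shows how we read those three summary rows: harmonic-mean IPC per configuration, the mean per-benchmark percentage change relative to the full sim-alpha, and the standard deviation of those per-benchmark changes. The function names are ours, and the exact averaging the authors used is an assumption.

    #include <math.h>

    /* Harmonic mean of the per-benchmark IPC values for one configuration. */
    static double harmonic_mean_ipc(const double *ipc, int n)
    {
        double recip_sum = 0.0;
        for (int i = 0; i < n; i++)
            recip_sum += 1.0 / ipc[i];
        return n / recip_sum;
    }

    /* Mean and standard deviation of the per-benchmark % change versus the
     * reference (full sim-alpha) configuration. */
    static void change_stats(const double *ref_ipc, const double *cfg_ipc, int n,
                             double *mean_pct, double *stddev_pct)
    {
        double sum = 0.0, var = 0.0;
        for (int i = 0; i < n; i++)
            sum += 100.0 * (cfg_ipc[i] - ref_ipc[i]) / ref_ipc[i];
        *mean_pct = sum / n;
        for (int i = 0; i < n; i++) {
            double pct = 100.0 * (cfg_ipc[i] - ref_ipc[i]) / ref_ipc[i];
            var += (pct - *mean_pct) * (pct - *mean_pct);
        }
        *stddev_pct = sqrt(var / n);
    }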
Performance drops significantly when any of four particular features is disabled, as they each independently provide more than 4% in performance to sim-alpha. These features are the jump adder (7.8%), load-use speculation (5.8%), speculative predictor update (5.9%), and store-wait bits (4.3%). Of the performance-constraining features, the only one that affects performance by more than 1% is map-stage stalling. When those stalls are removed from sim-alpha, performance increases by 2.1%. Finally, we note that the variability is high: all of the standard deviations, which represent the degree to which the percentage improvements of the optimizations vary across the benchmarks, are greater than one percent. The standard deviations for the jump adder, speculative update, and store-wait bits are particularly high, more than 5% in each case.
5.3 Stability
When a feature or idea is evaluated on a simulator, that feature is stable if it provides similar benefits or improvements across other simulators and environments. Detailed, validated simulators may be unnecessary if new features are stable across more abstract, yet unvalidated, simulators. Conversely, an added feature that is unstable may appear to provide benefits on an inaccurate simulator, while on a validated simulator, the benefits might disappear or even reverse. In this subsection, we measure the stability of four different parameter sets across a range of simulator configurations.
In Table 5, we show the change in performance when
three improvements are made: reducing the L1 D-cache
access latency from three cycles to one, increasing the L1
D-cache size from 64KB to 128KB, and doubling the
number of physical registers. Each column displays the
Table 5: Performance change from each of the three optimizations across simulator configurations.

Optimization               sim-alpha  addr   eret   luse   pref   spec   stwt   vbuf   maps   slot   trap   sim-strip.  sim-out
3 to 1-cycle L1 D$         5.53       5.45   5.98   n/a    6.25   5.45   6.49   6.42   5.90   5.25   5.95   9.85        5.78
64KB to 128KB L1 D$        2.04       1.72   2.03   1.70   2.23   1.96   2.43   2.14   2.02   1.55   1.38   1.70        0.66
40 to 80 physical regs.    0.63       0.91   0.53   0.63   0.98   1.07   1.44   0.55   0.88   1.27   0.95   0.64        0.23
6 Related work
The effort most similar to our own was the study performed by Black and Shen [2], which validated a performance model of a PowerPC 604 microprocessor. Their
validation efforts benefited from the performance counters
on the 604, which allowed them to track individual
instructions, pairs, and combinations through the pipeline
and compare the cycles consumed to those in their performance simulator. Since DCPI can measure only a few
events in addition to cycle and instruction commit counts,
we were restricted to running assembly tests for numerous
iterations to isolate the behavior of instruction combinations. While the 604 validation study achieved low (4%
mean) errors, it assumed a perfect L2 cache and could only
measure performance of small, cache-resident benchmarks. Our work adds to their efforts by modeling a more
complex microarchitecture, which contains seven full predictors, running memory-intensive benchmarks in addition
to our kernels, and isolating the performance contributions
of distinct microarchitectural features.
Gibson et al. [9] described a validation of the FLASH
multiprocessor hardware against two software processor
simulators (Mipsy and MXS) coupled with two internal
memory system simulators (Flashlite and NUMA), and
using SimOS to model OS performance effects. The
authors' validation focused more on the memory system
than on the microarchitecture, as Mipsy does not model
pipelines, while MXS models a generic pipeline, rather
than the R10000 that was in their machine. The authors
found that TLB behavior had a surprisingly substantial
effect on performance and point out that OS page coloring
can reduce cache misses. As we describe in Section 4,
sim-alpha does not account for the TLB overheads correctly, nor does it model any effects of page coloring.
Reilly and Edmonson's work on a performance model
for the Alpha 21264 was intended to enable quick exploration of the design space, rather than model the microarchitecture in detail [18]. Finally, Bose and Conte discuss
performance evaluation from a design perspective and
suggest the use of microbenchmarks, as well as the com-
[Figure: comparison of Cruz et al. [7] and sim-alpha under three register file configurations (1 cycle, full bypass; 2 cycle, full bypass; 2 cycle, partial bypass) across go, compress, gcc, ijpeg, perl, swim, mgrid, applu, turb3d, fpppp, wave5, and the harmonic mean.]
7 Conclusions
Because the architecture research community relies so heavily on simulation, it is disconcerting to think that our simulators may be highly inaccurate due to abstraction, specification, or modeling errors. Many of the studies published in our conferences and journals may report results that are unintentionally erroneous. Architecture ideas may show promising results merely because of simulator pipeline artifacts, performance bugs, or unrealistic baselines.
It is also possible that our unvalidated simulators are sufficiently accurate, errors balance out, and we can trust the results we obtain. Injecting new ideas into the literature may be more important than a quantitative evaluation to the second or third decimal place of precision. In that case, the community (including the authors of this study) could certainly benefit from fewer late nights spent producing gigabytes of simulation data.
The answer to the question of simulator validation depends on the conventionality of the research approach. A fundamentally new computer organization or technology cannot be evaluated in the context of a conventional superscalar processor. Furthermore, no baseline exists against which the simulation of a radically new idea can be compared. However, innovations that provide incremental modifications to a conventional pipeline certainly can be evaluated within a conventional context. The more conventional the framework, and the smaller the performance gain, the greater the onus on the researchers to verify that the experimental framework, and thus their conclusions, are valid.
In this paper, we describe an attempt to verify a high-level timing simulator against actual hardware, the Compaq DS-10L workstation. Our goal was to measure and understand the error that researchers (including ourselves) incur by assumptions that we make in our simulation
choose parameters, such as DRAM latencies, in an ad-hoc manner. That variance makes comparing results across papers difficult or impossible. Simulation parameters should be chosen against reference machines, common models, or communal parameter sets, to maximize consistency across research studies. In this study, we used the DS-10L workstation as a source for our parameter choices.
widely effective, and is not merely a fortunate by-product of coincidence, it should be measured across a
range of processor and system organizations. Achieving better reproducibility would help this goal, as stability could be measured across multiple research
groups and simulation environments.
For many of the reasons highlighted above, results currently produced by architecture researchers are rarely used by practitioners. The ideas are frequently re-evaluated with an attempt to reproduce and refine the results in a company's internal environment. Improved accuracy, reproducibility, and consistency would greatly improve the utility of the results we generate for both practitioners and other researchers.
Acknowledgments
Thanks to Joel Emer and Steve Root for answering many questions about the Alpha microarchitecture. We thank Todd Austin for providing both the Alpha ISA semantics file and all of the SimpleScalar wrapper code that we used, Alain Kägi for his comments and insights, as well as Bruce Jacob and Vinod Cuppu for providing their SDRAM code. This work was supported by the NSF CADRE program, grant no. EIA-9975286, and by equipment grants from IBM and Intel.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]