Automated Microprocessor Stressmark Generation
Ajay M. Joshi, Lieven Eeckhout, Lizy K. John, Ciji Isen
Abstract
Estimating the maximum power and thermal characteristics of a processor is essential for designing its power delivery system, packaging, cooling, and power/thermal management schemes. Typical benchmark suites used in performance evaluation, however, do not stress the processor to its limit, and current practice in industry is to develop artificial benchmarks that are specifically written to generate maximum processor (component) activity. Manually developing and tuning such so-called stressmarks is extremely tedious and time-consuming, and requires an intimate understanding of the processor. A synthetic program that can be tuned to produce a variety of benchmark characteristics would significantly help in addressing this problem by enabling automatic exploration of the large temperature and power design space. This paper demonstrates that, with a suitable choice of only 40 hardware-independent program characteristics related to the instruction mix, instruction-level parallelism, control flow behavior, and memory access patterns, it is possible to generate a synthetic benchmark whose performance relates closely to that of general-purpose and commercial applications. Leveraging this abstract workload modeling approach, we propose StressMaker, a framework that uses machine learning for the automated generation of stressmarks. A comparison with an exhaustive exploration of a large power design space demonstrates that StressMaker is very effective in automatically generating stressmarks in a limited amount of time.
1. Introduction
In recent years, energy, power, power density, thermal hot spots, voltage variation, etc., have emerged as first-class constraints in the design of high-performance microprocessors [5][12][13][14][18][30]. As a result, along with performance, it has become extremely important to measure and analyze power, energy, and temperature related design concerns at all stages of a microprocessor design flow, from early-
stage exploration, microarchitecture definition, and register-transfer-level (RTL) description, to circuit-level implementation. In order to design a power- and temperature-aware microprocessor, it is not only important to evaluate the design's power, energy, and thermal characteristics when executing a typical workload, but also to evaluate its maximum power and operating temperature characteristics. In other words, it is also important to analyze the impact of application code sequences that could stress the processor's power and thermal characteristics, even though these code sequences are infrequent and may only occur in short bursts [13][28][34]. Worst-case characterization of maximum power dissipation and operating temperature is essential for evaluating dynamic power and temperature management strategies. Also, large instantaneous and localized power dissipation can cause overheating (hotspots) that can reduce the lifetime of a chip, degrade circuit performance, introduce timing errors, or even result in chip failure [30]. Estimating the maximum power dissipation and operating temperature of a processor is also vital for designing the thermal package (heat sink, cooling, etc.) for the chip and the power supply for the system [34]. As such, characterizing the maximum thermal characteristics and power limits is greatly needed by microarchitects, circuit designers, and electrical engineers.
Industry-standard benchmarks, however, do not stress a processor to its limit, and are not particularly useful for characterizing the maximum power and thermal requirements of a design. Benchmarking committees such as the Standard Performance Evaluation Corporation (SPEC) and the EDN Embedded Microprocessor Benchmark Consortium (EEMBC) have recognized the need for power- and energy-oriented benchmarks, and are in the process of developing such benchmark suites [21][32]. However, these benchmarks too will only represent typical power consumption and not the worst-case maximum power dissipation.
Due to the lack of standardized stress benchmarks, current practice in industry is to develop hand-coded synthetic max-power benchmarks, or stressmarks, that are specifically written to generate
maximum power consumption for a particular processor [2][13][28][34]. Developing stressmarks is both time-consuming and tedious. For example, a max-power stressmark has to generate maximum and simultaneous activity in each of the processor components; similarly, a thermal stressmark not only has to deal with power consumption but also with lateral coupling among microarchitecture blocks, the role of the heat sink, etc. [30]. This requires very detailed knowledge of the processor design [13], and given the complexity of modern-day high-performance superscalar microprocessors, writing and tuning a stressmark can take up to several weeks [2]. In addition, given that a stressmark is tied to a specific processor, exploring multiple processor architectures in terms of their maximum power consumption and/or thermal characteristics quickly becomes infeasible and may stretch the time-to-market.
In this paper we propose StressMaker, a framework for the automated generation of stressmarks. The key enabler of StressMaker is the ability to generate a synthetic benchmark from an abstract workload model. StressMaker explores the workload space by turning knobs in the abstract workload model, and uses machine learning to drive the search for stressmarks. In this paper, we make three major contributions.
We identify a limited set of hardware-independent program characteristics that collectively represent the abstract workload model. The key program characteristics relate to the instruction mix, instruction-level parallelism, control flow behavior, and memory access patterns. When used to generate a synthetic benchmark, the abstract workload model represents real-world workload behavior. Our experimental results using the SPEC CPU2000 benchmarks and three commercial workloads (SPECjbb2005, DBT2, and DBMS) report an average performance and power deviation of 10% and 7%, respectively, when comparing a real workload against its synthetic clone.
We propose StressMaker, a novel approach to automatically generate synthetic stressmarks for microprocessor design studies. StressMaker uses machine learning to explore the workload space by varying the program characteristics in the abstract workload model in search of stressmarks. The important advantage of StressMaker, next to being fully automated, is that it enables generating stressmarks for cases where manually writing a stressmark is difficult because of the complex hardware/software interactions in today's high-performance microprocessors. We demonstrate the
feasibility and value of StressMaker by designing max-power, max-temperature, and dI/dt stressmarks. These stressmarks stress the processor much more than typical workloads, are close to optimal compared to an exhaustive search of the workload behavior space, and could serve as a starting point for detailed power/thermal analysis.
We develop a framework, BenchMaker, which is parameterized to generate synthetic benchmarks that can be executed on real hardware, execution-driven simulators, and RTL models. StressMaker is just one of the many useful applications of BenchMaker. The parameterized nature of BenchMaker makes it an invaluable tool for exploring the workload space, and for gaining insight into how performance is affected by high-level program characteristics. The computer architecture research community has recognized the need for developing parameterized workloads [31], and we believe that BenchMaker is a significant step towards achieving that goal.
histogram showing the percentage of memory access instructions with stride values of 0, 1, 2, etc.
Branch Transition Rate. In order to model varying levels of control flow predictability, we use an attribute called the branch transition rate [15]. The transition rate of a static branch is defined as the number of times it switches between the taken and not-taken directions as it executes, divided by the total number of times the branch is executed. By definition, branches with very low transition rates are strongly biased towards either taken or not-taken, and are easy to predict. Branches that transition between taken and not-taken sequences at a moderate rate, in contrast, are relatively more difficult to predict.
To summarize, the 40 workload characteristics constituting the abstract workload model are described in Table 1. These workload characteristics cover a wide range of program properties that affect overall workload behavior. If needed, the abstract workload model can be enhanced to model additional characteristics such as operand data values, Hamming distances between consecutive instruction opcodes, etc.
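To make the definition above concrete, the small sketch below computes the transition rate of one static branch from an observed sequence of taken/not-taken outcomes; the function name and the array-of-outcomes representation are our own illustration rather than part of BenchMaker. For example, a branch that repeatedly executes the pattern taken, taken, not-taken, not-taken has a transition rate of 50%.

```c
#include <stddef.h>

/* Transition rate of one static branch: the number of taken/not-taken
 * direction switches divided by the number of times the branch executes.
 * `outcomes` holds the observed directions (1 = taken, 0 = not taken).   */
double branch_transition_rate(const int *outcomes, size_t n)
{
    if (n < 2)
        return 0.0;                 /* too few executions to transition */

    size_t transitions = 0;
    for (size_t i = 1; i < n; i++)
        if (outcomes[i] != outcomes[i - 1])
            transitions++;

    return (double)transitions / (double)n;
}
```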
Table 1. Microarchitecture-independent characteristics constituting an abstract workload model.

Category | No. | Characteristic
insn mix | 8 | fraction integer short-latency insns; fraction integer long-latency insns; fraction fp short-latency insns; fraction fp long-latency insns; fraction integer loads; fraction integer stores; fraction fp loads; fraction fp stores
ILP | 8 | 8 probabilities constituting the register dependency distance distribution: dependency distance equal to 1 insn (insn is dependent on the previous insn in the dynamic insn stream), smaller than 2, 4, 6, 8, 16, 32, and greater than 32 insns
data footprint | 1 | data footprint
data stream stride distribution | 10 | distribution of local stride values organized in 10 buckets
instruction footprint | 1 | instruction footprint
branch predictability | 10 | distribution of branch transition rate organized in 10 buckets
basic block size | 2 | avg and stdev of the dynamic basic block size
2.2.1. Generating Program Spine. A normal distribution based on the average basic block size and its standard deviation is used to generate a linear chain of basic blocks. This linear chain of basic blocks forms the spine of the synthetic benchmark program. We use the instruction footprint of the program to decide on the length of the spine. After the spine has been instantiated, each basic block is populated based on the instruction mix characteristics, and each instruction operand is assigned a dependency distance; this is done using random number generation over the cumulative dependency distance distribution.
2.2.2. Modeling Memory Access Patterns. Each memory access instruction in the synthetic benchmark is assigned a stride value from the stride distribution function. A load or store instruction's memory access pattern is modeled as a bounded stream of circular references, i.e., each memory operation walks through an array using the stride value assigned to it and then restarts from the first element of the array. The length of each array is simply the ratio of the data footprint of the program to the total number of static load or store instructions in the program.
2.2.3. Modeling Branch Predictability. Each static branch in the spine of the program is assigned a transition rate based on the specified transition rate distribution. We achieve this by configuring each basic block in the synthetic stream of instructions to alternate between the taken and not-taken directions, such that the branch exhibits the desired transition rate at run time. A counter is incremented on each iteration, and a modulo operation is used to decide whether the branch is taken or not taken.
2.2.4. Register Assignment. In this step, the dependency distances that were assigned to each instruction are used to assign register names. The number of registers used to satisfy the dependency distances is kept small (typically around 10) to prevent the compiler from generating spill code.
2.2.5. Code Generation. During the code generation phase, the instructions are emitted together with a header written in C, which contains initialization code that allocates memory using the malloc library call for modeling the memory access patterns. Each instruction is then emitted as assembly code using asm statements embedded in C code. The instructions are targeted towards a specific ISA, Alpha in our case, but the code generator can be modified to emit instructions for any ISA of interest. The volatile
directive is used to prevent the compiler from reordering the sequence of instructions and thereby changing the program characteristics of the synthetic benchmark. The entire program spine is executed in a loop whose number of iterations can be configured to control the dynamic instruction count of the program. This value is tuned to ensure that the synthetic benchmark's execution characteristics converge to a stable value. Based on our experiments, for the workload characteristics used in this study, the synthetic benchmark execution converges to steady state within at most 10 million dynamic instructions.
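To make steps 2.2.2, 2.2.3, and 2.2.5 concrete, the sketch below shows, in plain C, the behavior of one emitted memory instruction and one emitted branch: a bounded circular walk over a malloc'ed array with an assigned stride, and a counter-plus-modulo test that makes the branch alternate at a chosen transition rate. The array length, stride, and modulus are illustrative placeholders, and BenchMaker itself would emit these operations as volatile asm statements for the target ISA rather than as the C shown here.

```c
#include <stdlib.h>

/* Illustrative parameters (placeholders, not BenchMaker output):               */
#define ARRAY_LEN  4096L  /* data footprint divided by number of static memory ops */
#define STRIDE     2L     /* stride assigned to this load from the profile         */
#define BRANCH_MOD 5L     /* modulus giving 2 direction switches per 5 executions, */
                          /* i.e., a 40% transition rate for this branch           */

int main(void)
{
    long *arr = calloc(ARRAY_LEN, sizeof(long));  /* header: allocate the data array */
    if (!arr) return 1;

    volatile long sink = 0;   /* volatile keeps the memory and ALU work alive */
    long idx = 0;

    for (long iter = 0; iter < 10000000L; iter++) {   /* tuned dynamic insn count */
        /* Bounded stream of circular references: walk the array with the
         * assigned stride and restart from the first element at the end.   */
        sink += arr[idx];
        idx += STRIDE;
        if (idx >= ARRAY_LEN)
            idx = 0;

        /* Branch whose direction is decided by a counter and a modulo
         * operation, so it exhibits the assigned transition rate.          */
        if (iter % BRANCH_MOD < 2)
            sink += 1;    /* taken path     */
        else
            sink -= 1;    /* not-taken path */
    }

    free(arr);
    return (int)(sink & 0xff);
}
```

The ARRAY_LEN comment mirrors the array-length rule above (data footprint divided by the number of static memory operations); the constants themselves are arbitrary example values.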
The workload space built up by the abstract workload model is extremely large, and as a consequence it is impossible to evaluate every design
point. Therefore, we use a genetic algorithm to automatically search and prune the workload space and converge on a set of workload attributes that maximize an objective function of interest, such as power or temperature. The goal of the genetic algorithm is to intelligently search the workload space by varying the workload characteristics in the abstract workload description, and to optimize those characteristics towards a stressmark.
The genetic search initially selects a random set of design points, called a generation; these design points are randomly chosen abstract workload configurations. The design points are subsequently evaluated according to the objective function, also called the fitness function, e.g., maximum average power or maximum temperature; evaluating the fitness function of a design point requires simulating the corresponding synthetic benchmark. A new population, the offspring, which is a subset of these design points, is probabilistically selected by weighting the design points' fitness values, i.e., a fitter design point is more likely to be selected. Selection alone cannot introduce new design points into the search space; therefore, mutation and crossover are performed to build the offspring generation. Crossover is performed, with probability pcross, by randomly exchanging parts of two selected design points from the current generation. The mutation operator prevents premature convergence to local optima by randomly altering parts of a design point with a small probability pmut. This generational process continues until a specified termination condition has been reached. In our experiments, we specify the termination condition as the point at which there is little or no improvement in the objective function across successive generations. We run the genetic search with pcross and pmut set to 0.95 and 0.02, respectively. The end result of the genetic algorithm is the abstract workload configuration whose synthetic benchmark stresses the objective function the most; this is the stressmark.
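A minimal sketch of this genetic search loop is shown below, with the crossover and mutation probabilities quoted above. Representing an abstract workload configuration as a fixed-length vector of 40 knob settings, running a fixed number of generations instead of a convergence test, and the evaluate() placeholder (which in StressMaker would generate the synthetic benchmark and simulate it to obtain the objective, e.g. average power) are our simplifications, not the actual StressMaker implementation.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_KNOBS   40     /* characteristics in the abstract workload model    */
#define POP_SIZE    25     /* illustrative generation size                      */
#define GENERATIONS 16     /* fixed here; StressMaker stops when the objective  */
                           /* shows little or no improvement across generations */
#define P_CROSS     0.95
#define P_MUT       0.02

typedef struct { double knob[NUM_KNOBS]; double fitness; } Config;

/* Placeholder fitness: in StressMaker this step generates the synthetic
 * benchmark for the configuration and simulates it to obtain the objective. */
static double evaluate(const Config *c)
{
    double s = 0.0;
    for (int k = 0; k < NUM_KNOBS; k++) s += c->knob[k];
    return s;
}

static double frand(void) { return rand() / (RAND_MAX + 1.0); }

/* Roulette-wheel selection: fitter configurations are more likely to be picked. */
static const Config *select_parent(const Config *pop, double fitness_sum)
{
    double r = frand() * fitness_sum;
    for (int i = 0; i < POP_SIZE; i++) {
        r -= pop[i].fitness;
        if (r <= 0.0) return &pop[i];
    }
    return &pop[POP_SIZE - 1];
}

void genetic_search(Config *best)
{
    static Config pop[POP_SIZE], next[POP_SIZE];
    best->fitness = -1.0;

    for (int i = 0; i < POP_SIZE; i++)            /* random initial generation */
        for (int k = 0; k < NUM_KNOBS; k++)
            pop[i].knob[k] = frand();

    for (int g = 0; g < GENERATIONS; g++) {
        double sum = 0.0;
        for (int i = 0; i < POP_SIZE; i++) {      /* evaluate the fitness function */
            pop[i].fitness = evaluate(&pop[i]);
            sum += pop[i].fitness;
            if (pop[i].fitness > best->fitness) *best = pop[i];
        }
        for (int i = 0; i < POP_SIZE; i++) {      /* build the offspring generation */
            const Config *a = select_parent(pop, sum);
            const Config *b = select_parent(pop, sum);
            next[i] = *a;
            if (frand() < P_CROSS) {              /* single-point crossover */
                int cut = rand() % NUM_KNOBS;
                memcpy(next[i].knob + cut, b->knob + cut,
                       (NUM_KNOBS - cut) * sizeof(double));
            }
            for (int k = 0; k < NUM_KNOBS; k++)   /* low-probability mutation */
                if (frand() < P_MUT)
                    next[i].knob[k] = frand();
        }
        memcpy(pop, next, sizeof(pop));
    }
}
```

Roulette-wheel selection matches the fitness-weighted, probabilistic selection described above, and keeping a copy of the best configuration seen so far corresponds to reporting the stressmark at the end of the search.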
4. Experimental Setup

4.1. Simulation Infrastructure

For evaluating BenchMaker, we use the sim-alpha simulator that has been validated against the superscalar out-of-order Alpha 21264 processor [9]. For our StressMaker experiments we use the sim-outorder simulator from the SimpleScalar Toolset v3.0. In order to estimate the power characteristics of the benchmarks, we use an architectural power modeling tool, namely Wattch v1.02 [5], which was shown to provide good relative accuracy, and we consider an aggressive clock gating mechanism (cc3). We use the hotfloorplanner tool to develop a layout for the sim-outorder pipeline, and use the HotSpot v3.1 tool to estimate the steady-state operating temperature based on average power [30]. The stressmarks are compiled using gcc, and are simulated for 10 million dynamic instructions. This small dynamic instruction count serves the needs of this paper; in case longer-running applications need to be considered, e.g., when studying the effect of temperature on (leakage) power consumption, the stressmarks can also be executed in a loop for a longer time. It should also be noted that StressMaker is agnostic to the underlying simulation model, and can easily be ported to more accurate industry-standard simulators and/or power/temperature models.

4.2. Benchmarks

In order to evaluate the parameterized workload synthesis framework, we consider the SPEC CPU2000 benchmarks, with one representative 100M-instruction simulation point per benchmark selected using SimPoint [29]. We also use traces from three commercial workloads: SPECjbb2005 (a Java server workload), DBT2 (an OLTP workload), and DBMS (a database management system workload). The commercial workload traces represent 30 million instructions once steady state has been reached (all warehouses have been loaded), and were generated using the Simics full-system simulator.
Figure 3: Instructions-Per-Cycle (IPC) for the original benchmarks and their synthetic clones (SPEC CPU2000 integer, SPECjbb2005, DBT2, and DBMS).
Figure 4: Energy-Per-Instruction (EPI) for the original benchmarks and their synthetic clones.
5. Evaluation of BenchMaker
In this section we evaluate BenchMaker's accuracy by using it to generate synthetic benchmark versions of general-purpose (SPEC CPU2000 integer) and commercial (SPECjbb2005, DBT2, and DBMS) workloads; we obtain similar results for the SPEC CPU2000 floating-point benchmarks, and refer to [19] for a detailed analysis. We measure the program characteristics of the SPEC CPU2000 and commercial workloads and feed this abstract workload model to the BenchMaker framework to generate a synthetic clone benchmark with a 10M dynamic instruction count; we then compare the performance/power characteristics of the synthetic benchmark against the original workload.
Figure 3 evaluates the accuracy of BenchMaker for estimating the pipeline instruction throughput measured in Instructions-Per-Cycle (IPC). We observe that the synthetic benchmark performance numbers track the real benchmark performance numbers very well. The average IPC prediction error is 10.9%, and the maximum error is observed for mcf (19.9%). Figure 4 shows similar results for the Energy-Per-Instruction (EPI) metric. The average error in estimating EPI from the synthetic benchmark is 7.5%, with a maximum error of 13.1% for mcf.
Parameterization of the workload metrics makes it possible to succinctly describe an application's behavior using an abstract model with only a limited number (40) of fundamental coarse-grain program characteristics. This is in contrast to prior work in synthetic benchmark generation, which requires several thousands of fine-grain program characteristics [1][20]. BenchMaker trades accuracy (10.9% average IPC error compared to less than 6% in our prior work [20]) for the flexibility to easily alter program characteristics and workload behavior.
6. Evaluation of StressMaker
We now evaluate StressMaker by generating various flavors of power and thermal stressmarks. Specifically, we apply StressMaker to automatically construct stressmarks for characterizing maximum average and single-cycle power, dI/dt stressmarks, thermal hotspots, and thermal stress patterns. We also evaluate the efficacy of StressMaker by comparing it against an exhaustive exploration of the workload space.
This maximum sustainable power search process results in a stressmark with a maximum average sustainable power of 48.8W per cycle. Figure 6 shows the results of an exhaustive search across all 250K design points. These results show that the power of the stressmark is within 1% of the maximum power from the exhaustive search, i.e., the stressmark obtained through genetic search achieves 99% of the maximum power observed from an exhaustive stressmark enumeration. In other words, StressMaker is highly effective in finding a stressmark, and it also yields a three orders of magnitude reduction in effort compared to exhaustive searching (225 versus 250K simulations). Automatically generating the stressmark, using a cross compiler for Alpha and the sim-outorder performance model on a 2GHz Intel Pentium Xeon processor, takes 2.5 hours. Therefore, we believe StressMaker is an invaluable approach for an expert, because it can quickly narrow down a design space and provide a stressmark that can be hand-tuned to exercise worst-case behavior.
Figure 5: Convergence of StressMaker: the maximum average power consumption for the stressmark across the multiple generations of the genetic search algorithm.
Figure 6: Power consumption for all 250K points in the workload design space built up from Table 2.
Figure 7: Comparison of the power dissipation in the various microarchitecture units using the stressmark versus the maximum power consumption observed across all SPEC CPU2000 and commercial benchmarks.
Figure 7 shows the maximum power dissipation of the different microarchitecture units using the stressmark, along with the maximum power dissipation of each unit across all SPEC CPU2000 and commercial benchmarks; the benchmark labels in Figure 7 state which benchmark achieves the highest maximum power per microarchitecture unit, e.g., art achieves the highest power consumption (18W) in the issue logic across all benchmarks, whereas the stressmark consumes 28W. The stressmark exercises all the microarchitecture units more than any of these benchmarks. In particular, the stressmark causes significantly higher power dissipation in the instruction window, L1 data cache, clock tree, and issue logic.
The workload characteristics of the max-power stressmark are: (1) an instruction mix of 40% short-latency floating-point operations, 40% short-latency
integer operations, 10% branch instructions, and 10% memory operations; (2) register dependency distances that are mostly greater than 32 instructions, i.e., a very high level of ILP, although there are still some dependences to fill up the issue queue; (3) 80% of branches having a transition rate below 10%, with the remaining 20% having a transition rate between 10% and 20% (recall that branches with very low transition rates are highly predictable); (4) data strides with 95% of the references to the same cache line and 5% to the next cache line; (5) an instruction footprint of 1800 instructions; and (6) a data footprint of 100 KB. These workload characteristics suggest that the stressmark creates a scenario where the control flow of the program is highly predictable and hence there are no pipeline flushes, the functional units are kept busy, the issue logic does not stall because the dependency distances are large, and the locality of the program is such that the data and instruction cache hit rates are extremely high. The characteristics of this stressmark are similar to the hand-crafted tests [2][13] that are tuned to maximize processor activity by fully and continuously utilizing the instruction issue logic, all of the execution units, and the major buses. However, the advantage over current practice in building hand-coded max-power stressmarks is that StressMaker provides an automatic process, resulting in substantial savings in time and effort. The automated search through a large workload space also increases confidence in the results.
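As a rough, hand-written illustration of what these characteristics imply at the code level, and not the generated stressmark itself, a loop in this spirit mixes independent short-latency integer and floating-point operations with a cache-resident load and a highly biased (and hence low-transition-rate) branch:

```c
/* Hand-written illustration of the kind of loop the max-power stressmark
 * characteristics describe: independent short-latency integer and FP work,
 * a load that stays within one or two cache lines, and an easily predicted
 * branch. This is NOT the generated stressmark, only a flavor of it.       */
double stress_loop(long iters)
{
    double f0 = 1.0, f1 = 2.0, f2 = 3.0, f3 = 4.0;   /* independent FP chains  */
    long   i0 = 1,   i1 = 2,   i2 = 3,   i3 = 4;     /* independent int chains */
    static long buf[16];                             /* 128 bytes: two cache lines */
    volatile long sink = 0;

    for (long n = 0; n < iters; n++) {
        f0 += 1.5; f1 *= 1.0001; f2 += 2.5; f3 *= 1.0002;   /* short-latency FP  */
        i0 += 3;   i1 ^= n;      i2 += 7;   i3 ^= i0;       /* short-latency int */
        sink += buf[n & 15];                                /* always-hitting load */
        if ((n & 31) == 0)                                  /* highly biased branch */
            sink++;
    }
    return f0 + f1 + f2 + f3 + (double)(i0 + i1 + i2 + i3) + (double)sink;
}
```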
percentage of long-latency operations), and data cache misses (large footprint and strides). It is therefore not surprising that the average power consumption of this stressmark is only 32W. However, the overlapping of these various events creates a condition in which all units are simultaneously busy within a single cycle. Interestingly, the stressmark that maximizes the average sustainable power (Section 6.1) has a maximum single-cycle power of only 59.5W, and cannot be used to estimate maximum single-cycle power. Also, the maximum single-cycle power requirement of a SPEC CPU benchmark, mgrid, is only 57W. This demonstrates that the sequence of instructions resulting in maximum single-cycle power is very timing sensitive: even benchmarks that run for billions of cycles may not probabilistically hit upon this condition. To further validate StressMaker, we computed the maximum instantaneous power consumption assuming all units are 100% active by summing the power consumption of all the individual microarchitecture units; this theoretical maximum is 85W. The 72W attained by the single-cycle max-power stressmark thus achieves almost 85% of the maximum theoretical power consumption.
Figure 9: Comparison of the hotspots created by the stressmarks versus the SPEC CPU2000 and commercial benchmarks.
We apply StressMaker to generate stressmarks that can create hotspots across different microarchitecture units on the floorplan. Figure 9 compares hotspots generated by StressMaker with the hotspots generated by the SPEC CPU2000 and commercial benchmarks. As compared to these benchmarks, the stressmarks are very effective in creating hotspots in the issue, register file, execution, and register remap units.
Table 4 reports, for each pair of microarchitecture units, the temperature differential across the two units achieved by the automatically generated stressmark, together with the stressmark's key behavioral characteristics.

Table 4. Developing thermal stress patterns using StressMaker.

Pair of Units | T Diff (C) | Workload characteristics of the stressmark
L2 & IFetch | 44.6 | (1) Small data footprint and short local strides that result in high L1 D-cache hit rates and no stress on the L2, and (2) 80% short-latency insns with high ILP and highly predictable branches, keeping fetch busy without pipeline stalls.
L2 & Register Remap | 48.4 | (1) 40% memory operations, a large data footprint, and long local strides that miss in the L1 and stress the L2, and (2) 40% memory operations with very large dependency distances that put minimal stress on the register remap unit.
L2 & Exec | 44.4 | (1) No memory operations, so no stress on the L2, and (2) 40% short-latency integer operations and 40% short-latency floating-point operations that stress the execution units.
Branch Predictor & L2 | 41.3 | (1) 80% branches with transition rates equally distributed across all buckets (0-10%, ..., 90-100%), a mix of difficult and easy to predict branches that stress the branch predictor, and (2) no memory operations, hence no stress on the L2.
Issue & LSQ | 61.0 | (1) 80% memory operations with a small data footprint and short local strides, stressing the load/store queue, and (2) limited activity in the issue queue.
7. Related Work
Characterizing Power Consumption of CMOS Circuits. A large body of work in the VLSI community develops techniques for estimating the power dissipation of a CMOS circuit. The primary approach is to use statistical methods and heuristics to develop a test vector pattern that causes maximum switching activity in the circuit [8][16][22][24][27][28][33]. Although the objective of this paper is the same, there are two key differences with our work. First, our technique aims at developing an assembly test program (as opposed to a test vector) that can be used for maximum power estimation at the microarchitecture level. Second, developing stressmarks provides insight into the interaction between workload behavior and power/thermal stress, which is not possible with a bit vector.
Manually Developed Stressmarks. [11][12][13][34] refer to hand-crafted synthetic test cases developed in industry that have been used for estimating the maximum power dissipation of a microprocessor. The Alpha Toast and Thumper hand-crafted stressmarks [11] stressed total power and dI/dt, respectively. In [23], stress benchmarks have been developed to generate thermal stress for studying sensor placement on microprocessors.
Tests for Performance & Functional Validation. Automatic test case synthesis for functional verification of microprocessors [3] has been proposed, and there has been prior work on hand-crafting microbenchmarks for performance validation [4][9].
Statistical Simulation and Benchmark Synthesis. The primary objective of prior work in statistical simulation [10][25][26] and workload synthesis [1][17][20] is to reduce simulation time by cloning the performance of a program in a synthetic trace or benchmark, respectively. The key idea of these techniques is to capture the behavioral characteristics of a program execution in a statistical profile, and to generate a synthetic trace or benchmark that reproduces the performance of the program. In contrast to this prior work, BenchMaker generates a synthetic benchmark from an abstract workload model consisting of a limited number of program characteristics. This enables exploring the workload space in search of stressmarks in the StressMaker framework.
8. Conclusions
Characterizing the maximum power dissipation and thermal behavior of a microarchitecture is an important problem in industry. However, due to the complexity of modern microprocessors, and the need to construct synthetic test cases for various complex power and temperature phenomena, it is extremely tedious to manually develop and tune stressmarks for different stress criteria and microarchitectures. In this paper, we developed BenchMaker, a framework for constructing parameterized synthetic benchmarks from an abstract workload model. One of the key results of this paper is that it is possible to characterize a workload with a limited number of microarchitecture-independent program characteristics and still maintain good accuracy with respect to real workloads. We subsequently leveraged BenchMaker to propose a novel approach for automating the development of stressmarks. StressMaker is a stressmark generation framework that uses BenchMaker and machine learning to automatically synthesize a stressmark from fundamental program characteristics by exploring the workload design space. We showed that StressMaker is very effective (within 1% of an exhaustive search) in constructing stress benchmarks for measuring maximum power dissipation, and we provided case studies in which we constructed stressmarks for maximum average and single-cycle power
consumption, dI/dt stressmarks, temperature hotspot stressmarks, and thermal stress patterns. We believe StressMaker is a promising first step towards the automated generation of stressmarks. As part of our future work, we plan to evaluate StressMaker in an industrial environment and compare the stressmarks it generates against manually developed stressmarks. We will also continue fine-tuning the abstract workload model in order to capture additional workload characteristics, such as bit toggling in data values and instruction opcodes, as well as interactions between co-executing threads and programs in multi-threaded and multi-core processors.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable feedback. Ajay Joshi was supported by an IBM Fellowship. Lieven Eeckhout is supported by a Postdoctoral Fellowship with the Fund for Scientific Research in Flanders (Belgium). This work is also supported in part through the NSF award numbers 0429806 and 0702694, an IBM Faculty Partnership Award, the UGent-BOF project 01J14407, the FWO project G.0255.08, and HiPEAC.
References
[1] R. Bell Jr. and L. John. Improved Automatic Test Case Synthesis for Performance Model Validation. In ICS, 2005.
[2] Personal communication with Aparajita Bhattacharya (Senior Design Engineer) and David Williamson (Consulting Engineer), ARM Inc.
[3] P. Bose. Performance Test Case Generation for Microprocessors. In the IEEE VLSI Test Symposium, 1998.
[4] P. Bose and J. Abraham. Performance and Functional Verification of Microprocessors. In the IEEE VLSI Design Conference, 2000.
[5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architecture-Level Power Analysis and Optimization. In ISCA, 2000.
[6] D. Brooks and M. Martonosi. Dynamic Thermal Management for High-Performance Microprocessors. In HPCA, 2001.
[7] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. University of Wisconsin-Madison Tech Report #1342, 1997.
[8] T. Chou and K. Roy. Accurate Power Estimation of CMOS Sequential Circuits. IEEE Transactions on VLSI Systems, 1996.
[9] R. Desikan, D. Burger, and S. Keckler. Measuring Experimental Error in Microprocessor Simulation. In ISCA, 2001.
[10] L. Eeckhout and K. De Bosschere. Hybrid Analytical-Statistical Modeling for Efficiently Exploring Architecture and Workload Design Spaces. In PACT, 2001.
[11] Personal communication with Joel Emer, Intel, on the Alpha Toast (max power) and Thumper (dI/dt) stress tools.
[12] W. Felter and T. Keller. Power Measurement on the Apple Power Mac G5. IBM Tech Report RC23276, 2004.
[13] M. Gowan, L. Biro, and D. Jackson. Power Considerations in the Design of the Alpha 21264 Microprocessor. In DAC, 1998.
[14] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the Impact of Increasing Microprocessor Power Consumption. Intel Technology Journal, Q1 2001.
[15] M. Haungs, P. Sallee, and M. Farrens. Branch Transition Rate: A New Metric for Improved Branch Classification Analysis. In HPCA, 2000.
[16] M. Hsiao, E. Rudnick, and J. Patel. Peak Power Estimation of VLSI Circuits: New Peak Power Measures. IEEE Transactions on VLSI Systems, 2000.
[17] C. Hsieh and M. Pedram. Microprocessor Power Estimation Using Profile-Driven Program Synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1998.
[18] R. Joseph, D. Brooks, and M. Martonosi. Control Techniques to Eliminate Voltage Emergencies in High Performance Processors. In HPCA, 2003.
[19] A. Joshi. Constructing Adaptable and Scalable Synthetic Benchmarks for Microprocessor Performance Evaluation. PhD thesis, The University of Texas at Austin, 2007.
[20] A. Joshi, L. Eeckhout, R. H. Bell Jr., and L. K. John. Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks. In IISWC, 2006.
[21] D. Kanter. EEMBC Energizes Benchmarks. Microprocessor Report, July 2006.
[22] C. Lim, W. Daasch, and G. Cai. A Thermal-Aware Superscalar Microprocessor. In ISQED, 2002.
[23] K. Lee, K. Skadron, and W. Huang. Analytical Model for Sensor Placement on Microprocessors. In ICCD, 2005.
[24] F. Najm, S. Goel, and I. Hajj. Power Estimation in Sequential Circuits. In DAC, 1995.
[25] S. Nussbaum and J. E. Smith. Modeling Superscalar Processors via Statistical Simulation. In PACT, 2001.
[26] M. Oskin, F. Chong, and M. Farrens. HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Design. In ISCA, 2000.
[27] Q. Qiu, Q. Wu, and M. Pedram. Maximum Power Estimation Using the Limiting Distributions of Extreme Order Statistics. In DAC, 1998.
[28] S. Rajgopal. Challenges in Low-Power Microprocessor Design. In VLSI Design, 1996.
[29] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically Characterizing Large Scale Program Behavior. In ASPLOS, 2002.
[30] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan. Temperature-Aware Microarchitecture. In ISCA, 2003.
[31] K. Skadron, M. Martonosi, D. August, M. Hill, D. Lilja, and V. Pai. Challenges in Computer Architecture Evaluation. IEEE Computer, 2003.
[32] http://www.spec.org/specpower/
[33] C. Tsui, J. Monteiro, M. Pedram, A. Despain, and B. Lin. Power Estimation Methods for Sequential Logic Circuits. IEEE Transactions on VLSI Systems, 1995.
[34] R. Vishwanath, V. Wakharkar, A. Watwe, and V. Lebonheur. Thermal Performance Challenges from Silicon to Systems. Intel Technology Journal, 2000.