A super-scalar processor is one that is capable of sustaining an instruction-execution rate of more than one instruction per clock cycle. Maintaining this execution rate is primarily a problem of scheduling processor resources (such as functional units) for high utilization. A number of scheduling algorithms have been published, with wide-ranging claims of performance over the single-instruction issue of a scalar processor. However, a number of these claims are based on idealizations or on special-purpose applications.
This study uses trace-driven simulation to evaluate many different super-scalar hardware organizations. It uses general-purpose benchmark programs executed with a typical RISC instruction set. Highly-optimized versions of the benchmark programs are used, to avoid measuring concurrency that is due to a lack of compiler optimization. However, the compiler performs no optimizations specifically for the super-scalar processor, to provide the fairest measure of super-scalar hardware performance. In contrast to previous studies, this study examines a wide range of cost and performance tradeoffs, rather than focusing on one specific processor organization or scheduling algorithm. Furthermore, the results are not based on idealizations; for example, they include the effects of realistic functional-unit latencies, instruction and data caches, and multi-tasking.
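The trace-driven approach described above can be sketched in miniature: replay a recorded instruction trace against a model of the issue logic and count cycles. The trace format, latencies, and greedy in-order policy below are illustrative assumptions for exposition, not the simulator used in the study.

```python
# Toy trace-driven scheduler: issue up to issue_width instructions per
# cycle, stalling on unresolved register dependences. Latencies and the
# trace format are assumed for illustration.
from collections import namedtuple

Instr = namedtuple("Instr", "op dst srcs")

LATENCY = {"add": 1, "load": 2, "mul": 3}  # assumed functional-unit latencies

def schedule(trace, issue_width=1):
    """Return the cycle count for a greedy in-order schedule of the trace."""
    ready = {}  # register -> cycle its value becomes available
    cycle, issued_this_cycle = 0, 0
    for ins in trace:
        # Earliest start: current cycle, or when all source operands are ready.
        start = max([cycle] + [ready.get(r, 0) for r in ins.srcs])
        if start > cycle or issued_this_cycle == issue_width:
            cycle, issued_this_cycle = max(start, cycle + 1), 0
        ready[ins.dst] = cycle + LATENCY[ins.op]
        issued_this_cycle += 1
    return cycle + 1  # total cycles consumed
```

Comparing cycle counts for `issue_width=1` against wider decoders on the same trace is, in essence, how scalar and super-scalar organizations are contrasted.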
Within this framework, super-scalar performance is limited primarily by instruction-fetch inefficiencies caused by both branch delays and instruction misalignment. Because of this instruction-fetch limitation, it is not worthwhile to explore highly-concurrent execution hardware. Rather, it is more appropriate to explore economical execution hardware that more closely matches the instruction throughput provided by the instruction fetcher. This study examines techniques for reducing the instruction-fetch inefficiencies and explores the resulting hardware organizations.
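One standard technique for reducing the branch-delay component of fetch inefficiency is a two-bit saturating-counter branch predictor. The sketch below is a generic textbook version with an assumed table size and trivial index hashing, not the specific prediction hardware evaluated in the study.

```python
# Two-bit saturating-counter branch predictor. Counters run 0..3;
# values >= 2 predict taken, giving hysteresis so a single mispredicted
# branch does not flip the prediction. Table size is an assumption.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries  # start weakly not-taken
        self.mask = entries - 1     # entries must be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

The two-bit hysteresis matters for loops: the single not-taken branch at loop exit does not cause a misprediction on the next entry to the loop.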
This study concludes that a super-scalar processor can have nearly twice the performance of a scalar processor, but this requires four major hardware features: out-of-order execution, register renaming, branch prediction, and a four-instruction decoder. These features are interdependent, and removing any single feature reduces average performance by 18% or more. However, there are many hardware simplifications that cause only a small performance reduction.
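Of the four features named above, register renaming is perhaps the least self-explanatory: it maps each architectural destination register to a fresh physical register so that write-after-write and write-after-read hazards no longer constrain out-of-order execution. The following is a minimal sketch with assumed register-file sizes, not the renaming organization evaluated in the study.

```python
# Register renaming via a map table and free list. Each new destination
# gets a fresh physical register, so only true (read-after-write)
# dependences remain. Sizes (32 architectural, 64 physical) are assumed.
class Renamer:
    def __init__(self, n_arch=32, n_phys=64):
        self.map = list(range(n_arch))           # arch reg -> current phys reg
        self.free = list(range(n_arch, n_phys))  # unallocated phys regs

    def rename(self, dst, srcs):
        """Return (phys_dst, phys_srcs) for one decoded instruction."""
        phys_srcs = tuple(self.map[s] for s in srcs)  # read current mappings first
        phys_dst = self.free.pop(0)                   # fresh dest kills WAW/WAR hazards
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs
```

Because two writes to the same architectural register receive distinct physical registers, instructions between them that read the old value can execute in any order relative to the second write.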