article

A study of source-level compiler algorithms for automatic construction of pre-execution code

Authors:

Dongkeun Kim,

Donald YeungAuthors Info & Claims

ACM Transactions on Computer Systems (TOCS), Volume 22, Issue 3

Pages 326 - 379

https://doi.org/10.1145/1012268.1012270

Published: 01 August 2004 Publication History

Get Access

Abstract

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts).We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%. For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.

References

[1]

Abraham, S. G., Sugumar, R. A., Rau, B. R., and Gupta, R. 1993. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture (Austin, Tex.). ACM, New York, 139--152.

Abstract

References

Cited By

Index Terms

Recommendations

Speculative pre-execution assisted by compiler (SPEAR)

An evaluation of speculative instruction execution on simultaneous multithreaded processors

Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations