Design and evaluation of compiler algorithms for pre-execution

Authors:

Donald Yeung

ASPLOS X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems

Pages 159 - 170

https://doi.org/10.1145/605397.605415

Published: 01 October 2002

Abstract

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.
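
To make the first two algorithms concrete, here is a minimal Python sketch of slicing followed by prefetch conversion on a toy pointer-chasing loop. It is not the paper's SUIF/Unravel implementation. The `Stmt` record, the hand-written dependence sets, the `LOOP` example, the `prefetch(...)` placeholder, and the functions `backward_slice`, `prefetch_candidates`, and `emit_helper` are all hypothetical names invented for illustration, and the slice is a simple flow-insensitive closure over data and control dependences rather than a full slicer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """One statement of a toy loop (hypothetical IR, not SUIF)."""
    sid: int                        # statement id, in program order
    text: str                       # C-like source text, for display only
    defs: frozenset = frozenset()   # variables written
    uses: frozenset = frozenset()   # variables read
    ctrl: frozenset = frozenset()   # ids of statements this one is control-dependent on
    is_load: bool = False           # True if the statement reads memory


def backward_slice(stmts, target_sid):
    """Ids of statements the target transitively depends on.

    Flow-insensitive: every statement that defines a used variable is kept,
    which is conservative but covers loop-carried dependences.
    """
    by_id = {s.sid: s for s in stmts}
    in_slice, worklist = set(), [target_sid]
    while worklist:
        sid = worklist.pop()
        if sid in in_slice:
            continue
        in_slice.add(sid)
        stmt = by_id[sid]
        worklist.extend(stmt.ctrl)  # control dependences
        worklist.extend(o.sid for o in stmts if o.defs & stmt.uses)  # data dependences
    return in_slice


def prefetch_candidates(stmts, in_slice):
    """Loads in the slice whose loaded value no slice statement reads."""
    kept = [s for s in stmts if s.sid in in_slice]
    live = frozenset().union(*(s.uses for s in kept))
    return {s.sid for s in kept if s.is_load and not (s.defs & live)}


def emit_helper(stmts, in_slice, prefetch):
    """Render the helper-thread code as C-like text lines."""
    lines = []
    for s in stmts:
        if s.sid not in in_slice:
            continue  # sliced away: not needed to compute the target address
        if s.sid in prefetch:
            address = s.text.split("=", 1)[1].strip()
            lines.append(f"prefetch(&({address}));")  # placeholder non-blocking prefetch
        else:
            lines.append(s.text)
    return lines


def fs(*names):
    """Shorthand for building the frozensets used by Stmt."""
    return frozenset(names)


# Hypothetical linked-list loop; statement 3 is assumed to be the cache-missing load.
LOOP = [
    Stmt(1, "p = head", defs=fs("p"), uses=fs("head")),
    Stmt(2, "while (p != NULL)", uses=fs("p")),
    Stmt(3, "v = p->val", defs=fs("v"), uses=fs("p"), ctrl=fs(2), is_load=True),
    Stmt(4, "sum = sum + v * v", defs=fs("sum"), uses=fs("sum", "v"), ctrl=fs(2)),
    Stmt(5, "count = count + 1", defs=fs("count"), uses=fs("count"), ctrl=fs(2)),
    Stmt(6, "p = p->next", defs=fs("p"), uses=fs("p"), ctrl=fs(2), is_load=True),
]

if __name__ == "__main__":
    kept = backward_slice(LOOP, target_sid=3)
    for line in emit_helper(LOOP, kept, prefetch_candidates(LOOP, kept)):
        print(line)
```

On this toy loop the slice keeps the pointer chase (statements 1, 2, and 6) plus the target load, and drops the `sum` and `count` updates. The `p->val` load becomes a prefetch because nothing in the slice reads `v`, while the `p->next` load stays blocking since the helper thread needs its value to continue. Threading scheme selection, the third algorithm, is not modeled here.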

References

[1] M. Annavaram, J. Patel, and E. Davidson. Data Prefetching by Dependence Graph Precomputation. In 28th International Symposium on Computer Architecture, June 2001.

[2] D. Binkley and K. Gallagher. A Survey of Program Slicing. Academic Press, 1996.

[3] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. CS TR 1342, University of Wisconsin-Madison, June 1997.

[4] R. Chappell, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In 26th International Symposium on Computer Architecture, May 1999.

[5] T.-F. Chen and J.-L. Baer. Effective Hardware-Based Data Prefetching for High-Performance Processors. IEEE Transactions on Computers, 44(5):609-623, May 1995.

[6] J. Collins, D. Tullsen, H. Wang, and J. Shen. Dynamic Speculative Precomputation. In 34th International Symposium on Microarchitecture, December 2001.

[7] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In 28th International Symposium on Computer Architecture, June 2001.

[8] R. Cytron. Doacross: Beyond Vectorization for Multiprocessors. In International Conference on Parallel Processing, August 1986.

[9] M. Dubois and Y. Song. Assisted Execution. CENG TR 98-25, University of Southern California, October 1998.

[10] J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-executing Instructions Under a Cache Miss. In International Conference on Supercomputing, July 1997.

[11] S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. Shen. Post-Pass Binary Adaptation for Software-Based Speculative Precomputation. In ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2002.

[12] C.-K. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In 28th International Symposium on Computer Architecture, June 2001.

[13] J. Lyle and D. Wallace. Using the Unravel Program Slicing Tool to Evaluate High Integrity Software. In 10th International Software Quality Week, May 1997.

[14] J. Lyle, D. Wallace, J. Graham, K. Gallagher, J. Poole, and D. Binkley. Unravel: A CASE Tool to Assist Evaluation of High Integrity Software. NISTIR 5691, National Institute of Standards and Technology, August 1995.

[15] D. Madon, E. Sanchez, and S. Monnier. A Study of a Simultaneous Multithreaded Processor Implementation. In EuroPar '99, August 1999.

[16] T. Mowry. Tolerating Latency in Multiprocessors through Compiler-Inserted Prefetching. ACM Transactions on Computer Systems, 16(1):55-92, February 1998.

[17] D. Padua, D. Kuck, and D. Lawrie. High-Speed Multiprocessors and Compilation Techniques. IEEE Transactions on Computers, C-29(9):763-776, September 1980.

[18] D. Padua and M. Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.

[19] A. Roth, A. Moshovos, and G. Sohi. Dependence Based Prefetching for Linked Data Structures. In 8th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[20] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In 7th International Conference on High Performance Computer Architecture, January 2001.

[21] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slip-stream Processors: Improving Both Performance and Fault Tolerance. In 9th International Conference on Architectural Support for Programming Languages and Operating Systems, May 2000.

[22] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In 23rd International Symposium on Computer Architecture, May 1996.

[23] C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In 28th International Symposium on Computer Architecture, June 2001.

[24] C. Zilles and G. Sohi. Understanding the Backward Slices of Performance Degrading Instructions. In 27th International Symposium on Computer Architecture, June 2000.

Design and evaluation of compiler algorithms for pre-execution
1. Software and its engineering
  1. Software notations and tools

Information & Contributors

Information

Published In

ASPLOS X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems

October 2002

318 pages

ISBN:1581135742

DOI:10.1145/605397

Conference Chair: Kourosh Gharachorloo (Compaq Western Research Lab)
Program Chair: David A. Wood

ACM SIGARCH Computer Architecture News, Volume 30, Issue 5
Special Issue: Proceedings of the 10th annual conference on Architectural Support for Programming Languages and Operating Systems
December 2002, 296 pages
ISSN: 0163-5964
DOI: 10.1145/635506

ACM SIGOPS Operating Systems Review, Volume 36, Issue 5
December 2002, 296 pages
ISSN: 0163-5980
DOI: 10.1145/635508

ACM SIGPLAN Notices, Volume 37, Issue 10
October 2002, 296 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/605432

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Article

Conference

ASPLOS02: Tenth International Conference on Architectural Support for Programming Languages and Operating Systems

October 5 - 9, 2002

San Jose, California

Acceptance Rates

ASPLOS X Paper Acceptance Rate: 24 of 175 submissions, 14%

Overall Acceptance Rate: 535 of 2,713 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 92
Total Downloads: 1,074
Downloads (last 12 months): 14
Downloads (last 6 weeks): 4

Reflects downloads up to 27 Jul 2024

Citations

Cited By

Naithani A, Roelandts J, Ainsworth S, Jones T, Eeckhout L (2023). Decoupled Vector Runahead. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 17-31. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614255

Darabi S, Mahani N, Baxishi H, Yousefzadeh-Asl-Miandoab E, Sadrosadati M, Sarbazi-Azad H (2022). NURA. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6:1, pp. 1-27. Online publication date: 28-Feb-2022.
https://dl.acm.org/doi/10.1145/3508036

Kumar R, Alipour M, Black-Schaffer D (2022). Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores. ACM Transactions on Architecture and Code Optimization, 19:2, pp. 1-28. Online publication date: 7-Mar-2022.
https://dl.acm.org/doi/10.1145/3506704

Mehta S, Elsesser G, Greyzck T, Egger B, Smith A (2022). Software pre-execution for irregular memory accesses in the HBM era. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pp. 231-242. Online publication date: 19-Mar-2022.
https://dl.acm.org/doi/10.1145/3497776.3517783

Darabi S, Sadrosadati M, Akbarzadeh N, Lindegger J, Hosseini M, Park J, Gomez-Luna J, Mutlu O, Sarbazi-Azad H (2022). Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. Online publication date: Oct-2022.
https://doi.org/10.1109/MICRO56248.2022.00029

Naithani A, Ainsworth S, Jones T, Eeckhout L (2021). Vector Runahead. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 195-208. Online publication date: Jun-2021.
https://doi.org/10.1109/ISCA52012.2021.00024

Ham T, Aragón J, Martonosi M (2019). Efficient Data Supply for Parallel Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization, 16:2, pp. 1-23. Online publication date: 26-Apr-2019.
https://dl.acm.org/doi/10.1145/3310332

Kondguli S, Huang M, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). Bootstrapping. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 687-700. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304052

Kumar R, Alipour M, Black-Schaffer D (2019). Freeway: Maximizing MLP for Slice-Out-of-Order Execution. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 558-569. Online publication date: Mar-2019.
https://doi.org/10.1109/HPCA.2019.00009

Ainsworth S, Jones T (2018). An Event-Triggered Programmable Prefetcher for Irregular Workloads. ACM SIGPLAN Notices, 53:2, pp. 578-592. Online publication date: 19-Mar-2018.
https://dl.acm.org/doi/10.1145/3296957.3173189
