A study of source-level compiler algorithms for automatic construction of pre-execution code

Published: 01 August 2004

Abstract

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts).
We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%.
For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.
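
The three code-extraction algorithms named in the abstract can be illustrated with a small hand-written sketch. It is not output of the compilers studied in the article. The data layout (`struct node`, `struct record`), the function names, and the returned prefetch count are all hypothetical, and `__builtin_prefetch` is assumed to be available as the GCC/Clang builtin. Launching the helper threads in spare hardware contexts is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical layout: each list node points at a separately allocated
   record, so reading rec->value is assumed to miss in the cache often. */
struct record { int value; };
struct node   { struct node *next; struct record *rec; };

/* Main computation: walks the list in every bucket and does the real work. */
long sum_buckets(struct node *const *heads, size_t nbuckets)
{
    long total = 0;
    for (size_t b = 0; b < nbuckets; b++)
        for (const struct node *p = heads[b]; p != NULL; p = p->next)
            total += 3L * p->rec->value + 1;   /* stand-in for real work */
    return total;
}

/* Pre-execution code for helper thread `tid` out of `nthreads`
   (nthreads must be at least 1). Returns the number of prefetches
   issued; the count exists only to make the sketch observable. */
size_t preexec_buckets(struct node *const *heads, size_t nbuckets,
                       size_t tid, size_t nthreads)
{
    size_t issued = 0;

    /* Loop parallelization: each helper takes every nthreads-th outer
       iteration, so several pointer chases can be in flight at once. */
    for (size_t b = tid; b < nbuckets; b += nthreads) {
        /* Slicing: only the address-generating pointer chase is kept.
           The arithmetic on rec->value is dropped, and p->next remains
           a blocking load because its value is needed to continue. */
        for (const struct node *p = heads[b]; p != NULL; p = p->next) {
            /* Prefetch conversion: the value of rec->value is never
               used here, so the load becomes a non-blocking prefetch. */
            __builtin_prefetch(&p->rec->value, 0, 3);
            issued++;
        }
    }
    return issued;
}
```

In this sketch the helper code never writes to program state, so running it ahead of `sum_buckets` can only warm the cache. With two helpers, calls with `tid` 0 and `tid` 1 together touch every bucket exactly once.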

References

[1]
Abraham, S. G., Sugumar, R. A., Rau, B. R., and Gupta, R. 1993. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture (Austin, Tex.). ACM, New York, 139--152.
[2]
Anderson, J. M., Berc, L. M., Dean, J., Ghemawat, S., Henzinger, M. R., Leung, S.-T. A., Sites, R. L., Vandevoorde, M. T., Waldspurger, C. A., and Weihl, W. E. 1997. Continuous profiling: Where have all the cycles gone? SRC Technical Note 1997-016a, Digital. July.
[3]
Annavaram, M., Patel, J. M., and Davidson, E. S. 2001. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture (Goteborg, Sweden). ACM, New York, 52--61.
[4]
Binkley, D. and Gallagher, K. 1996. A Survey of Program Slicing. Academic Press, Orlando, Fla.
[5]
Burger, D. and Austin, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. CS TR 1342, University of Wisconsin-Madison, Madison, Wisc., June.
[6]
Chang, F. and Gibson, G. A. 1999. Automatic I/O hint generation through speculative execution. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (New Orleans, La.). ACM, New York, 1--14.
[7]
Chappell, R. S., Kim, S. P., Reinhardt, S. K., and Patt, Y. N. 1999. Simultaneous subordinate microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer Architecture (Atlanta, Ga.). ACM, New York, 186--195.
[8]
Chappell, R. S., Tseng, F., Yoaz, A., and Patt, Y. N. 2002. Difficult-path branch prediction using subordinate microthreads. In Proceedings of the 29th Annual International Symposium on Computer Architecture (Anchorage, Ak.). ACM, New York, 307--317.
[9]
Chen, T.-F. and Baer, J.-L. 1995. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44, 5 (May), 609--623.
[10]
Cmelik, R. F. and Keppel, D. 1993. Shade: A fast instruction set simulator for execution profiling. TR 93-12, Sun Microsystems. July.
[11]
Collins, J. D., Tullsen, D. M., Wang, H., and Shen, J. P. 2001a. Dynamic speculative precomputation. In Proceedings of the 34th International Symposium on Microarchitecture (Austin, Tex.). ACM, New York, 306--317.
[12]
Collins, J. D., Wang, H., Tullsen, D. M., Hughes, C., Lee, Y.-F., Lavery, D., and Shen, J. P. 2001b. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture (Goteborg, Sweden). ACM, New York, 14--25.
[13]
Cytron, R. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing (University Park, Pa.). IEEE Computer Society Press, Los Alamitos, Calif., 836--844.
[14]
Dubois, M. and Song, Y. H. 1998. Assisted execution. CENG Technical Report 98-25, Department of EE-Systems, University of Southern California. October.
[15]
Dundas, J. and Mudge, T. 1997. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the 1997 ACM International Conference on Supercomputing (Vienna, Austria). ACM, New York, 68--75.
[16]
Farcy, A., Temam, O., Espasa, R., and Juan, T. 1998. Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes. In Proceedings of the 31st International Symposium on Microarchitecture (Dallas, Tex.). ACM, New York, 59--68.
[17]
Ferrante, J., Ottenstein, K., and Warren, J. 1987. The program dependence graph and its use in optimization. ACM Trans. Prog. Lang. Syst. 9, 3 (July), 319--349.
[18]
Kim, D. and Yeung, D. 2002. Design and evaluation of compiler algorithms for pre-execution. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, Calif.). ACM, New York, 159--170.
[19]
Liao, S. S. W., Wang, P. H., Wang, H., Hoflehner, G., Lavery, D., and Shen, J. P. 2002. Post-pass binary adaptation for software-based speculative precomputation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Berlin, Germany). ACM, New York, 117--128.
[20]
Luk, C.-K. 2001. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture (Goteborg, Sweden). ACM, New York, 40--51.
[21]
Lyle, J. R. and Wallace, D. R. 1997. Using the Unravel program slicing tool to evaluate high integrity software. In Proceedings of the 10th International Software Quality Week (San Francisco, Calif.). May.
[22]
Lyle, J. R., Wallace, D. R., Graham, J. R., Gallagher, K. B., Poole, J. P., and Binkley, D. W. 1995. Unravel: A CASE tool to assist evaluation of high integrity software. NISTIR 5691, National Institute of Standards and Technology. August.
[23]
Madon, D., Sanchez, E., and Monnier, S. 1999. A study of a simultaneous multithreaded processor implementation. In Proceedings of EuroPar '99 (Toulouse, France). Springer-Verlag, New York, 716--726.
[24]
Moshovos, A., Pnevmatikatos, D. N., and Baniasadi, A. 2001. Slice-processors: An implementation of operation-based prediction. In Proceedings of the International Conference on Supercomputing (Sorrento, Italy). ACM, New York, 321--334.
[25]
Mowry, T. 1998. Tolerating latency in multiprocessors through compiler-inserted prefetching. ACM Trans. Comput. Syst. 16, 1 (Feb.), 55--92.
[26]
Padua, D. A., Kuck, D. J., and Lawrie, D. H. 1980. High-speed multiprocessors and compilation techniques. IEEE Trans. Comput. C-29, 9 (Sept.), 763--776.
[27]
Padua, D. A. and Wolfe, M. J. 1986. Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12 (Dec.), 1184--1201.
[28]
Pai, V. S. and Adve, S. 1999. Code transformations to improve memory parallelism. In Proceedings of the International Symposium on Microarchitecture (Haifa, Israel). ACM, New York, 147--155.
[29]
Rogers, A., Carlisle, M., Reppy, J., and Hendren, L. 1995. Supporting dynamic data structures on distributed memory machines. ACM Trans. Prog. Lang. Syst. 17, 2 (Mar.).
[30]
Roth, A., Moshovos, A., and Sohi, G. S. 1998. Dependence based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, Calif.). ACM, New York, 115--126.
[31]
Roth, A., Moshovos, A., and Sohi, G. S. 1999. Improving virtual function call target prediction via dependence-based pre-computation. In Proceedings of the 13th Annual International Conference on Supercomputing (Rhodes, Greece). ACM, New York, 356--364.
[32]
Roth, A. and Sohi, G. S. 2001. Speculative data-driven multithreading. In Proceedings of the 7th International Conference on High Performance Computer Architecture (Monterrey, Mexico). IEEE Computer Society Press, Los Alamitos, Calif., 191--202.
[33]
Roth, A. and Sohi, G. S. 2002. A quantitative framework for automated pre-execution thread selection. In Proceedings of the 35th Annual International Symposium on Microarchitecture (Istanbul, Turkey). ACM, New York, 430--441.
[34]
Snavely, A. and Tullsen, D. M. 2000. Symbiotic jobscheduling for a simultaneous multithreading processor. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, Mass.). ACM, New York, 234--244.
[35]
SPEC. 2000. SPEC CPU2000 V1.2 (http://www.specbench.org/osg/cpu2000/).
[36]
Sundaramoorthy, K., Purser, Z., and Rotenberg, E. 2000. Slipstream processors: Improving both performance and fault tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, Mass.). ACM, New York, 191--202.
[37]
Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., and Stamm, R. L. 1996. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 1996 International Symposium on Computer Architecture (Philadelphia, Pa.). ACM, New York, 191--202.
[38]
Tullsen, D. M., Lo, J. L., Eggers, S. J., and Levy, H. M. 1999. Supporting fine-grained synchronization on a simultaneous multithreading processor. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (Orlando, Fla.). IEEE Computer Society Press, Los Alamitos, Calif., 54--58.
[39]
Wang, P. H., Wang, H., Collins, J. D., Grochowski, E., Kling, R. M., and Shen, J. P. 2002. Memory latency-tolerance approaches for Itanium processors: Out-of-order execution vs. speculative precomputation. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture (Boston, Mass.). ACM, New York, 187--196.
[40]
Weiser, M. 1984. Program slicing. IEEE Trans. Softw. Eng. SE-10, 4 (July).
[41]
Zilles, C. B. and Sohi, G. S. 2000. Understanding the backward slices of performance degrading instructions. In Proceedings of the 27th Annual International Symposium on Computer Architecture (Vancouver, Canada). ACM, New York, 172--181.
[42]
Zilles, C. B. and Sohi, G. 2001. Execution-based prediction using speculative slices. In Proceedings of the 28th Annual International Symposium on Computer Architecture (Goteborg, Sweden). ACM, New York, 2--13.
[43]
Zilles, C. B. and Sohi, G. 2002. Master/slave speculative parallelization. In Proceedings of the 35th International Symposium on Microarchitecture (Istanbul, Turkey). ACM, New York, 85--96.

Published In

ACM Transactions on Computer Systems, Volume 22, Issue 3
August 2004
99 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/1012268

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data prefetching
  2. memory-level parallelism
  3. multithreading
  4. pre-execution
  5. prefetch conversion
  6. program slicing
  7. speculative loop parallelization

Qualifiers

  • Article

Cited By

  • (2022) Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores. ACM Transactions on Architecture and Code Optimization 19:2 (1-28). DOI: 10.1145/3506704. Online publication date: 7-Mar-2022
  • (2020) Precise Runahead Execution. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (397-410). DOI: 10.1109/HPCA47549.2020.00040. Online publication date: Feb-2020
  • (2019) Bootstrapping. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (687-700). DOI: 10.1145/3297858.3304052. Online publication date: 4-Apr-2019
  • (2019) Freeway: Maximizing MLP for Slice-Out-of-Order Execution. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (558-569). DOI: 10.1109/HPCA.2019.00009. Online publication date: Feb-2019
  • (2018) An Event-Triggered Programmable Prefetcher for Irregular Workloads. ACM SIGPLAN Notices 53:2 (578-592). DOI: 10.1145/3296957.3173189. Online publication date: 19-Mar-2018
  • (2018) An Event-Triggered Programmable Prefetcher for Irregular Workloads. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (578-592). DOI: 10.1145/3173162.3173189. Online publication date: 19-Mar-2018
  • (2018) A Case for a More Effective, Power-Efficient Turbo Boosting. ACM Transactions on Architecture and Code Optimization 15:1 (1-22). DOI: 10.1145/3170433. Online publication date: 22-Mar-2018
  • (2015) The load slice core microarchitecture. ACM SIGARCH Computer Architecture News 43:3S (272-284). DOI: 10.1145/2872887.2750407. Online publication date: 13-Jun-2015
  • (2015) The load slice core microarchitecture. Proceedings of the 42nd Annual International Symposium on Computer Architecture (272-284). DOI: 10.1145/2749469.2750407. Online publication date: 13-Jun-2015
  • (2012) Automatic Extraction of Parallelism from Sequential Code. Fundamentals of Multicore Software Development (201-238). DOI: 10.1201/b11417-14. Online publication date: 9-Jan-2012
