Article

Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

Author:

Chi-Keung Luk

ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture

Pages 40 - 51

https://doi.org/10.1145/379240.379250

Published: 01 May 2001

Abstract

Hard-to-predict data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution (essentially a combined act of speculative address generation and prefetching) to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no need for special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% on a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
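The core idea of pre-execution can be illustrated with a toy model. The sketch below is not the paper's mechanism or simulator. It is a minimal illustration under stated assumptions: a small fully associative LRU cache, a linked list whose nodes sit at shuffled addresses, and a helper that runs the same pointer-chasing code a fixed number of nodes ahead of the main thread. All names (`LRUCache`, `build_chain`, `traverse`, `lookahead`) are hypothetical, and the model ignores memory latency, timing, and contention for SMT resources.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache; tracks only which addresses are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Touch addr. Return True on a hit; on a miss, fill the line and return False."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        return False


def build_chain(n, seed=1):
    """Build a linked list of n nodes at shuffled (irregular) addresses."""
    addrs = list(range(n))
    random.Random(seed).shuffle(addrs)
    nxt = {a: b for a, b in zip(addrs, addrs[1:])}
    nxt[addrs[-1]] = None
    return addrs[0], nxt


def traverse(head, nxt, cache, lookahead=0):
    """Count main-thread misses while a helper pre-executes `lookahead` nodes ahead."""
    helper = head
    for _ in range(lookahead):  # the helper gets a head start
        if helper is None:
            break
        cache.access(helper)
        helper = nxt[helper]

    misses = 0
    node = head
    while node is not None:
        if not cache.access(node):
            misses += 1
        node = nxt[node]
        if lookahead and helper is not None:  # the helper advances in lockstep
            cache.access(helper)
            helper = nxt[helper]
    return misses


head, nxt = build_chain(1000)
baseline = traverse(head, nxt, LRUCache(64), lookahead=0)
assisted = traverse(head, nxt, LRUCache(64), lookahead=8)
print(baseline, assisted)
```

In this model the unassisted traversal misses on every node, because each address is touched for the first time by the main thread. With the helper running eight nodes ahead, every main-thread access finds its line already resident. The helper only helps if the cache can hold the lines it brings in until the main thread reaches them, so a lookahead that is large relative to the cache capacity would evict lines before use.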

Published In

ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture

June 2001

289 pages

ISBN:0769511627

DOI:10.1145/379240

Chairman:
Per Stenström
Chalmers Univ. of Technology

ACM SIGARCH Computer Architecture News Volume 29, Issue 2
Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)
May 2001
262 pages
ISSN:0163-5964
DOI:10.1145/384285
Editor:
Per Stenström
Chalmers Univ. of Technology

Copyright © 2001 Author.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS\TCCA: TC on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ISCA01

Sponsor:

SIGARCH
IEEE-CS\TCCA

ISCA01: 28th International Symposium on Computer Architecture

June 30 - July 4, 2001

Göteborg, Sweden

Acceptance Rates

ISCA '01 Paper Acceptance Rate 24 of 163 submissions, 15%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%
