DOI:10.1145/379240.379250

Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

Published: 01 May 2001
  • Abstract

    In many irregular applications, data addresses are hard to predict, which renders prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution (essentially a combination of speculative address generation and prefetching) to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no need for special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% on a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
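
    The idea of pre-execution described in the abstract can be illustrated with a toy model. The sketch below is not the paper's mechanism or its SMT extensions; it is a deterministic Python simulation written for illustration only. The names `Node`, `LRUCache`, `traverse`, and `simulate`, the 8-line cache, and the lookahead distances are all assumptions. A "pre-execution thread" runs the same pointer-chasing code as the main thread, stays a bounded number of nodes ahead, and touches each node so that it is already cached when the main thread arrives. Timing is ignored entirely: the model counts only main-thread cache misses.

```python
import random
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    addr: int                      # simulated memory address of this node
    next: Optional["Node"] = None  # pointer to the next list element


class LRUCache:
    """Tiny fully associative cache with LRU replacement (hypothetical model)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr: int) -> bool:
        """Touch `addr`; return True on a hit, False on a miss (line is then filled)."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            self.lines[addr] = None
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
        return hit


def traverse(head: Optional[Node], cache: LRUCache, lookahead: int) -> int:
    """Walk the list; return the number of misses seen by the main thread.

    The helper executes the same address-generating code (following `next`)
    and stays at most `lookahead` nodes ahead. lookahead=0 disables it.
    """
    misses = 0
    main, helper, ahead = head, head, 0
    while main is not None:
        # Pre-execution thread: speculative address generation plus prefetch.
        while helper is not None and ahead < lookahead:
            cache.access(helper.addr)
            helper = helper.next
            ahead += 1
        # Main thread: the access whose latency we want to hide.
        if not cache.access(main.addr):
            misses += 1
        main = main.next
        ahead = max(ahead - 1, 0)
    return misses


def simulate(lookahead: int, n: int = 64, capacity: int = 8) -> int:
    """Build a list of `n` nodes at scattered, distinct addresses and traverse it."""
    addrs = random.Random(0).sample(range(0, 1 << 20, 64), n)
    head = None
    for a in reversed(addrs):
        head = Node(a, head)
    return traverse(head, LRUCache(capacity), lookahead)


if __name__ == "__main__":
    for distance in (0, 4, 16):
        print(f"lookahead={distance:2d}  main-thread misses={simulate(distance)}")
```

    In this model, `simulate(0)` misses on every node, while `simulate(4)` hides all main-thread misses because each node is touched shortly before it is needed. A lookahead larger than the cache (`simulate(16)` with 8 lines) brings the misses back, since lines are evicted before the main thread reaches them. On real hardware the benefit also depends on how far ahead the helper can actually run, which this sketch does not model.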



    Published In

    ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture
    June 2001
    289 pages
    ISBN:0769511627
    DOI:10.1145/379240
    • ACM SIGARCH Computer Architecture News, Volume 29, Issue 2
      Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)
      May 2001
      262 pages
      ISSN:0163-5964
      DOI:10.1145/384285

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Article

    Conference

    ISCA01

    Acceptance Rates

    ISCA '01 Paper Acceptance Rate 24 of 163 submissions, 15%;
    Overall Acceptance Rate 543 of 3,203 submissions, 17%


