Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1950365.1950411acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

Inter-core prefetching for multicore processors using migrating helper threads

Published: 05 March 2011 Publication History

Abstract

Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy application and applications with large serial components. The challenge, then, is to develop techniques that allow multiple cores to work in concert to accelerate a single thread. This paper describes inter-core prefetching, a technique to exploit multiple cores to accelerate a single thread. Inter-core prefetching extends existing work on helper threads for SMT machines to multicore machines.
Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that would otherwise be idle, prefetching the data that the compute thread will need. The compute thread then migrates between cores, following the path of the prefetch threads, and finds the data already waiting for it. Inter-core prefetching works with existing hardware and existing instruction set architectures. Using a range of state-of-the-art multiprocessors, this paper characterizes the potential benefits of the technique with microbenchmarks and then measures its impact on a range of memory intensive applications. The results show that inter-core prefetching improves performance by an average of 31 to 63%, depending on the architecture, and speeds up some applications by as much as 2.8×. It also demonstrates that inter-core prefetching reduces energy consumption by between 11 and 26% on average.

References

[1]
T. M. Aamodt, P. Chow, P. Hammarlund, H. Wang, and J. P. Shen. Hardware support for prescient instruction prefetch. In Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004.
[2]
M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of the 28th annual international symposium on Computer architecture, 2001.
[3]
J. A. Brown, H. Wang, G. Chrysos, P. H. Wang, and J. P. Shen. Speculative precomputation on chip multiprocessors. In In Proceedings of the 6th Workshop on Multithreaded Execution, Architecture, and Compilation, 2001.
[4]
J. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. In Proceedings of the 33rd annual International Symposium on Computer Architecture, June 2006.
[5]
R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous subordinate microthreading (ssmt). In Proceedings of the international symposium on Computer Architecture, May 1999.
[6]
T.-F. Chen and J.-L. Baer. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers, (5), May 1995.
[7]
T. M. Chilimbi and M. Hirzel. Dynamic hot data stream prefetching for general-purpose programs. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, 2002.
[8]
Collins, Tullsen, Wang, and Shen}collins-dspJ. Collins, D. Tullsen, H. Wang, and J. Shen. Dynamic speculative precompuation. In Proceedings of the International Symposium on Microarchitecture, December 2001.
[9]
Collins, Wang, Tullsen, Hughes, Lee, Lavery, and Shen}collins01J. Collins, H. Wang, D. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the International Symposium on Computer Architecture, July 2001.
[10]
J. Dundas and T. Mudge. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the 11th international conference on Supercomputing, 1997.
[11]
A. Garg and M. C. Huang. A performance-correctness explicitly-decoupled architecture. In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, 2008.
[12]
J. Gummaraju and M. Rosenblum. Stream programming on general-purpose processors. In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[13]
D. Hackenberg, D. Molka, and W. E. Nagel. Comparing cache architectures and coherence protocols on x86-64 multicore smp systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.
[14]
K. Z. Ibrahim, G. T. Byrd, and E. Rotenberg. Slipstream execution mode for cmp-based multiprocessors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 2003.
[15]
N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the international symposium on Computer Architecture, June 1990.
[16]
M. Kamruzzaman, S. Swanson, and D. M. Tullsen. Software data spreading: leveraging distributed caches to improve single thread performance. In Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, 2010.
[17]
D. Kim, S. Liao, P. Wang, J. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. Shen. Physical experiment with prefetching helper threads on Intel's hyper-threaded processors. In International Symposium on Code Generation and Optimization, March 2004.
[18]
D. Kim and D. Yeung. Design and evaluation of compiler algorithm for pre-execution. In Proceedings of the international conference on Architectural support for programming languages and operating systems, October 2002.
[19]
V. Krishnan and J. Torrellas. A chip-multiprocessor architecture with speculative multithreading". IEEE Transactions on Computers, September 1999.
[20]
S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. Shen. Post-pass binary adaptation for software-based speculative precomputation. In Proceedings of the conference on Programming Language Design and Implementation, October 2002.
[21]
J. Lu, A. Das, W.-C. Hsu, K. Nguyen, and S. G. Abraham. Dynamic helper threaded prefetching on the sun ultrasparc cmp processor. In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[22]
C.-K. Luk. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings of the 28th annual international symposium on Computer architecture, July 2001.
[23]
P. Marcuello, A. González, and J. Tubella. Speculative multithreaded processors. In 12th International Conference on Supercomputing, November 1998.
[24]
P. Michaud. Exploiting the cache capacity of a single-chip multi-core processor with execution migration. In Proceedings of the 10th International Symposium on High Performance Computer Architecture, February 2004.
[25]
T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS-V: Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, 1992.
[26]
O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 2003.
[27]
J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi. Nu-minebench 2.0. technical report. Technical Report CUCIS-2005-08-01, Center for Ultra-Scale Computing and Information Security, Northwestern University, August 2006. URL http://cucis.ece.northwestern.edu/techreports/pdf/CUCIS-2004-08-001.pdf%.
[28]
C. G. Quiñones, C. Madriles, J. Sánchez, P. Marcuello, A. González, and D. M. Tullsen. Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices. In ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2005.
[29]
J. E. Smith. Decoupled access/execute computer architectures. In ISCA '82: Proceedings of the 9th annual symposium on Computer Architecture, 1982.
[30]
G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the International Symposium on Computer Architecture, June 1995.
[31]
K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream processors: improving both performance and fault tolerance. SIGPLAN Not., 35 (11), 2000.
[32]
W. Zhang, D. Tullsen, and B. Calder. Accelerating and adapting precomputation threads for efficient prefetching. In Proceedings of the International Symposium on High Performance Computer Architecture, January 2007.
[33]
C. Zilles and G. Sohi. Execution-based prediction using speculative slices. In Proceedings of the International Symposium on Computer Architecture, July 2001.

Cited By

View all
  • (2024)Performance or Efficiency? A Tale of Two Cores for DB WorkloadsProceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663444(1-5)Online publication date: 10-Jun-2024
  • (2023)ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric SystemsIEEE Transactions on Emerging Topics in Computing10.1109/TETC.2022.322613211:2(388-403)Online publication date: 1-Apr-2023
  • (2023)vPIM: Efficient Virtual Address Translation for Scalable Processing-in-Memory Architectures2023 60th ACM/IEEE Design Automation Conference (DAC)10.1109/DAC56929.2023.10247745(1-6)Online publication date: 9-Jul-2023
  • Show More Cited By

Index Terms

  1. Inter-core prefetching for multicore processors using migrating helper threads

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ASPLOS XVI: Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
    March 2011
    432 pages
    ISBN:9781450302661
    DOI:10.1145/1950365
    • cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 46, Issue 3
      ASPLOS '11
      March 2011
      407 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/1961296
      Issue’s Table of Contents
    • cover image ACM SIGARCH Computer Architecture News
      ACM SIGARCH Computer Architecture News  Volume 39, Issue 1
      ASPLOS '11
      March 2011
      407 pages
      ISSN:0163-5964
      DOI:10.1145/1961295
      Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 05 March 2011

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. chip multiprocessors
    2. compilers
    3. helper threads
    4. single-thread performance

    Qualifiers

    • Research-article

    Conference

    ASPLOS'11

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)45
    • Downloads (Last 6 weeks)5
    Reflects downloads up to 02 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Performance or Efficiency? A Tale of Two Cores for DB WorkloadsProceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663444(1-5)Online publication date: 10-Jun-2024
    • (2023)ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric SystemsIEEE Transactions on Emerging Topics in Computing10.1109/TETC.2022.322613211:2(388-403)Online publication date: 1-Apr-2023
    • (2023)vPIM: Efficient Virtual Address Translation for Scalable Processing-in-Memory Architectures2023 60th ACM/IEEE Design Automation Conference (DAC)10.1109/DAC56929.2023.10247745(1-6)Online publication date: 9-Jul-2023
    • (2022)Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order CoresACM Transactions on Architecture and Code Optimization10.1145/350670419:2(1-28)Online publication date: 7-Mar-2022
    • (2022)Software pre-execution for irregular memory accesses in the HBM eraProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517783(231-242)Online publication date: 19-Mar-2022
    • (2022)Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO56248.2022.00029(228-244)Online publication date: Oct-2022
    • (2019)Evaluation of Hardware Data Prefetchers on Server ProcessorsACM Computing Surveys10.1145/331274052:3(1-29)Online publication date: 18-Jun-2019
    • (2019)Efficient Data Supply for Parallel Heterogeneous ArchitecturesACM Transactions on Architecture and Code Optimization10.1145/331033216:2(1-23)Online publication date: 26-Apr-2019
    • (2019)Freeway: Maximizing MLP for Slice-Out-of-Order Execution2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)10.1109/HPCA.2019.00009(558-569)Online publication date: Feb-2019
    • (2018)SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order coresACM SIGPLAN Notices10.1145/3296979.319239353:4(328-343)Online publication date: 11-Jun-2018
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media