research-article

Inter-core prefetching for multicore processors using migrating helper threads

Authors: Md Kamruzzaman, Steven Swanson, Dean M. Tullsen

ASPLOS XVI: Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems

Pages 393 - 404

https://doi.org/10.1145/1950365.1950411

Published: 05 March 2011

Abstract

Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy applications and applications with large serial components. The challenge, then, is to develop techniques that allow multiple cores to work in concert to accelerate a single thread. This paper describes inter-core prefetching, a technique to exploit multiple cores to accelerate a single thread. Inter-core prefetching extends existing work on helper threads for SMT machines to multicore machines.

Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that would otherwise be idle, prefetching the data that the compute thread will need. The compute thread then migrates between cores, following the path of the prefetch threads, and finds the data already waiting for it. Inter-core prefetching works with existing hardware and existing instruction set architectures. Using a range of state-of-the-art multiprocessors, this paper characterizes the potential benefits of the technique with microbenchmarks and then measures its impact on a range of memory intensive applications. The results show that inter-core prefetching improves performance by an average of 31 to 63%, depending on the architecture, and speeds up some applications by as much as 2.8×. It also demonstrates that inter-core prefetching reduces energy consumption by between 11 and 26% on average.
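
The hand-off between the compute thread and a prefetch thread can be pictured as a chunk-by-chunk schedule. The following is a minimal sketch under stated assumptions, not the authors' implementation: it only simulates, in plain Python, which core each thread occupies for each chunk of work. It assumes a single prefetch thread that runs exactly one chunk ahead and work that is split into fixed chunks. The names `Step`, `schedule`, `num_chunks`, and `num_cores` are hypothetical and do not come from the paper. No real thread migration or cache prefetching takes place.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    chunk: int                      # chunk the compute thread works on
    compute_core: int               # core running the compute thread
    prefetch_core: Optional[int]    # core running the prefetch thread, if any
    prefetch_chunk: Optional[int]   # chunk being prefetched, if any


def schedule(num_chunks: int, num_cores: int = 2) -> List[Step]:
    """Simulate the per-chunk core assignment (illustrative assumption only)."""
    if num_cores < 2:
        raise ValueError("need at least two cores")
    steps = []
    for chunk in range(num_chunks):
        compute_core = chunk % num_cores
        nxt = chunk + 1
        if nxt < num_chunks:
            # The prefetch thread runs one chunk ahead, on the core that the
            # compute thread will migrate to next.
            steps.append(Step(chunk, compute_core, nxt % num_cores, nxt))
        else:
            # Last chunk: nothing is left to prefetch.
            steps.append(Step(chunk, compute_core, None, None))
    return steps


if __name__ == "__main__":
    for step in schedule(4):
        print(step)
```

In this sketch the compute thread always lands on the core where the following chunk was just prefetched, which mirrors the "following the path of the prefetch threads" behavior described in the abstract. Chunk sizing, synchronization, and the actual migration mechanism are deliberately left out.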

Index Terms

Inter-core prefetching for multicore processors using migrating helper threads
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In


ASPLOS XVI: Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems

March 2011

432 pages

ISBN:9781450302661

DOI:10.1145/1950365

General Chair: Rajiv Gupta, University of California, Riverside
Program Chair: Todd C. Mowry, Carnegie Mellon University

ACM SIGPLAN Notices Volume 46, Issue 3
ASPLOS '11
March 2011
407 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1961296
ACM SIGARCH Computer Architecture News Volume 39, Issue 1
ASPLOS '11
March 2011
407 pages
ISSN:0163-5964
DOI:10.1145/1961295

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS'11: Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems

March 5 - 11, 2011

Newport Beach, California, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 57
Total Downloads: 1,057
Downloads (Last 12 months): 45
Downloads (Last 6 weeks): 5

Reflects downloads up to 02 Feb 2025

Cited By

Sen R (2024). Performance or Efficiency? A Tale of Two Cores for DB Workloads. Proceedings of the 20th International Workshop on Data Management on New Hardware, pp. 1-5. Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663444
Ghiasi N, Vijaykumar N, Oliveira G, Orosa L, Fernandez I, Sadrosadati M, Kanellopoulos K, Hajinazar N, Luna J, Mutlu O (2023). ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems. IEEE Transactions on Emerging Topics in Computing, 11:2, pp. 388-403. Online publication date: 1-Apr-2023
https://doi.org/10.1109/TETC.2022.3226132
Fatima A, Liu S, Seemakhupt K, Ausavarungnirun R, Khan S (2023). vPIM: Efficient Virtual Address Translation for Scalable Processing-in-Memory Architectures. 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1-6. Online publication date: 9-Jul-2023
https://doi.org/10.1109/DAC56929.2023.10247745
Kumar R, Alipour M, Black-Schaffer D (2022). Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores. ACM Transactions on Architecture and Code Optimization, 19:2, pp. 1-28. Online publication date: 7-Mar-2022
https://dl.acm.org/doi/10.1145/3506704
Mehta S, Elsesser G, Greyzck T, Egger B, Smith A (2022). Software pre-execution for irregular memory accesses in the HBM era. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pp. 231-242. Online publication date: 19-Mar-2022
https://dl.acm.org/doi/10.1145/3497776.3517783
Darabi S, Sadrosadati M, Akbarzadeh N, Lindegger J, Hosseini M, Park J, Gomez-Luna J, Mutlu O, Sarbazi-Azad H (2022). Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. Online publication date: Oct-2022
https://doi.org/10.1109/MICRO56248.2022.00029
Bakhshalipour M, Tabaeiaghdaei S, Lotfi-Kamran P, Sarbazi-Azad H (2019). Evaluation of Hardware Data Prefetchers on Server Processors. ACM Computing Surveys, 52:3, pp. 1-29. Online publication date: 18-Jun-2019
https://dl.acm.org/doi/10.1145/3312740
Ham T, Aragón J, Martonosi M (2019). Efficient Data Supply for Parallel Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization, 16:2, pp. 1-23. Online publication date: 26-Apr-2019
https://dl.acm.org/doi/10.1145/3310332
Kumar R, Alipour M, Black-Schaffer D (2019). Freeway: Maximizing MLP for Slice-Out-of-Order Execution. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 558-569. Online publication date: Feb-2019
https://doi.org/10.1109/HPCA.2019.00009
Tran K, Jimborean A, Carlson T, Koukos K, Själander M, Kaxiras S (2018). SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores. ACM SIGPLAN Notices, 53:4, pp. 328-343. Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3296979.3192393
