research-article

Meeting midway: improving CMP performance with memory-side prefetching

Authors:

Praveen Yedlapalli,

Jagadish Kotra,

Emre Kultursay,

Mahmut Kandemir,

Anand Sivasubramaniam

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Pages 289 - 298

Published: 07 October 2013

Abstract

Both on-chip resource contention and off-chip latencies have a significant impact on memory requests in large-scale chip multiprocessors. We propose a memory-side prefetcher that brings data on-chip from DRAM but does not proactively push this data further to the cores/caches. Sitting close to memory, it uses detailed knowledge of DRAM state and memory channels to exploit DRAM row buffer locality and channel state, bringing data (from the current row buffer) on-chip ahead of need. This not only reduces the number of off-chip accesses for demand requests, but also reduces row buffer conflicts, effectively improving DRAM access times. At the same time, our prefetcher keeps this data in a small buffer at each memory controller instead of pushing it into the caches, which avoids on-chip resource contention. We show that the proposed memory-side prefetcher outperforms a state-of-the-art core-side prefetcher and an existing memory-side prefetcher. More importantly, our prefetcher can also work in tandem with the core-side prefetcher to amplify the benefits. Using a wide range of multiprogrammed and multithreaded workloads, we show that this memory-side prefetcher provides average IPC improvements of 6.2% (maximum of 33.6%) when running alone and 10% (maximum of 49.6%) when combined with a core-side prefetcher. By meeting requests midway, our solution reduces off-chip latencies while avoiding the on-chip resource contention caused by inaccurate and ill-timed prefetches.
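The abstract does not spell out the prefetcher's mechanism, so the following is only a toy sketch of the general idea it describes: when a demand request opens a DRAM row, a few neighboring lines from that same open row are copied into a small buffer at the memory controller, and later demand requests check that buffer before going off-chip. The class name, the next-line policy, the LRU replacement, and all sizes (`LINES_PER_ROW`, `PREFETCH_DEGREE`, `BUFFER_CAPACITY`) are illustrative assumptions, not the authors' design.

```python
from collections import OrderedDict

LINES_PER_ROW = 8      # assumed: cache lines per DRAM row (toy value)
PREFETCH_DEGREE = 2    # assumed: lines fetched ahead from the open row
BUFFER_CAPACITY = 4    # assumed: entries in the per-controller buffer


class MemoryController:
    """Toy controller with one DRAM bank and a small prefetch buffer."""

    def __init__(self):
        self.open_row = None           # row currently held in the row buffer
        self.buffer = OrderedDict()    # prefetched line addresses, LRU order
        self.stats = {"buffer_hit": 0, "row_hit": 0, "row_miss": 0}

    def _insert(self, line):
        # Add (or refresh) a prefetched line, evicting the oldest if full.
        self.buffer[line] = True
        self.buffer.move_to_end(line)
        while len(self.buffer) > BUFFER_CAPACITY:
            self.buffer.popitem(last=False)

    def access(self, line):
        """Serve a demand request for a cache-line address."""
        if line in self.buffer:
            # Served from the controller-side buffer: no DRAM access needed.
            del self.buffer[line]
            outcome = "buffer_hit"
        else:
            row = line // LINES_PER_ROW
            outcome = "row_hit" if row == self.open_row else "row_miss"
            self.open_row = row
            # Prefetch the following lines, but only those in the open row.
            for nxt in range(line + 1, line + 1 + PREFETCH_DEGREE):
                if nxt // LINES_PER_ROW == row:
                    self._insert(nxt)
        self.stats[outcome] += 1
        return outcome


def run(trace):
    """Replay a list of line addresses; return per-access outcomes and stats."""
    mc = MemoryController()
    return [mc.access(a) for a in trace], mc.stats
```

Under these assumptions, replaying the sequential trace `[0, 1, 2, 3]` yields one row miss, two buffer hits, and one row hit: the two middle requests never reach DRAM, and the prefetched data stays at the controller rather than being pushed into the caches.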

Index Terms

Meeting midway: improving CMP performance with memory-side prefetching
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Information & Contributors

Information

Published In

PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

October 2013

422 pages

ISBN: 9781479910212

Conference Chair: Christian Fensch (University of Edinburgh, UK)
General Chair: Michael O'Boyle (University of Edinburgh, UK)
Program Chairs: André Seznec (INRIA Rennes, France), François Bodin (IRISA/CAPS Entreprise, France)

Sponsors

IFIP WG 10.3: IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Press

Qualifiers

Research-article

Acceptance Rates

PACT '13 Paper Acceptance Rate 36 of 208 submissions, 17%;

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Article Metrics

Total Citations: 12
Total Downloads: 201
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Oct 2024

Citations

Cited By

Tang X, Kandemir M, Karakoy M, Arunachalam M, McKinley K, Fisher K (2019). Co-optimizing memory-level parallelism and cache-level parallelism. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 935-949. Online publication date: 8-Jun-2019.
https://dl.acm.org/doi/10.1145/3314221.3314599
Liu C, Kotra J, Jung M, Kandemir M, Das C, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). SOML Read. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 955-969. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304035
Liu C, Kotra J, Jung M, Kandemir M, Agrawal N, Rangaswami R (2018). PEN. Proceedings of the 16th USENIX Conference on File and Storage Technologies, pp. 67-82. Online publication date: 12-Feb-2018.
https://dl.acm.org/doi/10.5555/3189759.3189766
Rafique M, Zhu Z (2018). CAMPS. Proceedings of the 47th International Conference on Parallel Processing, pp. 1-9. Online publication date: 13-Aug-2018.
https://dl.acm.org/doi/10.1145/3225058.3225112
George S, Liao M, Jiang H, Kotra J, Kandemir M, Sampson J, Narayanan V, Oskin M, Inoue K (2018). MDACache. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 841-854. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00073
Kotra J, Zhang H, Alameldeen A, Wilkerson C, Kandemir M, Oskin M, Inoue K (2018). CHAMELEON. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 533-545. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00050
Tang X, Kandemir M, Yedlapalli P, Kotra J, Hsu W, Yang C, Lipasti M, Lee H (2016). Improving bank-level parallelism for irregular applications. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-12. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195708
Mittal S (2016). A Survey of Recent Prefetching Techniques for Processor Caches. ACM Computing Surveys 49:2, pp. 1-35. Online publication date: 2-Aug-2016.
https://dl.acm.org/doi/10.1145/2907071
Rodríguez G, Andión J, Kandemir M, Touriño J, Franke B, Wu Y, Rastello F (2016). Trace-based affine reconstruction of codes. Proceedings of the 2016 International Symposium on Code Generation and Optimization, pp. 139-149. Online publication date: 29-Feb-2016.
https://dl.acm.org/doi/10.1145/2854038.2854056
Chidambaram Nachiappan N, Yedlapalli P, Soundararajan N, Kandemir M, Sivasubramaniam A, Das C (2014). GemDroid. ACM SIGMETRICS Performance Evaluation Review 42:1, pp. 355-366. Online publication date: 16-Jun-2014.
https://dl.acm.org/doi/10.1145/2637364.2591973
