DOI: 10.5555/2523721.2523761
research-article

Meeting midway: improving CMP performance with memory-side prefetching

Published: 07 October 2013

Abstract

Both on-chip resource contention and off-chip latencies have a significant impact on memory requests in large-scale chip multiprocessors. We propose a memory-side prefetcher that brings data on-chip from DRAM but does not proactively push it further to the cores or caches. Sitting close to memory, it uses its knowledge of DRAM state and the memory channels to exploit row buffer locality and bring data from the currently open row buffer on-chip ahead of need. This not only reduces the number of off-chip accesses for demand requests but also reduces row buffer conflicts, effectively improving DRAM access times. At the same time, to avoid on-chip resource contention, our prefetcher holds this data in a small buffer at each memory controller instead of pushing it into the caches. We show that the proposed memory-side prefetcher outperforms a state-of-the-art core-side prefetcher and an existing memory-side prefetcher. More importantly, our prefetcher can also work in tandem with the core-side prefetcher to amplify the benefits. Using a wide range of multiprogrammed and multithreaded workloads, we show that this memory-side prefetcher provides average IPC improvements of 6.2% (maximum of 33.6%) when running alone and 10% (maximum of 49.6%) when combined with a core-side prefetcher. By meeting requests midway, our solution reduces off-chip latencies while avoiding the on-chip resource contention caused by inaccurate and ill-timed prefetches.
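
The mechanism described in the abstract can be illustrated with a toy model. The sketch below is not the authors' design: the class name, the single-bank simplification, the FIFO replacement policy, and the parameters `LINES_PER_ROW`, `buffer_lines`, and `degree` are all assumptions made for illustration. It only shows the general idea of prefetching lines from the currently open DRAM row into a small buffer at the memory controller, so that later demand requests can be served on-chip without the prefetched data ever being pushed into the caches.

```python
from collections import OrderedDict

LINES_PER_ROW = 32  # assumed number of cache lines per DRAM row (illustrative)


class ToyMemorySidePrefetcher:
    """Toy model of one DRAM bank plus a small buffer at the memory controller."""

    def __init__(self, buffer_lines=16, degree=4):
        self.buffer = OrderedDict()       # prefetched line addresses, oldest first
        self.buffer_lines = buffer_lines  # assumed buffer capacity, in cache lines
        self.degree = degree              # assumed number of lines prefetched per access
        self.open_row = None              # row currently held in the row buffer
        self.stats = {"buffer_hits": 0, "row_hits": 0, "row_activations": 0}

    def _insert(self, line):
        """Place a prefetched line in the buffer, evicting the oldest if full."""
        if line in self.buffer:
            return
        if len(self.buffer) >= self.buffer_lines:
            self.buffer.popitem(last=False)  # FIFO eviction (an assumption)
        self.buffer[line] = True

    def access(self, line):
        """Serve one demand request for a cache-line address."""
        if line in self.buffer:
            # Served from the controller-side buffer: no off-chip access needed.
            del self.buffer[line]
            self.stats["buffer_hits"] += 1
            return "buffer_hit"

        row = line // LINES_PER_ROW
        if row == self.open_row:
            self.stats["row_hits"] += 1
            outcome = "row_hit"
        else:
            # A different row (or none) is open, so this row must be activated.
            self.stats["row_activations"] += 1
            self.open_row = row
            outcome = "row_activation"

        # Prefetch the following lines, but only from the row that is now open.
        row_end = (row + 1) * LINES_PER_ROW
        for nxt in range(line + 1, min(line + 1 + self.degree, row_end)):
            self._insert(nxt)
        return outcome


def replay(trace, **kwargs):
    """Run a list of line addresses through a fresh model and return its stats."""
    model = ToyMemorySidePrefetcher(**kwargs)
    for line in trace:
        model.access(line)
    return model.stats
```

For a sequential trace such as `replay(list(range(8)))`, most requests in this toy model are absorbed by the buffer and only two reach DRAM. A real controller would also have to account for multiple banks and channels, request scheduling, and prefetch accuracy, none of which are modeled here.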

    Published In

    PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
    October 2013
    422 pages
    ISBN:9781479910212

    Publisher

    IEEE Press

    Author Tags

    1. NOC
    2. memory
    3. prefetching

    Acceptance Rates

    PACT '13 Paper Acceptance Rate 36 of 208 submissions, 17%;
    Overall Acceptance Rate 121 of 471 submissions, 26%
