
Micro-pages: increasing DRAM efficiency with locality-aware data placement

Published: 13 March 2010
  • Abstract

    Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request, but only a small fraction of these bits are ever returned to the CPU. Energy and time are therefore wasted reading (and subsequently writing back) bits that are rarely used. Traditionally, an open-page policy has been used for uni-processor systems, and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, which destroys the available locality and significantly under-utilizes the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems.
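    The locality loss described above can be illustrated with a toy model. The sketch below is not the authors' simulator; it assumes a single DRAM bank under an open-page policy, a 64-byte cache block, and two hypothetical cores that each stream sequentially through their own 8 KB row. Real systems spread accesses over many banks, so the effect here is deliberately exaggerated.

```python
ROW_BYTES = 8 * 1024   # row-buffer size mentioned in the abstract
BLOCK_BYTES = 64       # assumed cache-block size (not stated in the abstract)


def row_buffer_hit_rate(addresses, row_bytes=ROW_BYTES):
    """Fraction of accesses that find their row already open (one bank, open-page)."""
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1
        else:
            open_row = row  # row miss: the old row is closed and the new one opened
    return hits / len(addresses)


def sequential_stream(base, n_blocks):
    """Addresses of n_blocks consecutive cache blocks starting at base."""
    return [base + i * BLOCK_BYTES for i in range(n_blocks)]


def interleave(*streams):
    """Round-robin merge of equal-length per-core access streams."""
    return [addr for group in zip(*streams) for addr in group]


core0 = sequential_stream(0, 128)          # exactly one 8 KB row
core1 = sequential_stream(1 << 20, 128)    # a different row, 1 MB away

single = row_buffer_hit_rate(core0)
mixed = row_buffer_hit_rate(interleave(core0, core1))
print(f"one core alone: {single:.3f}, two cores interleaved: {mixed:.3f}")
```

    In this toy setting a single sequential stream hits the open row on every access after the first, while the interleaved stream alternates between two rows and never hits.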
    The schemes presented here are motivated by our observation that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row buffer will improve the overall utilization of the row-buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably through a reduction in OS page size and through software- or hardware-assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved, along with the energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).
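    The co-location idea can be sketched in a few lines. This is only an illustration, not the mechanism evaluated in the paper: the 1 KB chunk size, the `colocate` helper, the reserved region at 1 MB, and the synthetic trace are all assumptions made for the example, and the model again uses a single bank with an open-page policy.

```python
ROW_BYTES = 8 * 1024    # row-buffer size mentioned in the abstract
CHUNK_BYTES = 1024      # assumed chunk size (hypothetical value)


def hit_rate(addresses):
    """Open-page row-buffer hit rate for a single bank."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr // ROW_BYTES
        hits += (row == open_row)
        open_row = row
    return hits / len(addresses)


def colocate(addresses, hot_chunks, reserved_base):
    """Remap each hot chunk to consecutive slots in a reserved DRAM region."""
    slot = {chunk: i for i, chunk in enumerate(hot_chunks)}
    remapped = []
    for addr in addresses:
        chunk, offset = divmod(addr, CHUNK_BYTES)
        if chunk in slot:
            addr = reserved_base + slot[chunk] * CHUNK_BYTES + offset
        remapped.append(addr)
    return remapped


# Synthetic trace: eight OS pages, 16 KB apart so each falls in a different row,
# with accesses cycling over the first 1 KB chunk of every page.
trace = [p * 16 * 1024 + (k % 16) * 64 for k in range(64) for p in range(8)]
hot_chunks = [p * 16 for p in range(8)]   # chunk numbers of the eight hot chunks

before = hit_rate(trace)
after = hit_rate(colocate(trace, hot_chunks, reserved_base=1 << 20))
print(f"hit rate before: {before:.3f}, after co-location: {after:.3f}")
```

    Because the eight hot chunks total 8 KB, they fit in one row once remapped, so in this toy trace every access after the first finds its row open.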

      Published In

      ACM SIGARCH Computer Architecture News, Volume 38, Issue 1
      ASPLOS '10
      March 2010
      399 pages
      ISSN: 0163-5964
      DOI: 10.1145/1735970

      ASPLOS XV: Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2010
      422 pages
      ISBN: 9781605588391
      DOI: 10.1145/1736020
      • General Chair: James C. Hoe
      • Program Chair: Vikram S. Adve
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data placement
      2. DRAM row-buffer management

      Qualifiers

      • Research-article

      Cited By

      • (2023) Extension VM: Interleaved Data Layout in Vector Memory. ACM Transactions on Architecture and Code Optimization 21:1 (1-23). DOI: 10.1145/3631528. Online publication date: 7-Nov-2023
      • (2022) Hybrid Refresh: Improving DRAM Performance by Handling Weak Rows Smartly. Proceedings of the 2022 International Symposium on Memory Systems (1-11). DOI: 10.1145/3565053.3565060. Online publication date: 3-Oct-2022
      • (2022) BunchBloomer: Cost-Effective Bloom Filter Accelerator for Genomics Applications. 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) (9-16). DOI: 10.1109/FPL57034.2022.00014. Online publication date: Aug-2022
      • (2019) Innovations in the Memory System. Synthesis Lectures on Computer Architecture 14:2 (1-151). DOI: 10.2200/S00933ED1V01Y201906CAC048. Online publication date: 10-Sep-2019
      • (2018) What Your DRAM Power Models Are Not Telling You. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2:3 (1-41). DOI: 10.1145/3224419. Online publication date: 21-Dec-2018
      • (2016) Impact of Intrinsic Profiling Limitations on Effectiveness of Adaptive Optimizations. ACM Transactions on Architecture and Code Optimization 13:4 (1-26). DOI: 10.1145/3008661. Online publication date: 12-Dec-2016
      • (2016) DReAM. Proceedings of the Second International Symposium on Memory Systems (362-373). DOI: 10.1145/2989081.2989102. Online publication date: 3-Oct-2016
      • (2015) Cross-layer memory management for managed language applications. ACM SIGPLAN Notices 50:10 (488-504). DOI: 10.1145/2858965.2814322. Online publication date: 23-Oct-2015
      • (2015) Cross-layer memory management for managed language applications. Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (488-504). DOI: 10.1145/2814270.2814322. Online publication date: 23-Oct-2015
      • (2014) ANATOMY. ACM SIGMETRICS Performance Evaluation Review 42:1 (505-517). DOI: 10.1145/2637364.2591995. Online publication date: 16-Jun-2014
