Open access

BPM/BPM+: Software-based dynamic memory partitioning mechanisms for mitigating DRAM bank-/channel-level interferences in multicore systems

Authors: Lei Liu, Zehan Cui, Yong Li, Yungang Bao, Mingyu Chen, Chengyong Wu

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 1

Article No.: 5, Pages 1 - 28

https://doi.org/10.1145/2579672

Published: 01 February 2014

Abstract

The main memory system is a shared resource in modern multicore machines, and contention for it can cause serious interference that reduces throughput and fairness. Many new memory scheduling mechanisms have been proposed to address the interference problem. However, these mechanisms usually employ relatively complex scheduling logic and require modifications to Memory Controllers (MCs), which incur expensive hardware design and manufacturing overheads.

This article presents a practical software approach to effectively eliminate the interference without any hardware modifications. The key idea is to modify the OS memory management system and adopt a page-coloring-based Bank-level Partitioning Mechanism (BPM) that allocates dedicated DRAM banks to each core (or thread). With BPM, memory requests from distinct programs are segregated across multiple memory banks to promote locality and fairness and to reduce interference. We further extend BPM to BPM+ by incorporating channel-level partitioning, which yields additional gains over BPM in many cases. To achieve benefits in the presence of diverse application memory needs and to avoid performance degradation due to resource underutilization, we propose a dynamic mechanism built on BPM/BPM+ that assigns appropriate bank/channel resources based on application memory/bandwidth demands, monitored through the PMU (performance-monitoring unit) and a low-overhead OS page table scanning process.
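The page-coloring idea can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: the 4 KiB page size, the bank-select bit positions in `BANK_BITS`, and the names `bank_color` and `ColoredAllocator` are hypothetical. Real bank-select bits depend on the memory controller's address mapping, and the actual mechanism lives in the Linux kernel's page allocator rather than in user-level code.

```python
from collections import defaultdict

PAGE_SHIFT = 12            # assumed 4 KiB pages
BANK_BITS = (13, 14, 15)   # hypothetical physical-address bits selecting the DRAM bank


def bank_color(phys_addr: int) -> int:
    """Pack the assumed bank-select bits of a physical address into a small color id."""
    color = 0
    for i, bit in enumerate(BANK_BITS):
        color |= ((phys_addr >> bit) & 1) << i
    return color


class ColoredAllocator:
    """Toy page allocator that keeps one free list per bank color."""

    def __init__(self, num_frames: int):
        self.free = defaultdict(list)
        for pfn in range(num_frames):
            # Bank bits sit above PAGE_SHIFT, so every byte of a frame has the same color.
            self.free[bank_color(pfn << PAGE_SHIFT)].append(pfn)
        self.colors_of = {}

    def assign(self, task: str, colors) -> None:
        """Dedicate a set of bank colors to a task (a core or thread in the article's terms)."""
        self.colors_of[task] = list(colors)

    def alloc_page(self, task: str) -> int:
        """Return a free frame number drawn only from the task's own colors."""
        for color in self.colors_of[task]:
            if self.free[color]:
                return self.free[color].pop()
        raise MemoryError(f"no free frames left in the colors assigned to {task}")


# Usage: two tasks with disjoint colors never receive frames in the same bank.
alloc = ColoredAllocator(num_frames=64)
alloc.assign("A", [0, 1])
alloc.assign("B", [2, 3])
frame_a = alloc.alloc_page("A")
frame_b = alloc.alloc_page("B")
```

Because the two color sets are disjoint, `frame_a` and `frame_b` always map to different banks under the assumed bit layout, which is the isolation property the abstract describes. A dynamic policy would change the sets passed to `assign` over time.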

We implement BPM/BPM+ in the Linux 2.6.32.15 kernel and evaluate the technique on four-core and eight-core real machines by running a large number of randomly generated multiprogrammed and multithreaded workloads. Experimental results show that BPM/BPM+ can improve the overall system throughput by 4.7%/5.9% on average (up to 8.6%/9.5%) and reduce the unfairness by an average of 4.2%/6.1% (up to 15.8%/13.9%).
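The abstract does not state how throughput and unfairness are computed. As an assumption for illustration only, the sketch below uses one pair of metrics that is common in the memory-interference literature: weighted speedup for throughput and maximum slowdown for unfairness. The function names and the sample IPC values are hypothetical and are not results from the article.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-program speedups relative to running alone (higher is better)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def max_slowdown(ipc_shared, ipc_alone):
    """Largest per-program slowdown relative to running alone (lower is fairer)."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))


# Hypothetical IPC values for a two-program workload.
alone = [2.0, 1.0]
shared = [1.0, 0.25]
throughput = weighted_speedup(shared, alone)
unfairness = max_slowdown(shared, alone)
```

Under these definitions, a partitioning scheme helps if it raises the first number and lowers the second for the same workload.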

Reviews

Reviewer: Anshuman Gupta

Dynamic random access memory (DRAM) interference in shared memory systems can, as demonstrated in this paper, lead to degradation of performance. The paper proposes a software-based solution to provide isolation between applications by allocating different banks (bank-level partitioning mechanism (BPM)) and channels (BPM+) to different applications, thereby improving performance. The paper also proposes dynamically allocating the banks to applications based on their memory requirements, with the objective of attaining uniform channel utilization.

The paper attacks the important problem of interference on shared memory systems, and does an excellent job of quantifying the harmful effects of interference and highlighting the worsening of interference effects in the future. Moreover, the paper clearly quantifies the benefits of resource isolation and dynamic resource allocation to reaffirm these ideas, which have been proposed in the past. However, the paper's software-based solution to reduce interference, determine memory requirements, and calculate dynamic allocations lacks details, is poorly evaluated, and might not be sufficiently lightweight to be useful in real systems with rapidly changing application phases or complicated memory behavior. Moreover, there is no analysis of the scalability of the proposed solution to future systems, which seem to be sharing ever-increasing resources between an increasing number of cores.

Here are some key thoughts on the paper in more detail:

The paper quantifies the impact of interference on fairness and throughput. The uneven and large slowdowns in applications due to bandwidth contention or row-buffer contention can lead to degradation of overall system performance.

The paper demonstrates that resource isolation can reduce interference to reduce unfairness and improve throughput. This, however, comes at the cost of fragmentation: allocating banks and channels to an application that has smaller requirements can lead to underutilization.

The paper demonstrates that dynamic resource allocation is required because static/fixed allocation is insufficient. The system should be able to change the resource allocation based on changing application behaviors and overall system resource availability.

The paper's evaluation is not very thorough. While the authors used modern benchmarks, both single-threaded and multithreaded, they fail to show the results for all benchmark combinations under all scenarios. The results look cherry-picked. Some claims in the paper are not backed up with data. For example, dynamic allocation is evaluated in intervals of ten seconds, but why? The paper shows a graph relating channel utilization mismatch to performance improvement, and concludes that better balance in channel utilization leads to higher performance. This might be a correlation rather than causality, as a better channel utilization will lead to both balanced channel utilization and higher performance.

While resource isolation is useful in reducing interference, and dynamic allocation helps to improve resource utilization, it is essential that these two mechanisms are lightweight and accurate in order to account for changing application phases. The paper's software approach might be too expensive and slow to adapt to phase changes. The paper uses last level cache (LLC) miss rate to determine the application's memory behavior, which is insufficient since a low miss rate can be attributed to either large working set size or low memory-level parallelism in the application phase. The determination of application memory requirements from the LLC miss rate is not present in the paper.

The authors comment that hardware solutions are complicated, but the software solution might be more expensive in terms of execution time and energy. They fail to provide any evaluation of performance and energy overheads of the mechanism.

The related work section of the paper fails to look at very recent work in this area (for example, TimeCube [1]).

Overall, I would recommend reading this paper to get insights into the worsening ill effects of interference in shared DRAM systems, and the qualitative and quantitative benefits of resource isolation and dynamic allocation. However, I would be cautious about the paper's software-based solution to achieve these two properties.

Online Computing Reviews Service

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 1

February 2014

373 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2591460

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2014

Accepted: 01 December 2013

Revised: 01 November 2013

Received: 01 June 2013

Published in TACO Volume 11, Issue 1

Qualifiers

Research-article
Research
Refereed
