Article

Analysis of a Memory Bandwidth Limited Scenario for NUMA and GPU Systems

IPDPSW '11: Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Pages 693 - 699

https://doi.org/10.1109/IPDPS.2011.193

Published: 16 May 2011

Abstract

The processing power and parallelism of hardware are expected to increase rapidly over the coming years, whereas memory bandwidth per flop and the amount of main memory per flop will fall behind. As a result, more algorithms will become limited by memory bandwidth, and overall memory requirements will become an important factor in algorithm design. In this paper we study the Gauß-Seidel stencil as an example of a memory bandwidth limited algorithm. We consider GPUs and NUMA systems, both of which are designed to provide high memory bandwidth at the cost of making algorithm design more complex. Mapping the non-linear memory access pattern of the Gauß-Seidel stencil to the different hardware is important for achieving high performance. We show that there is a trade-off between overall performance and memory requirements when optimizing the memory access pattern. Vectorizing on the NUMA system and optimizing to utilize all processors on the GPU do not pay off in terms of performance per memory used, which we consider an important metric given the trends described above.
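To make the access pattern concrete, here is a minimal sketch of one Gauß-Seidel sweep. It assumes a 2D five-point Laplace stencil with lexicographic ordering; the abstract does not state the stencil, dimensionality, or orderings actually used on the GPU and NUMA systems, so the function names `gauss_seidel_sweep` and `solve` and the grid layout are illustrative only. Each update performs a handful of flops per grid point loaded, and it reads neighbours that were already overwritten in the same sweep, which is what makes the kernel both bandwidth bound and awkward to parallelize.

```python
def gauss_seidel_sweep(u):
    """One in-place lexicographic sweep of the five-point Laplace stencil.

    u is a list of rows of floats; boundary entries are held fixed.
    Returns the largest absolute change made to any interior point.
    """
    n_rows, n_cols = len(u), len(u[0])
    max_change = 0.0
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            # u[i-1][j] and u[i][j-1] were already updated in this sweep;
            # u[i+1][j] and u[i][j+1] still hold values from the previous one.
            new = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
            max_change = max(max_change, abs(new - u[i][j]))
            u[i][j] = new
    return max_change


def solve(u, tol=1e-10, max_sweeps=10_000):
    """Repeat sweeps until the largest update falls below tol.

    Returns the number of sweeps performed.
    """
    for sweep in range(1, max_sweeps + 1):
        if gauss_seidel_sweep(u) < tol:
            return sweep
    return max_sweeps


if __name__ == "__main__":
    # Hypothetical test problem: boundary fixed at 1.0, interior starts at 0.0.
    n = 6
    grid = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0
             for j in range(n)] for i in range(n)]
    sweeps = solve(grid)
    print("converged after", sweeps, "sweeps")
```

Because the update is in place, only one copy of the grid is needed, unlike a Jacobi iteration that keeps two. Reorderings that expose more parallelism typically change the memory layout or add storage, which is the kind of performance versus memory trade-off the abstract refers to.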




Published In

IPDPSW '11: Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

May 2011, 2107 pages

ISBN: 9780769545776

Publisher

IEEE Computer Society, United States


Bibliometrics & Citations

Total Citations: 4
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Feb 2025

Cited By

Li T, Dong Q, Wang Y, Gong X, Yang Y (2019). Dual buffer rotation four-stage pipeline for CPU-GPU cooperative computing. Soft Computing - A Fusion of Foundations, Methodologies and Applications 23:3 (859-869). DOI: 10.1007/s00500-017-2795-0. Online publication date: 1-Feb-2019
https://dl.acm.org/doi/10.1007/s00500-017-2795-0
Chen Q, Yang H, Mars J, Tang L (2016). Baymax. ACM SIGARCH Computer Architecture News 44:2 (681-696). DOI: 10.1145/2980024.2872368. Online publication date: 25-Mar-2016
https://dl.acm.org/doi/10.1145/2980024.2872368
Chen Q, Yang H, Mars J, Tang L (2016). Baymax. ACM SIGPLAN Notices 51:4 (681-696). DOI: 10.1145/2954679.2872368. Online publication date: 25-Mar-2016
https://dl.acm.org/doi/10.1145/2954679.2872368
Chen Q, Yang H, Mars J, Tang L, Conte T, Zhou Y (2016). Baymax. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (681-696). DOI: 10.1145/2872362.2872368. Online publication date: 25-Mar-2016
https://dl.acm.org/doi/10.1145/2872362.2872368
