Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Published: 01 January 2011

Abstract

The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.



Published In

IEEE Transactions on Parallel and Distributed Systems  Volume 22, Issue 1
January 2011
191 pages

Publisher

IEEE Press

Author Tags

  1. GPU computing
  2. General-purpose computation on GPUs (GPGPUs)
  3. data parallelism
  4. data-parallel architectures
  5. memory access pattern
  6. memory coalescing
  7. memory optimization
  8. memory selection
  9. vectorization

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 10 Nov 2024

Cited By

  • (2024) Combining Weight Approximation, Sharing and Retraining for Neural Network Model Compression. ACM Transactions on Embedded Computing Systems 23(6), 1-23. DOI: 10.1145/3687466. Online publication date: 11-Sep-2024.
  • (2024) Using Evolutionary Algorithms to Find Cache-Friendly Generalized Morton Layouts for Arrays. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, 83-94. DOI: 10.1145/3629526.3645034. Online publication date: 7-May-2024.
  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55(11), 1-81. DOI: 10.1145/3570638. Online publication date: 16-Mar-2023.
  • (2023) On the Effects of Transaction Data Access Patterns on Performance in Lock-Based Concurrency Control. IEEE Transactions on Computers 72(6), 1718-1732. DOI: 10.1109/TC.2022.3222084. Online publication date: 1-Jun-2023.
  • (2022) Flatfish: A Reinforcement Learning Approach for Application-Aware Address Mapping. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41(11), 4758-4770. DOI: 10.1109/TCAD.2022.3146204. Online publication date: 1-Nov-2022.
  • (2022) Low occupancy high performance elemental products in assembly free FEM on GPU. Engineering with Computers 38(Suppl 3), 2189-2204. DOI: 10.1007/s00366-021-01350-6. Online publication date: 1-Aug-2022.
  • (2020) Data-parallel query processing on non-uniform data. Proceedings of the VLDB Endowment 13(6), 884-897. DOI: 10.14778/3380750.3380758. Online publication date: 11-Mar-2020.
  • (2020) Evaluating Gather and Scatter Performance on CPUs and GPUs. Proceedings of the International Symposium on Memory Systems, 209-222. DOI: 10.1145/3422575.3422794. Online publication date: 28-Sep-2020.
  • (2020) Intelligent Data Placement on Discrete GPU Nodes with Unified Memory. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 139-151. DOI: 10.1145/3410463.3414651. Online publication date: 30-Sep-2020.
  • (2020) PAC: Paged Adaptive Coalescer for 3D-Stacked Memory. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 137-148. DOI: 10.1145/3369583.3392670. Online publication date: 23-Jun-2020.
