A new memory mapping mechanism for GPGPUs' stencil computation

Authors:

Renfa Li

Computing, Volume 97, Issue 8

Pages 795 - 812

https://doi.org/10.1007/s00607-014-0434-5

Published: 01 August 2015

Abstract

When optimizing performance on a GPU, control flow divergence among the threads of one warp can become a performance bottleneck. In our hand-coded GPU stencil computation, to remove the control flow divergence introduced by the conventional mapping between global memory and shared memory, we devise a new mapping mechanism. It models the coalesced memory accesses of GPU threads and the aligned ghost zone overheads, and it removes the conditional statements for the boundary stencil computation points of an XY-tile, which improves performance. In addition, we load only one XY-tile into registers in each stencil computation iteration and apply common sub-expression elimination and software prefetching to reduce overheads. Finally, a detailed performance evaluation shows that, with our optimizations, global memory access traffic is close to the idealized lower bound: for every computed point of an XY-tile, the traffic is roughly 6 % and 4 % above the idealized lower bound of 8 bytes per XY-tile point (which does not take ghost zone overheads into account) on Tesla C2050 and Kepler K20X, respectively.
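
To make the closing figures concrete, the short calculation below converts the reported overheads into bytes per point. It is only an arithmetic sketch: the function name `traffic_per_point` is hypothetical, the 8-byte lower bound is taken from the abstract as given, and the percentages are the approximate values quoted there ("roughly"), not measured data.

```python
# Idealized lower bound from the abstract: 8 bytes of global memory traffic
# per XY-tile point, with ghost zone overheads ignored.
LOWER_BOUND_BYTES = 8.0


def traffic_per_point(overhead_percent: float) -> float:
    """Bytes per point when traffic exceeds the lower bound by overhead_percent."""
    return LOWER_BOUND_BYTES * (1.0 + overhead_percent / 100.0)


# Approximate overheads quoted in the abstract ("roughly 6 and 4 %").
reported = {"Tesla C2050": 6.0, "Kepler K20X": 4.0}

for gpu, pct in reported.items():
    print(f"{gpu}: about {traffic_per_point(pct):.2f} bytes per point")
```

Under these assumptions the traffic works out to about 8.48 bytes per point on Tesla C2050 and about 8.32 bytes per point on Kepler K20X, against the 8-byte bound.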

Cited By

Hijma P, Heldens S, Sclocco A, van Werkhoven B, Bal H (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55(11):1-81, doi:10.1145/3570638. Online publication date: 16-Mar-2023
https://dl.acm.org/doi/10.1145/3570638

Index Terms

A new memory mapping mechanism for GPGPUs' stencil computation
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

Computing Volume 97, Issue 8

August 2015

98 pages

ISSN:0010-485X

Copyright © 2015 Springer-Verlag Wien.

Publisher

Springer-Verlag

Berlin, Heidelberg
