Performance analysis and optimization of Clang's OpenMP 4.5 GPU support

Authors: Matt Martineau, Simon McIntosh-Smith, Carlo Bertolli, Arpith C. Jacob, Samuel F. Antao, Alexandre Eichenberger, Gheorghe-Teodor Bercea, Georgios Rokos, Zehra Sura

PMBS '16: Proceedings of the 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems

Pages 54 - 64

Published: 13 November 2016

Abstract

The Clang implementation of OpenMP® 4.5 now provides full support for the specification, offering the only open source option for targeting NVIDIA® GPUs. While using OpenMP allows portability across different architectures, matching native CUDA® performance without major code restructuring is an open research issue.

To analyze the current performance, we port a suite of representative benchmarks and the mature mini-apps TeaLeaf, CloverLeaf, and SNAP to the Clang OpenMP 4.5 compiler. We then collect performance results for those ports, and for their equivalent CUDA ports, on an NVIDIA Kepler GPU. Through manual analysis of the generated code, we identify the root causes of the performance differences between OpenMP and CUDA.

A number of improvements can be made to the existing compiler implementation to enable performance that approaches that of hand-optimized CUDA. Our first observation was that the generated code did not use fused-multiply-add instructions, which was resolved using an existing flag. Next, we saw that the compiler was not routing any loads through the non-coherent cache, so we added a new flag to the compiler to address this problem.
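As an illustration of the kind of code these two observations concern, the following is a minimal sketch of an offloaded triad-style kernel. The function name `triad` and the kernel itself are hypothetical and not taken from the paper. The expression `b[i] * scalar + c[i]` is a candidate for contraction into a single fused-multiply-add; in Clang, contraction is governed by the `-ffp-contract` option, although the abstract does not say which flag the authors used, so treat that as an assumption. The `const` and `restrict` qualifiers mark the inputs as read-only and non-aliased, which is the kind of information a compiler needs before it can route loads through a read-only cache path.

```c
/* Hypothetical triad-style kernel: a[i] = b[i] * scalar + c[i].
 * The multiply-add is a candidate for FMA contraction; whether it is
 * actually fused depends on the compiler flags in use (assumption). */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double scalar, int n)
{
    /* OpenMP 4.5 offload directive. Without OpenMP enabled at compile
     * time the pragma is ignored and the loop runs serially on the host. */
    #pragma omp target teams distribute parallel for \
        map(from: a[0:n]) map(to: b[0:n], c[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * scalar + c[i];
    }
}
```

The sketch leaves team and thread counts to the compiler defaults.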

We then observed that the compiler's partitioning of threads and teams could be improved for the majority of kernels, which guided work to ensure that the compiler picks better defaults. We also uncovered a register allocation issue in the existing implementation that, when fixed alongside the other issues, enables performance that is close to CUDA.
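For readers unfamiliar with how teams and threads are expressed in OpenMP 4.5, the sketch below overrides the compiler defaults with the standard `num_teams` and `thread_limit` clauses on a dot-product kernel. The function `dot` and the values of `NUM_TEAMS` and `THREADS_PER_TEAM` are hypothetical, chosen only for illustration; they are not the partitioning the paper arrives at.

```c
/* Hypothetical partitioning values for illustration only. */
enum { NUM_TEAMS = 128, THREADS_PER_TEAM = 128 };

/* Dot product offloaded with an explicit team/thread partitioning. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;

    /* num_teams and thread_limit request an upper bound on the number of
     * teams and on the threads per team. The scalar `sum` is mapped
     * tofrom so the reduced value is copied back to the host. */
    #pragma omp target teams distribute parallel for \
        num_teams(NUM_TEAMS) thread_limit(THREADS_PER_TEAM) \
        map(to: x[0:n], y[0:n]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Omitting the two clauses hands the choice back to the compiler and runtime, which is the default behaviour the paragraph above discusses.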

Finally, we use a further set of kernels to emphasize that support for managing memory hierarchies needs to be introduced into the specification, and we propose a simple option for programming shared caches.


Published In: PMBS '16 (November 2016, 123 pages)

ISBN: 9781509052189

Sponsors: SIGHPC (ACM Special Interest Group on High Performance Computing), IEEE-CS\DATC (IEEE Computer Society)

In-Cooperation: SIGARCH (ACM Special Interest Group on Computer Architecture)

Publisher: IEEE Press

Qualifiers: Research-article

Conference: SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, November 13 - 18, 2016, Salt Lake City, Utah

Overall Acceptance Rate: 9 of 22 submissions, 41%
