DOI: 10.5555/3019057.3019063
research-article

Performance analysis and optimization of Clang's OpenMP 4.5 GPU support

Published: 13 November 2016

Abstract

The Clang implementation of OpenMP® 4.5 now provides full support for the specification, offering the only open source option for targeting NVIDIA® GPUs. While using OpenMP allows portability across different architectures, matching native CUDA® performance without major code restructuring is an open research issue.
In order to analyze the current performance, we port a suite of representative benchmarks, and the mature mini-apps TeaLeaf, CloverLeaf, and SNAP to the Clang OpenMP 4.5 compiler. We then collect performance results for those ports, and their equivalent CUDA ports, on an NVIDIA Kepler GPU. Through manual analysis of the generated code, we are able to discover the root cause of the performance differences between OpenMP and CUDA.
A number of improvements can be made to the existing compiler implementation to enable performance that approaches that of hand-optimized CUDA. Our first observation was that the generated code did not use fused-multiply-add instructions, which was resolved using an existing flag. Next we saw that the compiler was not passing any loads through non-coherent cache, and added a new flag to the compiler to assist with this problem.
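
As an illustration of the two code-generation issues described above, a minimal sketch in C with OpenMP 4.5 target directives follows. The kernel name `triad` and its parameters are hypothetical and are not taken from the paper's benchmarks. The expression `b[i] + s * c[i]` is a candidate for a single fused multiply-add; Clang's `-ffp-contract=fast` is one existing option that permits such contraction, though the abstract does not name the flag the authors used. The `const` and `restrict` qualifiers mark the inputs as read-only and non-aliased, which is the kind of information a compiler needs before it can route loads through a read-only (non-coherent) cache. The new flag the authors added is not named in the abstract and is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from the paper): a[i] = b[i] + s * c[i]. */
void triad(size_t n, double s, double *restrict a,
           const double *restrict b, const double *restrict c) {
    assert(a != NULL && b != NULL && c != NULL);

    /* Offload the loop; inputs are copied to the device, the result copied back. */
    #pragma omp target teams distribute parallel for \
        map(from: a[0:n]) map(to: b[0:n], c[0:n])
    for (size_t i = 0; i < n; ++i) {
        /* One multiply feeding one add: a fused multiply-add candidate. */
        a[i] = b[i] + s * c[i];
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs on the host, so the sketch can be checked for correctness before any offloading is attempted.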
We then observed that the compiler partitioning of threads and teams could be improved upon for the majority of kernels, which guided work to ensure that the compiler can pick more optimal defaults. We uncovered a register allocation issue with the existing implementation that, when fixed alongside the other issues, enables performance that is close to CUDA.
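
The partitioning of work into teams and threads can also be set explicitly in OpenMP 4.5 rather than left to compiler defaults. The following is a minimal sketch, assuming a hypothetical `scale_add` kernel and caller-chosen values for the `num_teams` and `thread_limit` clauses. Suitable values depend on the kernel and the device, and nothing here reflects the defaults the authors arrived at.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from the paper): y[i] += s * x[i]. */
void scale_add(size_t n, double s, const double *restrict x,
               double *restrict y, int teams, int threads_per_team) {
    assert(teams > 0 && threads_per_team > 0);

    /* num_teams requests a team count; thread_limit caps threads per team. */
    #pragma omp target teams distribute parallel for \
        num_teams(teams) thread_limit(threads_per_team) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] += s * x[i];
    }
}
```

Passing the team and thread counts as arguments makes it straightforward to sweep over configurations when comparing against compiler-chosen defaults.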
Finally, we use some different kernels to emphasize that support for managing memory hierarchies needs to be introduced into the specification, and propose a simple option for programming shared caches.

Cited By

  • (2019) Performance evaluation of OpenMP's target construct on GPUs - exploring compiler optimisations. International Journal of High Performance Computing and Networking 13:1 (54-69). DOI: 10.5555/3302714.3302718. Online publication date: 9-Feb-2019
  • (2018) HERO. Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems (1-6). DOI: 10.1145/3295816.3295821. Online publication date: 4-Nov-2018
  • (2018) Dwarfs on Accelerators. Workshop Proceedings of the 47th International Conference on Parallel Processing (1-10). DOI: 10.1145/3229710.3229729. Online publication date: 13-Aug-2018
  • (2018) Cluster Programming using the OpenMP Accelerator Model. ACM Transactions on Architecture and Code Optimization 15:3 (1-23). DOI: 10.1145/3226112. Online publication date: 28-Aug-2018

Published In

PMBS '16: Proceedings of the 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems
November 2016
123 pages
ISBN: 9781509052189

Publisher

IEEE Press


Conference

SC16

Acceptance Rates

Overall Acceptance Rate: 9 of 22 submissions, 41%
