Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Authors: José M. Andión, François Bodin, Gabriel Rodríguez, Juan Touriño

International Journal of Parallel Programming, Volume 44, Issue 3

Pages 620-643

https://doi.org/10.1007/s10766-015-0362-9

Published: 01 June 2016

Abstract

The use of GPUs for general-purpose computation has increased dramatically in recent years, driven by the rising demand for computing power and by the high computing capacity that GPUs offer at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, which complicates its exploitation. This paper presents a new technique to automatically rewrite sequential programs into parallel counterparts targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication (SGEMM). The effectiveness of the technique is corroborated by a performance evaluation on NVIDIA GPUs.
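
As a rough illustration of what a locality-oriented standard compiler transformation can look like on one of the two case studies, the sketch below shows SGEMM in plain C before and after loop tiling. It is not taken from the paper: the abstract does not say which transformations are used, so tiling is an assumed example, and the function names, the `TILE` constant, and the square row-major layout are all hypothetical. The OpenHMPP directive appears only inside a comment, written in the commonly documented `codelet` form; its exact spelling should be checked against the OpenHMPP specification before use.

```c
#include <assert.h>

#define TILE 4 /* hypothetical tile size, chosen only for illustration */

/* Reference SGEMM: C = alpha*A*B + beta*C for square n x n row-major
 * matrices. An OpenHMPP-annotated version would typically carry a
 * directive of roughly this shape (assumed spelling, not verified here):
 *   #pragma hmpp sgemm codelet, target=CUDA, args[C].io=inout
 */
void sgemm_naive(int n, float alpha, const float *A, const float *B,
                 float beta, float *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}

/* Smaller of two ints, used to clip the last (partial) tile. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Same computation after tiling the i, j and k loops, so that each
 * TILE x TILE block of A and B is reused while it is still close to
 * the processor. */
void sgemm_tiled(int n, float alpha, const float *A, const float *B,
                 float beta, float *C)
{
    assert(n >= 0);

    /* Apply beta once up front, then accumulate alpha*A*B tile by tile. */
    for (int i = 0; i < n * n; i++)
        C[i] *= beta;

    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE) {
                int i_end = min_int(ii + TILE, n);
                int j_end = min_int(jj + TILE, n);
                int k_end = min_int(kk + TILE, n);
                for (int i = ii; i < i_end; i++)
                    for (int j = jj; j < j_end; j++) {
                        float acc = 0.0f;
                        for (int k = kk; k < k_end; k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += alpha * acc;
                    }
            }
}
```

Both functions compute the same product, but the tiled version sums the partial products in a different order, so results on general floating-point inputs may differ in the last bits. On a GPU the analogous step would usually stage tiles in on-chip memory, which this CPU-only sketch does not attempt.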

Cited By

Rodrigues M, Guimarães B, Pereira F, Kandemir M, Jimborean A, Moseley T (2019). Generation of in-bounds inputs for arrays in memory-unsafe languages. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 136-148. doi:10.5555/3314872.3314890. Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314890
Ramos P, Souza G, Soares D, Araújo G, Pereira F, Evripidou S, Stenström P, O'Boyle M (2018). Automatic annotation of tasks in structured code. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-13. doi:10.1145/3243176.3243200. Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.1145/3243176.3243200
Mendonça G, Guimarães B, Alves P, Pereira M, Araújo G, Pereira F (2017). DawnCC. ACM Transactions on Architecture and Code Optimization 14:2, pp. 1-25. doi:10.1145/3084540. Online publication date: 26-May-2017
https://dl.acm.org/doi/10.1145/3084540

Index Terms

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Parallel programming languages

Index terms have been assigned to the content through auto-classification.

Published In

International Journal of Parallel Programming Volume 44, Issue 3

June 2016

327 pages

ISSN:0885-7458

Copyright © 2016 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States
