Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Authors:

José Nelson Amaral,

Muhammad Usman

ACM Transactions on Architecture and Code Optimization (TACO), Volume 16, Issue 3

Article No.: 30, Pages 1 - 26

https://doi.org/10.1145/3333060

Published: 18 July 2019

Abstract

Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations that aim to improve their memory access characteristics. The analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally defined values. Experimental results demonstrate potential for dramatic performance improvements. GPU kernel execution time across the Polybench suite is improved by up to 25.5× on an Nvidia P100, with an overall benchmark improvement of up to 3.2×. An opportunity detected in a SPEC ACCEL benchmark yields a kernel speedup of 86.5×, with a benchmark improvement of 3.3×. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
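
To make the notion of memory coalescing more concrete, the following is a minimal sketch and not the analysis described in the article. It assumes a simplified GPU model in which 32 adjacent loop iterations execute together as one warp and memory is fetched in 128-byte segments; the names `point_difference`, `transactions_per_warp`, `row_per_thread`, and `column_per_thread` are hypothetical. The sketch compares the address difference between adjacent iteration points for two mappings of a row-major array, as a loop interchange might produce.

```python
# Assumed, simplified GPU model (not taken from the article).
WARP_SIZE = 32        # adjacent iterations that issue one memory instruction together
SEGMENT_BYTES = 128   # bytes fetched by a single memory transaction
ELEM_BYTES = 4        # element size, e.g. a 32-bit float

N = 1024              # row length of a hypothetical row-major N x N array


def point_difference(index_fn, t=0):
    """Element distance between the accesses of adjacent iteration points t and t+1."""
    return index_fn(t + 1) - index_fn(t)


def transactions_per_warp(index_fn):
    """Count the distinct memory segments touched by one warp of adjacent iterations."""
    segments = {(index_fn(t) * ELEM_BYTES) // SEGMENT_BYTES for t in range(WARP_SIZE)}
    return len(segments)


def row_per_thread(t, j=0):
    """Parallel iteration t reads A[t][j]: adjacent iterations are N elements apart."""
    return t * N + j


def column_per_thread(t, i=0):
    """After interchange, iteration t reads A[i][t]: adjacent iterations are 1 element apart."""
    return i * N + t


if __name__ == "__main__":
    for name, fn in (("row per thread", row_per_thread),
                     ("column per thread", column_per_thread)):
        print(name, "difference:", point_difference(fn),
              "transactions:", transactions_per_warp(fn))
```

Under these assumptions, a difference of one element between adjacent iterations lets a whole warp be served by a single transaction, while a difference of N elements forces one transaction per iteration. A real analysis also has to establish that the transformation producing the unit-stride mapping is safe, which this sketch does not attempt.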

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
    2. Parallel architectures
      1. Multicore architectures
      2. Single instruction, multiple data

Published In

ACM Transactions on Architecture and Code Optimization Volume 16, Issue 3

September 2019

347 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3341169

Editor:
Koen De Bosschere
Ghent University, Belgium

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 July 2019

Accepted: 01 May 2019

Revised: 01 April 2019

Received: 01 August 2018

Published in TACO Volume 16, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Natural Sciences and Engineering Research Council of Canada
IBM Centre for Advanced Studies
