research-article
Open access

Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Published: 18 July 2019

Abstract

Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain the safety and profitability of loop transformations intended to improve their memory access characteristics. The analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally defined values. Experimental results show that the loop transformations it enables can yield large performance improvements: GPU kernel execution time across the Polybench suite improves by up to 25.5× on an Nvidia P100, with overall benchmark improvement of up to 3.2×. An opportunity detected in a SPEC ACCEL benchmark yields a kernel speedup of 86.5× and a benchmark improvement of 3.3×. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
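
As a rough numeric illustration of what "memory coalescing characteristics" means for a parallel loop, the sketch below measures how far apart two adjacent iterations of the parallelized loop land in memory. It is not the paper's symbolic analysis; the helper `access_stride`, the row length `N`, and the sample iteration point are hypothetical names and values chosen for illustration only.

```python
# Illustrative only: a numeric stand-in for reasoning about coalescing,
# not the analysis described in the abstract.

N = 1024  # hypothetical row length of a row-major 2D array


def row_major(i, j):
    """Flattened element index of A[i][j] in a row-major N-wide array."""
    return i * N + j


def access_stride(index_fn, parallel_dim, point=(3, 5)):
    """Element distance between the accesses of two adjacent iterations
    of the loop that is mapped to adjacent GPU threads."""
    i, j = point
    neighbor = (i + 1, j) if parallel_dim == "i" else (i, j + 1)
    return index_fn(*neighbor) - index_fn(i, j)


# Parallelizing the outer loop (i): adjacent threads touch elements N apart.
stride_outer = access_stride(row_major, "i")

# After interchanging the loops so that j is parallel: adjacent threads
# touch adjacent elements, which is the pattern GPUs can coalesce.
stride_inner = access_stride(row_major, "j")


def is_unit_stride(stride):
    """True when adjacent iterations access adjacent elements."""
    return abs(stride) == 1


print(stride_outer, is_unit_stride(stride_outer))
print(stride_inner, is_unit_stride(stride_inner))
```

A unit stride between adjacent iterations is the favorable case here; a stride of `N` suggests that a transformation such as loop interchange may be profitable, provided it is also safe with respect to data dependences.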

        Published In

        ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 3
        September 2019
        347 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3341169

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 18 July 2019
        Accepted: 01 May 2019
        Revised: 01 April 2019
        Received: 01 August 2018
        Published in TACO Volume 16, Issue 3

        Author Tags

        1. GPUs
        2. Heterogeneous Computing
        3. Loop Collapsing
        4. Loop Interchange
        5. Loop Transformations
        6. Memory Coalescing
        7. OpenMP
        8. Performance Portability

        Qualifiers

        • Research-article
        • Research
        • Refereed
