research-article

Open access

Divergence analysis

Authors:

Rafael Martins de Souza,

Caroline Collange,

Fernando Magno Quintão PereiraAuthors Info & Claims

ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 35, Issue 4

Article No.: 13, Pages 1 - 36

https://doi.org/10.1145/2523815

Published: 03 January 2014 Publication History

Abstract

Growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them is still challenging. In particular, developers must deal with memory and control-flow divergences. These phenomena stem from a condition that we call data divergence, which occurs whenever two processing elements (PEs) see the same variable name holding different values. This article introduces divergence analysis, a static analysis that discovers data divergences. This analysis, currently deployed in an industrial quality compiler, is useful in several ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the automatic optimization of SIMD programs. We demonstrate this last point by introducing the notion of a divergence-aware register spiller. This spiller uses information from our analysis to either rematerialize or share common data between PEs. As a testimony of its effectiveness, we have tested it on a suite of 395 CUDA kernels from well-known benchmarks. The divergence-aware spiller produces GPU code that is 26.21% faster than the code produced by the register allocator used in the baseline compiler.

References

[1]

Abel, N. E., Budnik, P. P., Kuck, D. J., Muraoka, Y., Northcote, R. S., and Wilhelmson, R. B. 1969. TRANQUIL: A language for an array processing computer. In Proceedings of AFIPS. ACM, 57--73.

Digital Library

[2]

Aho, A. V., Lam, M. S., Sethi, R., and Ullman, J. D. 2006. Compilers: Principles, Techniques, and Tools 2nd Ed. Addison Wesley.

Digital Library

[3]

Aiken, A. and Gay, D. 1998. Barrier inference. In Proceedings of POPL. ACM, 342--354.

Digital Library

[4]

Appel, A. W. 1998. SSA is functional programming. SIGPLAN Not. 33, 4, 17--20.

Digital Library

[5]

Baghsorkhi, S. S., Delahaye, M., Patel, S. J., Gropp, W. D., and Hwu, W.-M. W. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of PPoPP. ACM, 105--114.

Digital Library

[6]

Belady, L. A. 1966. A study of replacement algorithms for a virtual storage computer. IBM Syst. J. 5, 2, 78--101.

Digital Library

[7]

Blelloch, G. and Chatterjee, S. 1990. Vcode: A data-parallel intermediate language. In Proceedings of FMPC. ACM, 471--480.

[8]

Boudier, P. and Sellers, G. 2011. Memory system on Fusion APUs. In Proceedings of the AMD Fusion Developer Summit. AMD.

[9]

Bougé, L. and Levaire, J.-L. 1992. Control structures for data-parallel SIMD languages: Semantics and implementation. Future Gen. Comput. Syst. 8, 4, 363--378.

Digital Library

[10]

Bouknight, W., Denenberg, S. A., McIntyre, D. E., Randall, J. M., Sameh, A. H., and Slotnick, D. L. 1972. The Illiac IV system. IEEE 60, 4, 369--388.

[11]

Briggs, P., Cooper, K. D., and Torczon, L. 1992. Rematerialization. In Proceedings of PLDI. ACM, 311--321.

Digital Library

[12]

Brockmann, K. and Wanka, R. 1997. Efficient oblivious parallel sorting on the MasPar MP-1. Vol. ICSS. 1, 200.

Digital Library

[13]

Budimlic, Z., Cooper, K. D., Harvey, T. J., Kennedy, K., Oberg, T. S., and Reeves, S. W. 2002. Fast copy coalescing and live-range identification. In Proceedings of PLDI. ACM, 25--32.

Digital Library

[14]

Carrillo, S., Siegel, J., and Li, X. 2009. A control-structure splitting optimization for GPGPU. In Proceedings of the 6th ACM Conference Computing Frontiers. ACM, 147--150.

Digital Library

[15]

Cederman, D. and Tsigas, P. 2009. GPU-quicksort: A practical quicksort algorithm for graphics processors. J. Exp. Algorithmics 14, 1, 4--24.

Digital Library

[16]

Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44--54.

Digital Library

[17]

Choi, J.-D., Cytron, R., and Ferrante, J. 1991. Automatic construction of sparse data flow evaluation graphs. In Proceedings of POPL. ACM, 55--66.

Digital Library

[18]

Collange, S., Defour, D., and Zhang, Y. 2009. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proceedings of the Parallel Processing Workshop (HPPC). Lecture Notes in computer science, vol. 6043, Springer, 46--55.

Digital Library

[19]

Cousot, P. and Halbwachs, N. 1978. Automatic discovery of linear restraints among variables of a program. In Proceedings of POPL. ACM, 84--96.

Digital Library

[20]

Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2011. Divergence analysis and optimizations. In Proceedings of International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 320--329.

Digital Library

[21]

Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2013. Profiling divergences in GPU applications. Concur. Comput. Pract. Exp. 25, 6, 775--789.

[22]

Cunningham, D., Bordawekar, R., and Saraswat, V. 2011. GPU programming in a high level language: Compiling ×10 to CUDA. In Proceedings of the ACM SIGPLAN X10 Workshop. ACM, 8:1--8:10.

Digital Library

[23]

Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451--490.

Digital Library

[24]

Darema, F., George, D. A., Norton, V. A., and Pfister, G. F. 1988. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Comput. 7, 1, 11--24.

[25]

Diamos, G., Kerr, A., Yalamanchili, S., and Clark, N. 2010. Ocelot, a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 354--364.

Digital Library

[26]

Dubach, C., Cheng, P., Rabbah, R., Bacon, D. F., and Fink, S. J. 2012. Compiling a high-level language for GPUS: (via language support for architectures and compilers). In Proceedings of ACM SIGPLAN PLDI. ACM, 1--12.

Digital Library

[27]

Farrell, C. A. and Kieronska, D. H. 1996. Formal specification of parallel SIMD execution. Theo. Comp. Science 169, 1, 39--65.

Digital Library

[28]

Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349.

Digital Library

[29]

Flynn, M. 1972. Some computer organizations and their effectiveness. IEEE Trans. Comput. 21, 9, 948--960.

Digital Library

[30]

Fung, W. W. L., Sham, I., Yuan, G., and Aamodt, T. M. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of IEEE/ACM MICRO. IEEE, 407--420.

Digital Library

[31]

Garland, M. and Kirk, D. B. 2010. Understanding throughput-oriented architectures. Comm. ACM 53, 58--66.

Digital Library

[32]

Gou, C. and Gaydadjiev, G. 2013. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. Int. J. Parallel Program. 41, 3, 400--429.

[33]

Grover, V., Joannes, B., Aarts, M., and Murphy, M. 2009. Variance analysis for translating CUDA code for execution by a general purpose processor. U.S. Patent 2009/0259997, Filed March 31, 2009, and issued October 15, 2009.

[34]

Hack, S. and Goos, G. 2006. Optimal register allocation for SSA-form programs in polynomial time. Inf. Process. Lett. 98, 4, 150--155.

Digital Library

[35]

Han, T. D. and Abdelrahman, T. S. 2011. Reducing branch divergence in GPU programs. In Proceedings of GPGPU-4. ACM, 3:1--3:8.

Digital Library

[36]

Jang, B., Schaa, D., Mistry, P., and Kaeli, D. 2010. Static memory access pattern analysis on a massively parallel GPU. In Proceedings of PLDI. ACM.

[37]

Karrenberg, R. and Hack, S. 2011. Whole-function vectorization. In Proceedings of the 9th IEEE/ACM CGO. IEEE, 141--150.

Digital Library

[38]

Karrenberg, R. and Hack, S. 2012. Improving performance of OpenCL on CPUS. In Proceedings of CC. Springer, 1--20.

Digital Library

[39]

Keryell, R., Materat, P., and Paris, N. 1991. POMP, or how to design a massively parallel machine with small developments. In Proceedings of Parallel Architectures and Languages Europe (PARLE). Lecture Note in Computer Science, vol. 505, Springer, 83--100.

[40]

Kung, S.-Y., Arun, K. S., Gal-Ezer, R. J., and Bhaskar Rao, D. V. 1982. Wavefront array processor: Language, architecture, and applications. IEEE Trans. Comput. 31, 1054--1066.

Digital Library

[41]

Lashgar, A. and Baniasadi, A. 2011. Performance in GPU architectures: Potentials and distances. In Proceedings of the 9th Workshop on Duplicating, Deconstructing, and Debunking (WDDD). IEEE, 75--81.

[42]

Lawrie, D. H., Layman, T., Baer, D., and Randal, J. M. 1975. GLYPNIR-a programming language for ILLIAC IV. Comm. ACM 18, 3, 157--164.

Digital Library

[43]

Lee, S., Min, S.-J., and Eigenmann, R. 2009. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of ACM SIGPLAN PPoPP. ACM, 101--110.

Digital Library

[44]

Lee, Y., Avizienis, R., Bishara, A., Xia, R., Lockhart, D., Batten, C., and Asanovic, K. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA). ACM, 129--140.

Digital Library

[45]

Lee, Y., Krashinsky, R., Grover, V., Keckler, S. W., and Asanovic, K. 2013. Convergence and scalarization for data-parallel architectures. In Proceedings of the IEEE/ACM CGO. ACM, 1--11.

Digital Library

[46]

Leiba, R., Hack, S., and Wald, I. 2012. Extending a C-like language for portable SIMD programming. In Proceedings of ACM SIGPLAN PPOPP. ACM, 65--74.

Digital Library

[47]

Meng, J., Tarjan, D., and Skadron, K. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International Symposium on Computer Architecture (ISCA). ACM, 235--246.

Digital Library

[48]

Miné, A. 2006. The octagon abstract domain. Higher Order Symbol. Comput. 19, 31--100.

Digital Library

[49]

Mu, S., Zhang, X., Zhang, N., Lu, J., Deng, Y. S., and Zhang, S. 2010. IP routing processing with graphic processors. In Proceedings of DATE. IEEE, 93--98.

Digital Library

[50]

Nickolls, J. and Dally, W. J. 2010. The GPU computing era. IEEE Micro 30, 2, 56--69.

Digital Library

[51]

Nickolls, J. and Kirk, D. 2009. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy) 4th Ed. Elsevier, Chapter A, A.1--A.77.

[52]

Nielson, F., Nielson, H. R., and Hankin, C. 2005. Principles of Program Analysis. Springer.

Digital Library

[53]

Ottenstein, K. J., Ballance, R. A., and MacCabe, A. B. 1990. The program dependence Web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proceedings of ACM SIGPLAN PLDI. ACM, 257--271.

Digital Library

[54]

Pereira, F. M. Q. 2011. Dirmap's Blog. http://divmap.wordpress.com/.

[55]

Perrot, R. H. 1979. A language for array and vector processors. ACM Trans. Program. Lang. Syst. 1, 177--195.

Digital Library

[56]

Pharr, M. and Mark, W. R. 2012. ISPC: A SPMD compiler for high-performance CPU programming. In Proceedings of Innorative Parallel Computing (InPar).

[57]

Poletto, M. and Sarkar, V. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5, 895--913.

Digital Library

[58]

Prabhu, T., Ramalingam, S., Might, M., and Hall, M. 2011. EigenCFA: Accelerating flow analysis with GPUs. In Proceedings of POPL. ACM.

Digital Library

[59]

Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and mei W. Hwu, W. 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM SIGPLAN PPoPP. ACM, 73--82.

Digital Library

[60]

Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R., and Mendelson, A. 2009. Programming model for a heterogeneous ×86 platform. In Proceedings of ACM SIGPLAN PLDI. ACM, 431--440.

Digital Library

[61]

Samadi, M., Hormati, A., Mehrara, M., and Mahlke, S. 2012. Adaptive input-aware compilation for graphics engines. In Proceedings of ACM SIGPLAN PLDI. ACM.

Digital Library

[62]

Sampaio, D., Martins, R., Collange, S., and Pereira, F. M. Q. 2012a. Divergence analysis with affine constraints. In Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 137--146.

Digital Library

[63]

Sampaio, D. N., Gedeon, E., Pereira, F. M. Q., and Collange, S. 2012b. Spill code placement for simd machines. In Proceedings of the 16th Brazilian Symposium on Language Programmings (SBLP). Lecture Notes on Computer Science, vol. 7554, Springer, 12--26.

Digital Library

[64]

Sandes, E. F. O. and de Melo, A. C. M. 2010. CUDALIGN: Using GPU to accelerate the comparison of megabase genomic sequences. In Proceedings of ACM SIGPLAN PPoPP. ACM, 137--146.

Digital Library

[65]

Scholz, B., Zhang, C., and Cifuentes, C. 2008. User-input dependence analysis via graph reachability. Tech. rep., Sun, Inc.

Digital Library

[66]

Stratton, J. A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., and Hwu, W.-m. W. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of IEEE/ACM CGO. IEEE, 111--119.

Digital Library

[67]

Stratton, J. A., Rodrigues, C., Sun, I.-J., Obeid, N., Chang, L.-W., Anssari, N., Liu, G. D., and mei W. Hwu, W. 2012. The parboil report. Tech. rep., University of Illinois at Urbana-Champaign.

[68]

Tu, P. and Padua, D. 1995. Efficient building and placing of gating functions. In Proceedings of ACM SIGPLAN PLDI. ACM, 47--55.

Digital Library

[69]

Weiser, M. 1981. Program slicing. In Proceedings of ICSE. IEEE, 439--449.

Digital Library

[70]

Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of ACM SIGPLAN PLDI. ACM, 86--97.

Digital Library

[71]

Zhang, E. Z., Jiang, Y., Guo, Z., and Shen, X. 2010. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS). ACM, 115--126.

Digital Library

[72]

Zhang, E. Z., Jiang, Y., Guo, Z., Tian, K., and Shen, X. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of ASPLOS. ACM, 369--380.

Digital Library

[73]

Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). ACM, 382--393.

Digital Library

Cited By

Soares LCanesche MPereira F(2023)Side-channel Elimination via Partial Control-flow LinearizationACM Transactions on Programming Languages and Systems10.1145/359473645:2(1-43)Online publication date: 26-Jun-2023
https://dl.acm.org/doi/10.1145/3594736
Kandiah VLustig DVilla ONellans DHardavellas NDubach CBruening DHardekopf B(2023)Parsimony: Enabling SIMD/Vector Programming in Standard Compiler FlowsProceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3579990.3580019(186-198)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3579990.3580019
Rosemann JMoll SHack S(2021)An abstract interpretation for SPMD divergence on reducible control flow graphsProceedings of the ACM on Programming Languages10.1145/34343125:POPL(1-31)Online publication date: 4-Jan-2021
https://dl.acm.org/doi/10.1145/3434312
Show More Cited By

Index Terms

Divergence analysis
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs
A modern GPU is designed with many large thread groups to achieve a high throughput and performance. Within these groups, the threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in a lockstep. This ...
Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads
Highlights
- We propose a new thread synchronization heuristic called Min-SP/PC.
- Min-SP/PC ...
Abstract
Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one ...
Divergence Analysis with Affine Constraints
SBAC-PAD '12: Proceedings of the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

The rising popularity of graphics processing units is bringing renewed interest in code optimization techniques for SIMD processors. Many of these optimizations rely on divergence analyses, which classify variables as uniform, if they have the same ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Programming Languages and Systems

ACM Transactions on Programming Languages and Systems Volume 35, Issue 4

December 2013

169 pages

ISSN:0164-0925

EISSN:1558-4593

DOI:10.1145/2560142

Issue’s Table of Contents

Copyright © 2014 ACM.

© 2014 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 January 2014

Accepted: 01 August 2013

Revised: 01 June 2013

Received: 01 December 2012

Published in TOPLAS Volume 35, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Fundação de Amparo à Pesquisa do Estado de Minas Gerais
Web during his stay in Brazil
Conselho Nacional de Desenvolvimento Científico e Tecnológico

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

10
Total Citations
View Citations
979
Total Downloads

Downloads (Last 12 months)193
Downloads (Last 6 weeks)25

Reflects downloads up to 10 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Soares LCanesche MPereira F(2023)Side-channel Elimination via Partial Control-flow LinearizationACM Transactions on Programming Languages and Systems10.1145/359473645:2(1-43)Online publication date: 26-Jun-2023
https://dl.acm.org/doi/10.1145/3594736
Kandiah VLustig DVilla ONellans DHardavellas NDubach CBruening DHardekopf B(2023)Parsimony: Enabling SIMD/Vector Programming in Standard Compiler FlowsProceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3579990.3580019(186-198)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3579990.3580019
Rosemann JMoll SHack S(2021)An abstract interpretation for SPMD divergence on reducible control flow graphsProceedings of the ACM on Programming Languages10.1145/34343125:POPL(1-31)Online publication date: 4-Jan-2021
https://dl.acm.org/doi/10.1145/3434312
Zheng RPai SLee J(2021)Efficient execution of graph algorithms on CPU with SIMD extensionsProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370326(262-276)Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370326
Chandrasekhar AChen GChen PChen WGu JGuo PKumar SLueh GMistry PPan WRaoux TTrifunovic KKandemir MJimborean AMoseley T(2019)IGC: the open source Intel graphics compilerProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314902(254-265)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314902
Sun HFey FZhao JGorlatch SEigenmann RDing CMcKee S(2019)WCCVProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3331059(319-329)Online publication date: 26-Jun-2019
https://dl.acm.org/doi/10.1145/3330345.3331059
Moll SSharma SKurtenacker MHack S(2019)Multi-dimensional Vectorization in LLVMProceedings of the 5th Workshop on Programming Models for SIMD/Vector Processing10.1145/3303117.3306172(1-8)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3303117.3306172
Moreira RCollange CQuintão Pereira F(2017)Function Call Re-VectorizationACM SIGPLAN Notices10.1145/3155284.301875152:8(313-326)Online publication date: 26-Jan-2017
https://dl.acm.org/doi/10.1145/3155284.3018751
Moreira RCollange CQuintão Pereira FSarkar VRauchwerger L(2017)Function Call Re-VectorizationProceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3018743.3018751(313-326)Online publication date: 26-Jan-2017
https://dl.acm.org/doi/10.1145/3018743.3018751
Khorasani FGupta RBhuyan LPrvulovic M(2015)Efficient warp execution in presence of divergence with collaborative context collectionProceedings of the 48th International Symposium on Microarchitecture10.1145/2830772.2830796(204-215)Online publication date: 5-Dec-2015
https://dl.acm.org/doi/10.1145/2830772.2830796

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents