DOI: 10.1109/CGO.2013.6494992
Article

Runtime dependence computation and execution of loops on heterogeneous systems

Published: 23 February 2013
Abstract

    GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can contain cross-iteration dependences, which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from its tremendous compute capabilities. In this paper, we present an algorithm that computes the cross-iteration dependences of such loops at runtime. The algorithm uses both the CPU and the GPU to compute the dependences; specifically, it exploits the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations, by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized so that each level contains only independent iterations, which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the next level with the actual computation of the current level, to effectively utilize the resources available on the GPU. We use an NVIDIA Tesla C2070 to evaluate our implementation on benchmarks from the PolyBench suite and some synthetic benchmarks. Our experiments show that the proposed technique achieves an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
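    The levelization step described in the abstract can be sketched as a sequential inspector pass over per-iteration access sets. This is an illustrative assumption, not the authors' implementation (the paper collects the accesses on the GPU via generated slice functions); the function name and data layout here are hypothetical:

    ```python
    # Hypothetical sketch of "levelization": once the memory locations touched
    # by each iteration of a loop with indirect accesses (e.g. A[idx[i]] += B[i])
    # are known, iterations are assigned levels so that all iterations in a
    # level are mutually independent and can run in parallel.

    def levelize(n_iters, reads, writes):
        """reads[i]/writes[i]: sets of locations iteration i reads/writes.
        Returns one level number per iteration: an iteration is placed one
        level after its latest conflicting (true/anti/output) predecessor."""
        last_write_level = {}  # location -> level of most recent writer
        last_read_level = {}   # location -> highest level among readers so far
        levels = [0] * n_iters
        for i in range(n_iters):
            lvl = 0
            for loc in reads[i] | writes[i]:   # true/output dependences
                lvl = max(lvl, last_write_level.get(loc, -1) + 1)
            for loc in writes[i]:              # anti dependences
                lvl = max(lvl, last_read_level.get(loc, -1) + 1)
            levels[i] = lvl
            for loc in writes[i]:
                last_write_level[loc] = lvl
            for loc in reads[i]:
                last_read_level[loc] = max(last_read_level.get(loc, -1), lvl)
        return levels

    # For A[idx[i]] += B[i] with idx = [0, 1, 0, 2, 1], each iteration both
    # reads and writes A[idx[i]]:
    idx = [0, 1, 0, 2, 1]
    access = [{loc} for loc in idx]
    print(levelize(5, access, access))  # -> [0, 0, 1, 0, 1]
    ```

    Here iterations {0, 1, 3} form level 0 and {2, 4} form level 1, so each level can be launched as one parallel GPU kernel; the paper additionally overlaps computing the dependences of the next level with executing the current one.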




    Published In

    CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
    February 2013
    366 pages
    ISBN:9781467355247


    Publisher

    IEEE Computer Society

    United States


    Cited By

    • (2023) Runtime Composition of Iterations for Fusing Loop-carried Sparse Dependence. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. doi:10.1145/3581784.3607097 (12 Nov 2023)
    • (2022) Optimizing sparse computations jointly. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 459-460. doi:10.1145/3503221.3508439 (2 Apr 2022)
    • (2020) MatRox. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 389-402. doi:10.1145/3332466.3374548 (19 Feb 2020)
    • (2018) ParSy. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. doi:10.5555/3291656.3291739 (11 Nov 2018)
    • (2018) Juggler. ACM SIGPLAN Notices 53(1), pp. 54-67. doi:10.1145/3200691.3178492 (10 Feb 2018)
    • (2018) Juggler. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 54-67. doi:10.1145/3178487.3178492 (10 Feb 2018)
    • (2018) ParSy. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. doi:10.1109/SC.2018.00065 (11 Nov 2018)
    • (2016) Automating wavefront parallelization for sparse matrix computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. doi:10.5555/3014904.3014959 (13 Nov 2016)
    • (2015) Distributed memory code generation for mixed Irregular/Regular computations. ACM SIGPLAN Notices 50(8), pp. 65-75. doi:10.1145/2858788.2688515 (24 Jan 2015)
    • (2015) Distributed memory code generation for mixed Irregular/Regular computations. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 65-75. doi:10.1145/2688500.2688515 (24 Jan 2015)
