Research Article | Open Access

Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs

Published: 09 January 2015
    Abstract

    Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallel models enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on memory bandwidth, and, as a side effect, reduced power consumption.
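    As a minimal illustration of these data-driven semantics (an added sketch, not part of the article's abstract, written with the standard OpenMP 4.0 depend clauses rather than StarSs or OpenStream syntax), the three sibling tasks below are ordered by their declared in/out dependences instead of a global barrier:

        #include <stdio.h>

        int main(void) {
            double a = 0.0, b = 0.0;
            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task shared(a) depend(out: a)
                a = 1.0;                        /* producer of a */
                #pragma omp task shared(a, b) depend(in: a) depend(out: b)
                b = a + 1.0;                    /* starts as soon as a is ready */
                #pragma omp task shared(b) depend(in: b)
                printf("%f\n", b);              /* starts as soon as b is ready */
            }                                   /* tasks complete before the region exits */
            return 0;
        }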
    In this article, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of interband parallelism (cross-barrier parallelism) in a number of numerical computations that range from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviate the burden of a full-blown dependence resolver tracking the readiness of tasks at runtime. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative dataflow task-parallel language with explicit intertask dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.
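    To make the compilation target concrete, the following is a hedged sketch rather than the article's generated code: it expresses the inter-tile dependences of a blocked Gauss-Seidel-like sweep as a dynamically constructed graph of tasks with point-to-point synchronization, using OpenMP 4.0 depend clauses in place of OpenStream's explicit stream dependences; the array A, tile size B, and sentinel array dep are illustrative names.

        #define N  1024
        #define B  64                       /* illustrative tile size */
        #define NT (N / B)                  /* tiles per dimension */

        static double A[N][N];
        /* One sentinel byte per tile, padded with a halo row and column so that
         * boundary tiles can name a (never-written) neighbor in their inputs. */
        static char dep[NT + 1][NT + 1];

        static void compute_tile(int ti, int tj) {
            for (int i = ti * B; i < (ti + 1) * B; i++)
                for (int j = tj * B; j < (tj + 1) * B; j++)
                    if (i > 0 && j > 0)
                        A[i][j] = 0.5 * (A[i - 1][j] + A[i][j - 1]);
        }

        void sweep(void) {
            #pragma omp parallel
            #pragma omp single
            for (int ti = 0; ti < NT; ti++)
                for (int tj = 0; tj < NT; tj++) {
                    /* Tile (ti,tj) waits only for its top and left neighbors,
                     * so every ready tile starts without a global barrier. */
                    #pragma omp task depend(in:  dep[ti][tj + 1], dep[ti + 1][tj]) \
                                     depend(out: dep[ti + 1][tj + 1])
                    compute_tile(ti, tj);
                }
        }

    Because each tile waits only on its immediate neighbors, tiles along an anti-diagonal execute concurrently and successive bands overlap, which is the dynamic wavefront, point-to-point synchronization pattern the article targets.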





      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
      January 2015
      797 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2695583
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 January 2015
      Accepted: 01 November 2014
      Revised: 01 November 2014
      Received: 01 June 2014
      Published in TACO Volume 11, Issue 4


      Author Tags

      1. Dataflow
      2. auto-parallelization
      3. dependence partitioning
      4. dynamic wavefront
      5. point-to-point synchronization
      6. polyhedral compiler
      7. polyhedral framework
      8. tile dependences
      9. tiling

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • U.S. National Science Foundation award CCF-1321147
      • Intel's University Research Office Intel Strategic Research Alliance program
      • French “Investments for the Future” grant ManycoreLabs
      • European FP7 project CARP id. 287767

      Article Metrics

      • Downloads (Last 12 months)79
      • Downloads (Last 6 weeks)10
      Reflects downloads up to 26 Jul 2024

      Cited By

      • (2021) Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. In Proceedings of the 35th ACM International Conference on Supercomputing, 13-26. DOI: 10.1145/3447818.3460369. Online publication date: 3 June 2021.
      • (2021) Efficient Compiler Autotuning via Bayesian Optimization. In Proceedings of the 43rd International Conference on Software Engineering, 1198-1209. DOI: 10.1109/ICSE43902.2021.00110. Online publication date: 22 May 2021.
      • (2021) Monoparametric Tiling of Polyhedral Programs. International Journal of Parallel Programming 49, 3, 376-409. DOI: 10.1007/s10766-021-00694-2. Online publication date: 1 June 2021.
      • (2017) Evaluating Performance of Task and Data Coarsening in Concurrent Collections. In Languages and Compilers for Parallel Computing, 331-345. DOI: 10.1007/978-3-319-52709-3_24. Online publication date: 24 January 2017.
      • (2016) PIPES. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-12. DOI: 10.5555/3014904.3014957. Online publication date: 13 November 2016.
      • (2016) Scalable Task Parallelism for NUMA. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 125-137. DOI: 10.1145/2967938.2967946. Online publication date: 11 September 2016.
      • (2016) Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory. ACM Transactions on Parallel Computing 3, 2, 1-28. DOI: 10.1145/2948975. Online publication date: 20 July 2016.
      • (2016) PIPES: A Language and Compiler for Task-Based Programming on Distributed-Memory Clusters. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 456-467. DOI: 10.1109/SC.2016.38. Online publication date: December 2016.
      • (2016) Scalable Hierarchical Polyhedral Compilation. In 2016 45th International Conference on Parallel Processing (ICPP), 432-441. DOI: 10.1109/ICPP.2016.56. Online publication date: August 2016.
      • (2015) Abstract expressionism for parallel performance. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 54-59. DOI: 10.1145/2774959.2774962. Online publication date: 13 June 2015.
