Research Article | Open Access

Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs

Published: 09 January 2015
    Abstract

    Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for intertask synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallel models enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on memory bandwidth, and, as a side effect, reduced power consumption.
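    As a minimal illustration of these data-driven semantics (an added sketch, not part of the article's abstract, written with the standard OpenMP 4.0 depend clauses rather than StarSs or OpenStream syntax), the three sibling tasks below are ordered by their declared in/out dependences instead of a global barrier:

        #include <stdio.h>

        int main(void) {
            double a = 0.0, b = 0.0;
            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task shared(a) depend(out: a)
                a = 1.0;                        /* producer of a */
                #pragma omp task shared(a, b) depend(in: a) depend(out: b)
                b = a + 1.0;                    /* starts as soon as a is ready */
                #pragma omp task shared(b) depend(in: b)
                printf("%f\n", b);              /* starts as soon as b is ready */
            }                                   /* tasks complete before the region exits */
            return 0;
        }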
    In this article, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of interband parallelism (cross-barrier parallelism) in a number of numerical computations that range from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviate the burden of a full-blown dependence resolver tracking the readiness of tasks at runtime. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative dataflow task-parallel language with explicit intertask dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.
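    To make the compilation target concrete, the following is a hedged sketch rather than the article's generated code: it expresses the inter-tile dependences of a blocked Gauss-Seidel-like sweep as a dynamically constructed graph of tasks with point-to-point synchronization, using OpenMP 4.0 depend clauses in place of OpenStream's explicit stream dependences; the array A, tile size B, and sentinel array dep are illustrative names.

        #define N  1024
        #define B  64                       /* illustrative tile size */
        #define NT (N / B)                  /* tiles per dimension */

        static double A[N][N];
        /* One sentinel byte per tile, padded with a halo row and column so that
         * boundary tiles can name a (never-written) neighbor in their inputs. */
        static char dep[NT + 1][NT + 1];

        static void compute_tile(int ti, int tj) {
            for (int i = ti * B; i < (ti + 1) * B; i++)
                for (int j = tj * B; j < (tj + 1) * B; j++)
                    if (i > 0 && j > 0)
                        A[i][j] = 0.5 * (A[i - 1][j] + A[i][j - 1]);
        }

        void sweep(void) {
            #pragma omp parallel
            #pragma omp single
            for (int ti = 0; ti < NT; ti++)
                for (int tj = 0; tj < NT; tj++) {
                    /* Tile (ti,tj) waits only for its top and left neighbors,
                     * so every ready tile starts without a global barrier. */
                    #pragma omp task depend(in:  dep[ti][tj + 1], dep[ti + 1][tj]) \
                                     depend(out: dep[ti + 1][tj + 1])
                    compute_tile(ti, tj);
                }
        }

    Because each tile waits only on its immediate neighbors, tiles along an anti-diagonal execute concurrently and successive bands overlap, which is the dynamic wavefront, point-to-point synchronization pattern the article targets.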





      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
      January 2015
      797 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2695583
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 January 2015
      Accepted: 01 November 2014
      Revised: 01 November 2014
      Received: 01 June 2014
      Published in TACO Volume 11, Issue 4


      Author Tags

      1. Dataflow
      2. auto-parallelization
      3. dependence partitioning
      4. dynamic wavefront
      5. point-to-point synchronization
      6. polyhedral compiler
      7. polyhedral framework
      8. tile dependences
      9. tiling

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • U.S. National Science Foundation award CCF-1321147
      • Intel's University Research Office Intel Strategic Research Alliance program
      • French “Investments for the Future” grant ManycoreLabs
      • European FP7 project CARP id. 287767

      Article Metrics

      • Downloads (Last 12 months)79
      • Downloads (Last 6 weeks)10
      Reflects downloads up to 26 Jul 2024

      Cited By

      • (2021) Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. In Proceedings of the 35th ACM International Conference on Supercomputing, 13-26. DOI: 10.1145/3447818.3460369. Online publication date: 3 June 2021.
      • (2021) Efficient Compiler Autotuning via Bayesian Optimization. In Proceedings of the 43rd International Conference on Software Engineering, 1198-1209. DOI: 10.1109/ICSE43902.2021.00110. Online publication date: 22 May 2021.
      • (2021) Monoparametric Tiling of Polyhedral Programs. International Journal of Parallel Programming 49, 3, 376-409. DOI: 10.1007/s10766-021-00694-2. Online publication date: 1 June 2021.
      • (2017) Evaluating Performance of Task and Data Coarsening in Concurrent Collections. In Languages and Compilers for Parallel Computing, 331-345. DOI: 10.1007/978-3-319-52709-3_24. Online publication date: 24 January 2017.
      • (2016) PIPES. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-12. DOI: 10.5555/3014904.3014957. Online publication date: 13 November 2016.
      • (2016) Scalable Task Parallelism for NUMA. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 125-137. DOI: 10.1145/2967938.2967946. Online publication date: 11 September 2016.
      • (2016) Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory. ACM Transactions on Parallel Computing 3, 2, 1-28. DOI: 10.1145/2948975. Online publication date: 20 July 2016.
      • (2016) PIPES: A Language and Compiler for Task-Based Programming on Distributed-Memory Clusters. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 456-467. DOI: 10.1109/SC.2016.38. Online publication date: December 2016.
      • (2016) Scalable Hierarchical Polyhedral Compilation. In 2016 45th International Conference on Parallel Processing (ICPP), 432-441. DOI: 10.1109/ICPP.2016.56. Online publication date: August 2016.
      • (2015) Abstract expressionism for parallel performance. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, 54-59. DOI: 10.1145/2774959.2774962. Online publication date: 13 June 2015.
