DOI: 10.1145/1504176.1504209
Research article

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Published: 14 February 2009

Abstract

Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a fully automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. Our approach employs a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time and dynamic scheduling of the parallel tiles on the processor cores for improved, scalable execution. It obviates the need for programmer intervention and rewriting of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.
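As a purely illustrative sketch of the kind of run-time tile scheduling the abstract describes (not the paper's generated code), the following worklist scheduler executes tiles of a hypothetical inter-tile dependence DAG: each tile carries a counter of unsatisfied dependences, and a tile becomes ready for any worker core once its counter reaches zero. The `run_tiles` helper, the tile names, and the toy DAG are assumptions for illustration only.

```python
# Sketch: dynamic scheduling of parallel tiles via run-time dependence counts.
# All names here (run_tiles, the toy DAG) are hypothetical illustrations.
from collections import defaultdict
import threading
import queue


def run_tiles(deps, work, num_workers=4):
    """deps: dict mapping tile -> set of tiles it depends on.
       work: dict mapping tile -> callable executed when the tile runs.
       Returns the order in which tiles completed."""
    if not deps:
        return []
    # Build successor lists and per-tile unsatisfied-dependence counters.
    succs = defaultdict(list)
    count = {t: len(d) for t, d in deps.items()}
    for t, d in deps.items():
        for p in d:
            succs[p].append(t)

    ready = queue.Queue()          # worklist of tiles whose deps are satisfied
    for t, c in count.items():
        if c == 0:
            ready.put(t)

    done, lock, remaining = [], threading.Lock(), [len(deps)]

    def worker():
        while True:
            t = ready.get()
            if t is None:          # poison pill: all tiles finished
                return
            work[t]()              # execute the tile body
            with lock:
                done.append(t)
                remaining[0] -= 1
                # Satisfy one dependence of each successor; release ready ones.
                for s in succs[t]:
                    count[s] -= 1
                    if count[s] == 0:
                        ready.put(s)
                if remaining[0] == 0:
                    for _ in range(num_workers):
                        ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done
```

With a diamond-shaped toy DAG (tile A feeds B and C, which both feed D), any schedule produced by the workers must start with A and end with D, while B and C may run concurrently; this is the load-balancing freedom that a static wavefront schedule with uneven tile costs would not exploit.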



Published In

PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2009, 322 pages
ISBN: 9781605583976
DOI: 10.1145/1504176

Also in ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09), April 2009, 294 pages
ISSN: 0362-1340; EISSN: 1558-1160
DOI: 10.1145/1594835

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. compile-time optimization
      2. dynamic scheduling
      3. run-time optimization

Qualifiers

• Research-article

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%


Cited By

• (2018) ParSy. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18), pages 1-15. DOI: 10.1109/SC.2018.00065
• (2017) Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts. ACM Transactions on Architecture and Code Optimization, 14(4):1-19. DOI: 10.1145/3162016
• (2016) Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory. ACM Transactions on Parallel Computing, 3(2):1-28. DOI: 10.1145/2948975
• (2015) Optimal Parallelogram Selection for Hierarchical Tiling. ACM Transactions on Architecture and Code Optimization, 11(4):1-23. DOI: 10.1145/2687414
• (2014) Author retrospective for PYRROS. ACM International Conference on Supercomputing 25th Anniversary Volume, pages 18-20. DOI: 10.1145/2591635.2591647
• (2014) Compiler Support for Optimizing Memory Bank-Level Parallelism. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), pages 571-582. DOI: 10.1109/MICRO.2014.34
• (2013) Compiling affine loop nests for distributed-memory parallel architectures. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), pages 1-12. DOI: 10.1145/2503210.2503289
• (2013) Semi-automatic restructuring of offloadable tasks for many-core accelerators. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), pages 1-12. DOI: 10.1145/2503210.2503285
• (2013) Proof-Directed Parallelization Synthesis by Separation Logic. ACM Transactions on Programming Languages and Systems, 35(2):1-60. DOI: 10.1145/2491522.2491525
• (2013) Sigma*. ACM SIGPLAN Notices, 48(1):443-456. DOI: 10.1145/2480359.2429123
