research-article

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Authors: Muthu Manikandan Baskaran, Nagavijayalakshmi Vydyanathan, Uday Kumar Reddy Bondhugula, Atanas Rountev, P. Sadayappan

ACM SIGPLAN Notices, Volume 44, Issue 4

Pages 219 - 228

https://doi.org/10.1145/1594835.1504209

Published: 14 February 2009

Abstract

Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a completely automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. In our approach, we employ a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time, and dynamic scheduling of the parallel tiles on the processor cores for improved scalable execution. Our approach obviates the need for programmer intervention and re-writing of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.
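
To make the idea of dynamically scheduling dependent tiles concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes a right-looking tiled Cholesky factorization whose inter-tile dependences are written out by hand (the paper derives them automatically from compiler-generated code), and it runs the resulting task graph with a shared ready pool: a tile task is handed to a worker as soon as its last predecessor finishes. The names `cholesky_tile_deps`, `run_dynamic`, and the task tuples are hypothetical and chosen only for illustration.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def cholesky_tile_deps(nt):
    """Hand-written inter-tile dependences for a right-looking tiled
    Cholesky on an nt x nt tile grid (illustrative task naming).
    Returns {task: set of predecessor tasks}."""
    deps = {}
    for k in range(nt):
        # Factor diagonal tile k once its last update has finished.
        deps[("potrf", k)] = {("syrk", k, k - 1)} if k > 0 else set()
        for i in range(k + 1, nt):
            # Triangular solve on panel tile (i, k).
            d = {("potrf", k)}
            if k > 0:
                d.add(("gemm", i, k, k - 1))
            deps[("trsm", i, k)] = d
        for i in range(k + 1, nt):
            # Update diagonal tile (i, i) with panel tile (i, k).
            d = {("trsm", i, k)}
            if k > 0:
                d.add(("syrk", i, k - 1))
            deps[("syrk", i, k)] = d
            for j in range(k + 1, i):
                # Update off-diagonal tile (i, j).
                d = {("trsm", i, k), ("trsm", j, k)}
                if k > 0:
                    d.add(("gemm", i, j, k - 1))
                deps[("gemm", i, j, k)] = d
    return deps


def run_dynamic(deps, execute, workers=4):
    """Run each task once, only after all its predecessors completed.
    Returns the completion order. Assumes execute() does not raise."""
    succs = defaultdict(list)
    pending = {}
    for task, preds in deps.items():
        pending[task] = len(preds)
        for p in preds:
            succs[p].append(task)

    lock = threading.Lock()
    done = threading.Event()
    remaining = [len(deps)]
    order = []

    with ThreadPoolExecutor(max_workers=workers) as pool:
        def run(task):
            execute(task)
            ready = []
            with lock:
                order.append(task)
                remaining[0] -= 1
                # Release successors whose last predecessor just finished.
                for s in succs[task]:
                    pending[s] -= 1
                    if pending[s] == 0:
                        ready.append(s)
                if remaining[0] == 0:
                    done.set()
            for s in ready:
                pool.submit(run, s)

        roots = [t for t, n in pending.items() if n == 0]
        for t in roots:
            pool.submit(run, t)
        if deps:
            done.wait()
    return order


if __name__ == "__main__":
    graph = cholesky_tile_deps(4)
    for task in run_dynamic(graph, lambda t: None):
        print(task)
```

In this sketch no worker waits at a barrier between steps of the factorization: any idle worker picks up whichever tile has become ready. That property is what addresses the load imbalance the abstract describes for statements of different dimensionalities.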

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation

Published In

ACM SIGPLAN Notices Volume 44, Issue 4

PPoPP '09

April 2009

294 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1594835

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2009
322 pages
ISBN:9781605583976
DOI:10.1145/1504176
General Chair: Daniel Reed (Microsoft Research, USA)
Program Chair: Vivek Sarkar (Rice University, USA)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics

Total Citations: 63
Total Downloads: 913
Downloads (Last 12 months): 14
Downloads (Last 6 weeks): 1

Reflects downloads up to 12 Sep 2024

Cited By

Mahjoub S, Golsorkhtabaramiri M, Amiri S (2023). Optimal uniformization for non-uniform two-level loops using a hybrid method. The Journal of Supercomputing 79:11 (12791-12814). Online publication date: 19-Mar-2023.
https://doi.org/10.1007/s11227-023-05194-3

Abdollahi-Kalkhoran A, Lotfi S, Izadkhah H (2022). TEA-SEA. Expert Systems with Applications: An International Journal 191:C. Online publication date: 1-Apr-2022.
https://dl.acm.org/doi/10.1016/j.eswa.2021.116152

Chang X, Shen L, Wang Q (2021). Optimizing Stencil Codes with Exploiting Data Reuse. 2021 International Conference on Information Control, Electrical Engineering and Rail Transit (ICEERT) (45-54). Online publication date: Oct-2021.
https://doi.org/10.1109/ICEERT53919.2021.00018

Oki Y, Mikami H, Nishida H, Umeda D, Kimura K, Kasahara H (2021). Performance of Static and Dynamic Task Scheduling for Real-Time Engine Control System on Embedded Multicore Processor. Languages and Compilers for Parallel Computing (1-14). Online publication date: 26-Mar-2021.
https://doi.org/10.1007/978-3-030-72789-5_1

Thoman P, Zangerl P, Fahringer T (2019). Static Compiler Analyses for Application-specific Optimization of Task-Parallel Runtime Systems. Journal of Signal Processing Systems 91:3-4 (303-320). Online publication date: 1-Mar-2019.
https://dl.acm.org/doi/10.1007/s11265-018-1356-9

Cheshmi K, Kamil S, Strout M, Dehnavi M (2018). ParSy. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-15). Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.5555/3291656.3291739

Vasilios K, Georgios K, Nikolaos V (2018). Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management. ACM Transactions on Embedded Computing Systems 17:3 (1-25). Online publication date: 22-May-2018.
https://dl.acm.org/doi/10.1145/3202663

Ain Q, Ahmed S, Zafar A, Mehmood M, Waheed A, Chen K, Zhao W, Ma Y (2018). Analysis of hotspot methods in JVM for best-effort run-time parallelization. Proceedings of the 9th International Conference on E-Education, E-Business, E-Management and E-Learning (60-65). Online publication date: 11-Jan-2018.
https://dl.acm.org/doi/10.1145/3183586.3183607

Reguly I, Mudalige G, Giles M (2018). Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS. IEEE Transactions on Parallel and Distributed Systems 29:4 (873-886). Online publication date: 1-Apr-2018.
https://doi.org/10.1109/TPDS.2017.2778161

Cheshmi K, Kamil S, Strout M, Dehnavi M (2018). ParSy. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-15). Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.1109/SC.2018.00065
