Optimal task scheduling at run time to exploit intra-tile parallelism

Authors:

Fabrice Rastello,

Santosh Pande

Parallel Computing, Volume 29, Issue 2

Pages 209 - 239

https://doi.org/10.1016/S0167-8191(02)00223-5

Published: 01 February 2003

Abstract

In this paper we address the issue of iteration space tiling to minimize the completion time of loops executed on multicomputers. Previous work on tiling assumes atomic execution of tiles in order to minimize synchronization costs. In this work, we remove the atomicity restriction so that parallelism within tiles can be exploited by overlapping computation with communication. The effectiveness of tiling then depends critically on the execution order of tasks within a tile. We present a theoretical framework based on equivalence classes that provides an optimal task ordering under assumptions of fixed and variable orderings of tasks in individual tiles. The framework handles loop-invariant dependences that are unknown at compile time by efficiently generating optimal task orderings at run time, which results in lower loop completion times. Our solution improves on previous approaches [3,10]: unlike them, it is optimal for all problem instances with one dependence vector in one dimension. We show that the performance improvement over previous results is good.
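To make the role of task ordering concrete, the following is a minimal toy sketch, not the framework of the paper. It assumes two processors, unit-time tasks, a fixed communication delay that overlaps with computation, and no dependences among the tasks of a single tile. The names `completion_time`, `best_order`, `dep` and `comm_delay` are hypothetical and introduced only for this illustration, and the exhaustive search stands in for the equivalence-class ordering, which is not reproduced here.

```python
from itertools import permutations


def completion_time(order_p0, order_p1, dep, comm_delay, atomic=False):
    """Finish time of processor P1 in a two-processor toy model.

    Assumptions (illustrative only): every task takes one time unit, P0 starts
    at time 0, and task t on P1 needs the result of task dep[t] on P0.
    """
    # Time at which P0 finishes each of its tasks, given its execution order.
    finish_p0 = {task: step + 1 for step, task in enumerate(order_p0)}
    if atomic:
        # Atomic tile: all results are sent only after the whole tile is done.
        tile_done = len(order_p0)
        ready = {task: tile_done + comm_delay for task in order_p1}
    else:
        # Non-atomic tile: each result is sent as soon as it is produced,
        # so communication overlaps with the remaining computation on P0.
        ready = {task: finish_p0[dep[task]] + comm_delay for task in order_p1}
    clock = 0
    for task in order_p1:
        # P1 waits for the data if necessary, then computes for one time unit.
        clock = max(clock, ready[task]) + 1
    return clock


def best_order(order_p0, tasks_p1, dep, comm_delay):
    """Exhaustive search over P1's task orders (feasible only for tiny tiles)."""
    return min(
        permutations(tasks_p1),
        key=lambda order: completion_time(order_p0, order, dep, comm_delay),
    )


if __name__ == "__main__":
    dep = {t: t for t in range(4)}   # P1's task t depends on P0's task t
    p0 = [0, 1, 2, 3]
    print(completion_time(p0, [0, 1, 2, 3], dep, 2, atomic=True))  # 10
    print(completion_time(p0, [0, 1, 2, 3], dep, 2))               # 7
    print(completion_time(p0, [3, 2, 1, 0], dep, 2))               # 10
    print(best_order(p0, range(4), dep, 2))
```

In this toy instance, removing atomicity reduces the completion time from 10 to 7 time units, but only when P1 consumes results in the order P0 produces them; the reversed order loses the whole benefit of the overlap.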

References

[1] A. Agarwal, D. Kranz, V. Natarajan, Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems 6 (9) (1995) 943-962.

[2] J.M. Anderson, M.S. Lam, Global optimizations for parallelism and locality on scalable parallel machines, in: Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, June 1993, pp. 112-125.

[3] W.H. Chou, S.Y. Kung, Scheduling partitioned algorithms on processor arrays with limited communication supports, in: Proceedings of the International Conference on Application Specific Array Processors (ASAP), 1993, pp. 53-64.

[4] S. Coleman, K. McKinley, Tile size selection using cache organization and data layout, in: Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, vol. 30(6), June 1995, pp. 279-290.

[5] F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, Y. Robert, Scheduling block-cyclic array redistribution, in: Parallel Computing '97 (ParCo97), North-Holland, Amsterdam, 1997.

[6] F. Desprez, J. Dongarra, F. Rastello, Y. Robert, Determining the idle time of a tiling: new results, in: Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT '97), IEEE/ACM, 1997, pp. 307-321.

[7] F. Desprez, P. Ramet, J. Roman, Optimal grain size computation for pipelined algorithms, in: Europar'96 Parallel Processing, Lecture Notes in Computer Science, vol. 1123, Springer Verlag, 1996, pp. 165-172.

[8] E.H. D'Hollander, Partitioning and labeling of loops by unimodular transformations, IEEE Transactions on Parallel and Distributed Systems 3 (4) (1992) 465-476.

[9] M. Dion, Alignement et distribution en parallélisation Automatique, PhD thesis, Ecole Normale Supérieure de Lyon, January 1996.

[10] M. Dion, T. Risset, Y. Robert, Resource-constrained scheduling of partitioned algorithms on processor arrays, in: Proceedings of Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, Silver Spring, MD, 1995, pp. 571-580.

[11] J.S. Frame, G. de B. Robinson, R.M. Thrall, The hook graphs of the symmetric group, Canadian Journal of Mathematics 6 (1954) 316-325.

[12] F. Irigoin, R. Triolet, Supernode partitioning, in: 15th Symposium on Principles of Programming Languages (POPL XV), January 1988, pp. 319-329.

[13] W.K. Kaplow, B.K. Szymanski, Tiling for parallel execution - optimizing node cache performance, in: Workshop on Challenges in Compiling for Scaleable Parallel Systems, Eighth IEEE Symposium on Parallel and Distributed Processing, 1996.

[14] W. Li, Compiler cache optimizations for banded matrix problems, in: Conference Proceedings of the 1995 International Conference on Supercomputing, July 1995, pp. 21-30.

[15] MPI Forum, MPI: a message passing interface standard, June 1995. Version 1.1, http://www.mcs.anl.gov/mpi/.

[16] H. Ohta, Y. Saito, M. Kainaga, H. Ona, Optimal tile size adjustment in compiling general DOACROSS loop nests, in: Conference Proceedings of the 1995 International Conference on Supercomputing, July 1995, pp. 270-279.

[17] S.S. Pande, A compile time partitioning method for DOALL loops on distributed memory systems, in: Proceedings of the 1996 International Conference on Parallel Processing, vol. III (Software), IEEE Computer Society Press, Silver Spring, MD, 1996, pp. 35-44.

[18] P.M. Petersen, D.A. Padua, Experimental evaluation of some data dependence tests (extended abstract), Technical report, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, February 1991, CSRD Report 1080.

[19] J. Ramanujam, P. Sadayappan, Tiling multidimensional iteration spaces for multicomputers, Journal of Parallel and Distributed Computing 16 (1992) 108-120.

[20] Stanford University, The SUIF Library, 1994. This manual is a part of the SUIF compiler documentation set, http://suif.stanford.edu/.

[21] P. Tang, J.N. Zigman, Reducing data communication overhead for DOACROSS loop nests, in: Conference Proceedings of the 1994 International Conference on Supercomputing, July 1994, pp. 44-53.

[22] C.-W. Tseng, An optimizing Fortran D compiler for MIMD distributed-memory machines, Ph.D. thesis, Technical report, Center for Research in Parallel Computing, Rice University, January 1993, CRPC-TR93291-S.

[23] M.E. Wolf, M.S. Lam, A data locality optimizing algorithm, in: Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991, pp. 30-44.

[24] M.J. Wolfe, More iteration space tiling, in: Proceedings of Supercomputing '89, November 1989, pp. 655-664.

[25] J. Xue, On tiling as a loop transformation, Parallel Processing Letters 7 (4) (1997) 409-424.

Cited By

Ciorba F., Andronikos T., Riakiotakis I., Chronopoulos A., Papakonstantinou G. (2006), Dynamic multi phase scheduling for heterogeneous cluste, Proceedings of the 20th international conference on Parallel and distributed processing, 10.5555/1898953.1899004, (72-72). Online publication date: 25-Apr-2006
https://dl.acm.org/doi/10.5555/1898953.1899004

Index Terms

Optimal task scheduling at run time to exploit intra-tile parallelism
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
  2. Parallel computing methodologies
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Process synchronization
        Scheduling

Information

Published In

Parallel Computing, Volume 29, Issue 2

February 2003

115 pages

ISSN: 0167-8191

Publisher

Elsevier Science Publishers B. V.

Netherlands
