
Optimal task scheduling at run time to exploit intra-tile parallelism

Published: 01 February 2003

Abstract

This paper addresses iteration space tiling to minimize the completion time of loops executed on multicomputers. Previous work on tiling assumes that tiles execute atomically in order to minimize synchronization costs. We remove this atomicity restriction so that parallelism within a tile can be exploited by overlapping computation with communication. The effectiveness of tiling then depends critically on the order in which the tasks within a tile are executed. We present a theoretical framework based on equivalence classes that provides an optimal task ordering under assumptions of both fixed and variable orderings of tasks in individual tiles. The framework handles loop-invariant dependences that are unknown at compile time by efficiently generating optimal task orderings at run time, which results in lower loop completion times. Our solution improves on previous approaches [3,10]: unlike them, it is optimal for all problem instances with one dependence vector in one dimension. We also show a performance improvement over these previous results.
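The dependence of completion time on intra-tile task order can be illustrated with a small toy model. The sketch below is not the paper's framework; it assumes a one-dimensional loop in which iteration i depends on iteration i - dep, one processor per tile, unit-cost tasks, and a fixed latency `comm` for values that cross a tile boundary (the transfer is assumed to overlap with computation). The function names, the cost model, and the grouping of iterations by residue modulo `dep` are all assumptions made for illustration.

```python
def atomic_time(num_tiles, tile_size, comm):
    """Completion time when each tile runs atomically: a tile starts only
    after the whole previous tile has finished and its data has arrived."""
    return num_tiles * tile_size + (num_tiles - 1) * comm


def completion_time(num_tiles, tile_size, dep, comm, order):
    """Completion time for non-atomic tiles under the toy model.

    order is a permutation of the local indices 0..tile_size-1 and is
    reused by every tile (a fixed ordering). Iteration i depends on
    iteration i - dep.
    """
    if sorted(order) != list(range(tile_size)):
        raise ValueError("order must be a permutation of the local indices")
    if not 0 < dep <= tile_size:
        raise ValueError("dep must lie in 1..tile_size")
    position = {local: p for p, local in enumerate(order)}
    for local in range(dep, tile_size):
        if position[local - dep] > position[local]:
            raise ValueError("order violates an intra-tile dependence")

    finish = {}  # global iteration index -> finish time
    for k in range(num_tiles):
        clock = 0  # time at which the processor owning tile k is free
        for local in order:
            i = k * tile_size + local
            pred = i - dep
            ready = 0
            if pred >= 0:
                # Values from the previous tile arrive comm units late.
                delay = comm if pred // tile_size != k else 0
                ready = finish[pred] + delay
            start = max(clock, ready)
            finish[i] = start + 1
            clock = finish[i]
    return max(finish.values())


if __name__ == "__main__":
    K, T, d, c = 4, 8, 2, 3
    natural = list(range(T))
    # Run one residue class modulo d at a time (hypothetical alternative).
    by_class = sorted(range(T), key=lambda j: (j % d, j))
    print("atomic tiles:      ", atomic_time(K, T, c))
    print("natural order:     ", completion_time(K, T, d, c, natural))
    print("class-by-class:    ", completion_time(K, T, d, c, by_class))
```

With these assumed parameters the two non-atomic orderings finish earlier than the atomic schedule, and they differ from each other, which is the effect the abstract describes. The sketch makes no claim that either ordering is optimal.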

References

[1] A. Agarwal, D. Kranz, V. Natarajan, Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems 6 (9) (1995) 943-962.
[2] J.M. Anderson, M.S. Lam, Global optimizations for parallelism and locality on scalable parallel machines, in: Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, June 1993, pp. 112-125.
[3] W.H. Chou, S.Y. Kung, Scheduling partitioned algorithms on processor arrays with limited communication supports, in: Proceedings of the International Conference on Application Specific Array Processors (ASAP), 1993, pp. 53-64.
[4] S. Coleman, K. McKinley, Tile size selection using cache organization and data layout, in: Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, vol. 30(6), June 1995, pp. 279-290.
[5] F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, Y. Robert, Scheduling block-cyclic array redistribution, in: Parallel Computing '97 (ParCo97), North-Holland, Amsterdam, 1997.
[6] F. Desprez, J. Dongarra, F. Rastello, Y. Robert, Determining the idle time of a tiling: new results, in: Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT '97), IEEE/ACM, 1997, pp. 307-321.
[7] F. Desprez, P. Ramet, J. Roman, Optimal grain size computation for pipelined algorithms, in: Europar'96 Parallel Processing, Lecture Notes in Computer Science, vol. 1123, Springer Verlag, 1996, pp. 165-172.
[8] E.H. D'Hollander, Partitioning and labeling of loops by unimodular transformations, IEEE Transactions on Parallel and Distributed Systems 3 (4) (1992) 465-476.
[9] M. Dion, Alignement et distribution en parallélisation Automatique, PhD thesis, Ecole Normale Supérieure de Lyon, January 1996.
[10] M. Dion, T. Risset, Y. Robert, Resource-constrained scheduling of partitioned algorithms on processor arrays, in: Proceedings of Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, Silver Spring, MD, 1995, pp. 571-580.
[11] J.S. Frame, G. de B. Robinson, R.M. Thrall, The hook graphs of the symmetric group, Canadian Journal of Mathematics 6 (1954) 316-325.
[12] F. Irigoin, R. Triolet, Supernode partitioning, in: 15th Symposium on Principles of Programming Languages (POPL XV), January 1988, pp. 319-329.
[13] W.K. Kaplow, B.K. Szymanski, Tiling for parallel execution--optimizing node cache performance, in: Workshop on Challenges in Compiling for Scaleable Parallel Systems, Eighth IEEE Symposium on Parallel and Distributed Processing, 1996.
[14] W. Li, Compiler cache optimizations for banded matrix problems, in: Conference Proceedings of the 1995 International Conference on Supercomputing, July 1995, pp. 21-30.
[15] MPI Forum, MPI: a message passing interface standard, June 1995. Version 1.1, http://www.mcs.anl.gov/mpi/.
[16] H. Ohta, Y. Saito, M. Kainaga, H. Ona, Optimal tile size adjustment in compiling general DOACROSS loop nests, in: Conference Proceedings of the 1995 International Conference on Supercomputing, July 1995, pp. 270-279.
[17] S.S. Pande, A compile time partitioning method for DOALL loops on distributed memory systems, in: Proceedings of the 1996 International Conference on Parallel Processing, vol. III (Software), IEEE Computer Society Press, Silver Spring, MD, 1996, pp. 35-44.
[18] P.M. Petersen, D.A. Padua, Experimental evaluation of some data dependence tests (extended abstract), Technical report, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, February 1991, CSRD Report 1080.
[19] J. Ramanujam, P. Sadayappan, Tiling multidimensional iteration spaces for multicomputers, Journal of Parallel and Distributed Computing 16 (1992) 108-120.
[20] Stanford University, The SUIF Library, 1994. This manual is a part of the SUIF compiler documentation set, http://suif.stanford.edu/.
[21] P. Tang, J.N. Zigman, Reducing data communication overhead for DOACROSS loop nests, in: Conference Proceedings of the 1994 International Conference on Supercomputing, July 1994, pp. 44-53.
[22] C.-W. Tseng, An optimizing Fortran D compiler for MIMD distributed memory machines, Ph.D. thesis, Technical report, Center for Research in Parallel Computing, Rice University, January 1993, CRPC-TR93291-S.
[23] M.E. Wolf, M.S. Lam, A data locality optimizing algorithm, in: Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991, pp. 30-44.
[24] M.J. Wolfe, More iteration space tiling, in: Proceedings of Supercomputing '89, November 1989, pp. 655-664.
[25] J. Xue, On tiling as a loop transformation, Parallel Processing Letters 7 (4) (1997) 409-424.

Cited By

  • (2006) Dynamic multi phase scheduling for heterogeneous cluste, Proceedings of the 20th international conference on Parallel and distributed processing, 10.5555/1898953.1899004, (72-72), Online publication date: 25-Apr-2006

Published In

Parallel Computing, Volume 29, Issue 2
February 2003
115 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. compile time unknown dependences
  2. task ordering
  3. tiling

Qualifiers

  • Article
