Single-dimension software pipelining for multidimensional loops

Authors:

R. Govindarajan,

Alban Douillet,

Guang R. Gao

ACM Transactions on Architecture and Code Optimization (TACO), Volume 4, Issue 1

Pages 7 - es

https://doi.org/10.1145/1216544.1216550

Published: 01 March 2007

Abstract

Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. This paper proposes a three-step approach, called single-dimension software pipelining (SSP), to software pipeline a loop nest at an arbitrary loop level, provided that the nest has a rectangular iteration space and contains no sibling inner loops. The first step identifies the most profitable loop level for software pipelining in terms of initiation rate, data reuse potential, or any other optimization criteria. The second step simplifies the multidimensional data-dependence graph (DDG) of the selected loop level into a one-dimensional (1D) DDG and constructs a 1D schedule. Based on the 1D schedule, the third step derives a simple mapping function that specifies the schedule time for the operation instances in the multidimensional loop. Classical modulo scheduling is subsumed by SSP as a special case. SSP is also closely related to hyperplane scheduling and, in fact, extends it to be resource constrained. We prove that SSP schedules are correct and at least as efficient as those generated by traditional modulo scheduling methods. We extend SSP to schedule imperfect loop nests, which are most common at the instruction level. Multiple initiation intervals are naturally allowed to improve execution efficiency. Feasibility and correctness of our approach are verified by a prototype implementation in the ORC compiler for the IA-64 architecture, tested with loop nests from Livermore and SPEC2000 floating-point benchmarks. Preliminary experimental results reveal that, compared to modulo scheduling, software pipelining at an appropriate loop level results in significant performance improvement. Software pipelining is beneficial even with prior loop transformations.
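As a rough illustration of the classical modulo scheduling that the abstract says SSP subsumes, the sketch below computes the textbook lower bound on the initiation interval and the issue time of an operation instance in a one-dimensional schedule. It is not the mapping function derived in the paper; the function names, the resource names, and the sample numbers are assumptions made only for this example.

```python
from math import ceil

def min_initiation_interval(op_counts, unit_counts, recurrences):
    """Textbook lower bound MII = max(ResMII, RecMII).

    op_counts:   operations per iteration that need each resource (hypothetical names)
    unit_counts: available units of each resource
    recurrences: (total latency, total dependence distance) per dependence cycle
    """
    res_mii = max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)

def schedule_time(offset, iteration, ii):
    """Issue cycle of an operation instance: its offset plus iteration * II."""
    return offset + iteration * ii

def respects_dependence(src_offset, dst_offset, latency, distance, ii):
    """True if the consumer, `distance` iterations later, starts at least
    `latency` cycles after the producer."""
    gap = schedule_time(dst_offset, distance, ii) - schedule_time(src_offset, 0, ii)
    return gap >= latency

# Assumed machine: 2 memory units, 1 floating-point unit.
ii = min_initiation_interval({"mem": 3, "fp": 2}, {"mem": 2, "fp": 1}, [(6, 2)])
```

With these assumed inputs the recurrence bound dominates the resource bound, so a successive iteration can start no more often than every `ii` cycles. SSP, as described above, applies a one-dimensional schedule of this kind to a loop level that need not be the innermost one.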

Index Terms

Single-dimension software pipelining for multidimensional loops
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ACM Transactions on Architecture and Code Optimization Volume 4, Issue 1

March 2007

206 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/1216544

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
