
Single-dimension software pipelining for multidimensional loops

Published: 01 March 2007

Abstract

Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. This paper proposes a three-step approach, called single-dimension software pipelining (SSP), to software pipeline a loop nest at an arbitrary loop level that has a rectangular iteration space and contains no sibling inner loops. The first step identifies the most profitable loop level for software pipelining in terms of initiation rate, data reuse potential, or any other optimization criterion. The second step simplifies the multidimensional data-dependence graph (DDG) of the selected loop level into a one-dimensional DDG and constructs a one-dimensional (1D) schedule. Based on the 1D schedule, the third step derives a simple mapping function that specifies the schedule time for the operation instances in the multidimensional loop. Classical modulo scheduling is subsumed by SSP as a special case. SSP is also closely related to hyperplane scheduling and, in fact, extends it to be resource constrained. We prove that SSP schedules are correct and at least as efficient as the schedules generated by traditional modulo scheduling methods. We extend SSP to schedule imperfect loop nests, which are most common at the instruction level. Multiple initiation intervals are naturally allowed to improve execution efficiency. Feasibility and correctness of our approach are verified by a prototype implementation in the ORC compiler for the IA-64 architecture, tested with loop nests from Livermore and SPEC2000 floating-point benchmarks. Preliminary experimental results reveal that, compared to modulo scheduling, software pipelining at an appropriate loop level results in significant performance improvement. Software pipelining is beneficial even with prior loop transformations.
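
As an illustration only, the sketch below shows the classical one-dimensional modulo schedule that the abstract says SSP subsumes as a special case. It is not the SSP mapping function, which the abstract does not spell out. In a classical modulo schedule with initiation interval II, an operation placed at offset sigma(op) in the 1D schedule issues at time sigma(op) + i * II in iteration i. All names, latencies, and resource counts here (`OP_UNITS`, `UNIT_COUNTS`, `DEPS`, `SIGMA`, and the four functions) are hypothetical and chosen for the example.

```python
from math import ceil

# Hypothetical loop body: four operations and the functional unit each one needs.
OP_UNITS = {"load": "mem", "mul": "fpu", "add": "fpu", "store": "mem"}
# Hypothetical machine: one memory port and one floating-point unit.
UNIT_COUNTS = {"mem": 1, "fpu": 1}
# Dependences as (source, destination, latency, iteration distance).
# The add -> add edge with distance 1 models an accumulator recurrence.
DEPS = [
    ("load", "mul", 2, 0),
    ("mul", "add", 3, 0),
    ("add", "store", 1, 0),
    ("add", "add", 1, 1),
]
II = 2                                                # initiation interval under test
SIGMA = {"load": 0, "mul": 2, "add": 5, "store": 7}   # hand-built 1D schedule


def res_mii(op_units, unit_counts):
    """Resource lower bound on II: max over unit types of ceil(uses / units)."""
    uses = {}
    for unit in op_units.values():
        uses[unit] = uses.get(unit, 0) + 1
    return max(ceil(n / unit_counts[u]) for u, n in uses.items())


def schedule_time(sigma, ii, op, iteration):
    """Issue time of `op` in a given iteration of a classical modulo schedule."""
    return sigma[op] + iteration * ii


def respects_dependences(sigma, ii, deps):
    """Each edge needs sigma[dst] + distance * II - sigma[src] >= latency."""
    return all(sigma[dst] + dist * ii - sigma[src] >= lat
               for src, dst, lat, dist in deps)


def respects_resources(sigma, ii, op_units, unit_counts):
    """Modulo reservation table check: count uses per (slot mod II, unit type)."""
    table = {}
    for op, t in sigma.items():
        key = (t % ii, op_units[op])
        table[key] = table.get(key, 0) + 1
    return all(n <= unit_counts[unit] for (_, unit), n in table.items())
```

With these assumed inputs, `res_mii(OP_UNITS, UNIT_COUNTS)` gives 2, the hand-built schedule passes both checks at II = 2, and `schedule_time(SIGMA, II, "add", 3)` gives 11. Moving `store` to offset 6 would put it in the same modulo slot as `load` on the single memory port, so the resource check would fail.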




    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 4, Issue 1
    March 2007
    206 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/1216544
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Software pipelining
    2. loop transformation
    3. modulo scheduling
