research-article

Compact multi-dimensional kernel extraction for register tiling

Authors:

Lakshminarayanan Renganarayana,

Uday Bondhugula,

Salem Derisavi,

Alexandre E. Eichenberger,

Kevin O'Brien

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Article No.: 45, Pages 1 - 12

https://doi.org/10.1145/1654059.1654105

Published: 14 November 2009

Abstract

To achieve high performance on multi-cores, modern loop optimizers apply long sequences of transformations that produce complex loop structures. Downstream optimizations such as register tiling (unroll-and-jam plus scalar promotion) typically provide a significant performance improvement. However, typical register tilers deliver this improvement only when applied to simple loop structures. They often fail on complex loop structures, which leaves a significant amount of performance on the table. We present a technique called compact multi-dimensional kernel extraction (COMDEX) that lets register tilers operate on arbitrarily complex loop structures and so deliver their performance benefits. COMDEX extracts compact unrollable kernels from complex loops. We show that using COMDEX as a pre-processing step before register tiling lets us (i) enable register tiling on complex loop structures and (ii) realize a significant performance improvement on a variety of codes.
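The sketch below gives a concrete picture of register tiling (unroll-and-jam plus scalar promotion) on a non-rectangular loop nest. It is an illustration under stated assumptions and does not reproduce the COMDEX algorithm. The example computation (a lower-triangular matrix-vector product), the function names `trmv_naive` and `trmv_register_tiled`, and the unroll factor of 2 are all hypothetical choices. The inner bound depends on `i`, so the two jammed rows `i` and `i+1` share only the range `0..i`. That shared range serves as the compact, rectangular kernel. The one extra point of row `i+1` is peeled off, and a leftover row (when `n` is odd) is handled separately.

```python
def trmv_naive(A, x):
    """Reference lower-triangular matrix-vector product y = L * x."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(i + 1):          # non-rectangular: inner bound depends on i
            y[i] += A[i][j] * x[j]
    return y


def trmv_register_tiled(A, x):
    """Same computation after a hypothetical 2x unroll-and-jam of i plus scalar promotion.

    Rows i and i+1 share the inner range 0..i, which forms a rectangular kernel;
    row i+1's extra point j = i+1 is peeled. Per-row summation order is unchanged,
    so the results match trmv_naive exactly.
    """
    n = len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:                    # kernel: two rows per iteration
        acc0 = 0.0                      # y[i] held in a scalar (scalar promotion)
        acc1 = 0.0                      # y[i+1] held in a scalar
        for j in range(i + 1):          # inner range common to both rows
            xj = x[j]                   # loaded once, reused by both rows
            acc0 += A[i][j] * xj
            acc1 += A[i + 1][j] * xj
        acc1 += A[i + 1][i + 1] * x[i + 1]  # peeled diagonal point of row i+1
        y[i] = acc0                     # each result stored once
        y[i + 1] = acc1
        i += 2
    if i < n:                           # leftover row when n is odd
        acc = 0.0
        for j in range(i + 1):
            acc += A[i][j] * x[j]
        y[i] = acc
    return y


A = [[1, 0, 0], [2, 3, 0], [4, 5, 6]]
x = [1, 1, 1]
print(trmv_register_tiled(A, x))  # [1.0, 5.0, 15.0]
```

Python will not show a register-level speedup. The sketch only shows the shape of code that a register tiler might emit once a compact rectangular kernel has been separated from the irregular boundary iterations.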

Information

Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

November 2009

778 pages

ISBN:9781605587448

DOI:10.1145/1654059

Conference Chair:
Wilfred Pinfold

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Defense Advanced Research Projects Agency

Conference

SC '09

Sponsor:

SIGARCH
IEEE-CS

SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis

November 14 - 20, 2009

Portland, Oregon

Acceptance Rates

SC '09 paper acceptance rate: 59 of 261 submissions (23%)

Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Bibliometrics

Total citations: 6
Total downloads: 27
Downloads (last 12 months): 1
Downloads (last 6 weeks): 0

Reflects downloads up to 10 Aug 2024.

Cited By

Hao R, Wang Q, Yin S, Zhou T, Zhang Q, Mei S, Shen S, Liu J (2023). Optimizing Depthwise Convolutions on ARMv8 Architecture. Parallel and Distributed Computing, Applications and Technologies, pp. 441-452. Online publication date: 8-Apr-2023.
https://doi.org/10.1007/978-3-031-29927-8_34

Khan A, Mewes H, Grosser T, Hoefler T, Castrillon J (2020). Polyhedral Compilation for Racetrack Memories. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1. Online publication date: 2020.
https://doi.org/10.1109/TCAD.2020.3012266

Moll S, Sharma S, Kurtenacker M, Hack S (2019). Multi-dimensional Vectorization in LLVM. Proceedings of the 5th Workshop on Programming Models for SIMD/Vector Processing, pp. 1-8. Online publication date: 16-Feb-2019.
https://dl.acm.org/doi/10.1145/3303117.3306172

Carretero J, Distefano S, Petcu D, Pop D, Rauber T, Runger G, Singh D (2015). Energy-efficient Algorithms for Ultrascale Systems. Supercomputing Frontiers and Innovations: an International Journal 2(2), pp. 77-104. Online publication date: 6-Apr-2015.
https://dl.acm.org/doi/10.14529/jsfi150205

Renganarayanan L, Kim D, Strout M, Rajopadhye S (2012). Parameterized loop tiling. ACM Transactions on Programming Languages and Systems 34(1), pp. 1-41. Online publication date: 4-May-2012.
https://dl.acm.org/doi/10.1145/2160910.2160912

Bondhugula U, Gunluk O, Dash S, Renganarayanan L, Salapura V, Gschwind M, Knoop J (2010). A model for fusion and code motion in an automatic parallelizing compiler. Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pp. 343-352. Online publication date: 11-Sep-2010.
https://dl.acm.org/doi/10.1145/1854273.1854317
