Open access

Automatic tiling of iterative stencil loops

Authors:

Yonghong Song

ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 26, Issue 6

Pages 975 - 1028

https://doi.org/10.1145/1034774.1034777

Published: 01 November 2004

Abstract

Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation and signal processing. Such loops iteratively modify the same array elements over different time steps, which presents opportunities for the compiler to improve the temporal data locality through loop tiling. This article presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving the cache performance. The article first presents a technique which allows loop tiling to satisfy data dependences in spite of the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory cost analysis, the article shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A graph-theoretical algorithm, which takes polynomial time, is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size which minimizes capacity misses. Given the tile size, an efficient and general array-padding scheme is applied to remove conflict misses. Experiments were conducted on 16 test programs and preliminary results showed an average speedup of 1.58 and a maximum speedup of 5.06 across those test programs.
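
The skew-then-tile idea in the abstract can be illustrated with a small sketch. This is not the article's framework: it assumes a perfectly nested 1-D three-point stencil with two buffers selected by time-step parity, and a skew factor of 1 chosen by hand rather than by the graph-theoretical algorithm. The function names `stencil_reference` and `stencil_skewed_tiled` and the `tile` parameter are hypothetical. In the skewed coordinate `j = i + t`, every dependence distance is non-negative, so the `j` loop can be cut into tiles that are each carried through all time steps before the next tile starts.

```python
def stencil_reference(a, steps):
    """Untiled version: sweep the whole array once per time step."""
    n = len(a)
    buf = [list(a), list(a)]  # two buffers, selected by time-step parity
    for t in range(steps):
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        for i in range(1, n - 1):  # boundary elements stay fixed
            dst[i] = src[i - 1] + src[i] + src[i + 1]
    return buf[steps % 2]


def stencil_skewed_tiled(a, steps, tile):
    """Same computation, skewed by j = i + t (assumed skew factor 1) and tiled over j."""
    n = len(a)
    buf = [list(a), list(a)]
    j_max = (n - 2) + (steps - 1)  # largest skewed index that is ever used
    for jj in range(1, j_max + 1, tile):  # loop over tiles of the skewed space
        for t in range(steps):  # each tile runs through all time steps
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            lo = max(jj, 1 + t)  # clip the tile to the valid range 1 <= i <= n - 2
            hi = min(jj + tile - 1, (n - 2) + t)
            for j in range(lo, hi + 1):
                i = j - t  # undo the skew to recover the array index
                dst[i] = src[i - 1] + src[i] + src[i + 1]
    return buf[steps % 2]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
    print(stencil_skewed_tiled(data, steps=4, tile=3) == stencil_reference(data, steps=4))
```

Integer arithmetic is used only so the two versions can be compared exactly. A smaller `tile` keeps fewer array elements live across time steps, which is the source of the temporal locality the abstract refers to; choosing the tile size from a memory-cost analysis and padding arrays against conflict misses are outside this sketch.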

Index Terms

Automatic tiling of iterative stencil loops
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

New tiling techniques to improve cache temporal locality
PLDI '99: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation

Tiling is a well-known loop transformation to improve temporal locality of nested loops. Current compiler algorithms for tiling are limited to loops which are perfectly nested or can be transformed, in trivial ways, into a perfect nest. This paper ...
Optimized Unrolling of Nested Loops

Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact ...

Published In

ACM Transactions on Programming Languages and Systems Volume 26, Issue 6

November 2004

142 pages

ISSN:0164-0925

EISSN:1558-4593

DOI:10.1145/1034774

Copyright © 2004 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 60
Total Downloads: 1,109
Downloads (Last 12 months): 69
Downloads (Last 6 weeks): 2

Citations

Cited By

Jia M, Sha E, Zhuge Q, Gu S (2022). Transient computing for energy harvesting systems. Journal of Systems Architecture: the EUROMICRO Journal 132:C. DOI: 10.1016/j.sysarc.2022.102743. Online publication date: 1-Nov-2022.
https://dl.acm.org/doi/10.1016/j.sysarc.2022.102743
Diaz-del-Rio F, Cagigas-Muñiz D, Guisado-Lizar J, Sevillano-Ramos J (2022). Efficient Parallel Implementation of Cellular Automata and Stencil Computations in Current Processors. Advances in Computing, Informatics, Networking and Cybersecurity, 93-120. DOI: 10.1007/978-3-030-87049-2_4. Online publication date: 3-Mar-2022.
https://doi.org/10.1007/978-3-030-87049-2_4
Qiu K, Li Q, Hu J, Zhang W, Xue C (2020). Write Mode Aware Loop Tiling for High-Performance Low-Power Volatile PCM in Embedded Systems. Smart Sensors and Systems, 171-198. DOI: 10.1007/978-3-030-42234-9_10. Online publication date: 11-Jun-2020.
https://doi.org/10.1007/978-3-030-42234-9_10
Koraei M, Fatemi O, Jahre M (2019). DCMI. ACM Transactions on Architecture and Code Optimization 16:4, 1-24. DOI: 10.1145/3352813. Online publication date: 11-Oct-2019.
https://dl.acm.org/doi/10.1145/3352813
Li F, Qiu K, Zhao M, Hu J, Liu Y, Guan Y, Xue C (2019). Checkpointing-Aware Loop Tiling for Energy Harvesting Powered Nonvolatile Processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38:1, 15-28. DOI: 10.1109/TCAD.2018.2803624. Online publication date: 1-Jan-2019.
https://dl.acm.org/doi/10.1109/TCAD.2018.2803624
Qiu K, Zhu Y, Xu Y, Huo Q, Xue C (2019). BRLoop: Constructing balanced retimed loop to architect STT-RAM-based hybrid cache for VLIW processors. Microelectronics Journal 83, 137-146. DOI: 10.1016/j.mejo.2018.11.011. Online publication date: Jan-2019.
https://doi.org/10.1016/j.mejo.2018.11.011
Luo Y, Ghose S, Cai Y, Haratsch E, Mutlu O (2018). Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2:3, 1-48. DOI: 10.1145/3224432. Online publication date: 21-Dec-2018.
https://dl.acm.org/doi/10.1145/3224432
Ghose S, Yaglikçi A, Gupta R, Lee D, Kudrolli K, Liu W, Hassan H, Chang K, Chatterjee N, Agrawal A, O'Connor M, Mutlu O (2018). What Your DRAM Power Models Are Not Telling You. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2:3, 1-41. DOI: 10.1145/3224419. Online publication date: 21-Dec-2018.
https://dl.acm.org/doi/10.1145/3224419
Götzelmann T (2018). Visually Augmented Audio-Tactile Graphics for Visually Impaired People. ACM Transactions on Accessible Computing 11:2, 1-31. DOI: 10.1145/3186894. Online publication date: 8-Jun-2018.
https://dl.acm.org/doi/10.1145/3186894
Dong R, Ratliff L, Cárdenas A, Ohlsson H, Sastry S (2018). Quantifying the Utility-Privacy Tradeoff in the Internet of Things. ACM Transactions on Cyber-Physical Systems 2:2, 1-28. DOI: 10.1145/3185511. Online publication date: 23-May-2018.
https://dl.acm.org/doi/10.1145/3185511
