Open access

Automatic tiling of iterative stencil loops

Published: 01 November 2004

    Abstract

    Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation and signal processing. Such loops iteratively modify the same array elements over different time steps, which presents opportunities for the compiler to improve the temporal data locality through loop tiling. This article presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving the cache performance. The article first presents a technique that allows loop tiling to satisfy data dependences in spite of the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory cost analysis, the article shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A graph-theoretical algorithm, which takes polynomial time, is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size which minimizes capacity misses. Given the tile size, an efficient and general array-padding scheme is applied to remove conflict misses. Experiments were conducted on 16 test programs and preliminary results showed an average speedup of 1.58 and a maximum speedup of 5.06 across those test programs.
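
To make the idea of skewing and tiling concrete, the following is a minimal sketch, not the article's framework. It assumes a hypothetical one-dimensional, in-place 3-point relaxation; the function names `stencil_reference` and `stencil_skewed_tiled` and the parameters `tile` and `skew` are illustrative only. For this particular stencil the dependence distances in (time, space) are (0, 1), (1, 0) and (1, -1), so a skew factor of 1 is the smallest that makes every distance nonnegative. That value is worked out by hand here rather than by the article's graph-theoretical algorithm.

```python
def stencil_reference(a, steps):
    """Untiled iterative stencil loop: in-place 3-point relaxation."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def stencil_skewed_tiled(a, steps, tile=4, skew=1):
    """Same computation with the spatial loop skewed by skew * t, then tiled.

    Assumption: skew >= 1, which is what this specific stencil needs for
    the tiled order to respect all data dependences.
    """
    a = list(a)
    n = len(a)
    # The skewed index j = i + skew * t covers 1 .. (n - 2) + skew * (steps - 1).
    j_max = (n - 2) + skew * (steps - 1)
    for j0 in range(1, j_max + 1, tile):       # one tile of the skewed index
        for t in range(steps):                 # run every time step inside the tile
            lo = max(j0, 1 + skew * t)         # clip the tile to valid iterations
            hi = min(j0 + tile - 1, (n - 2) + skew * t)
            for j in range(lo, hi + 1):
                i = j - skew * t               # undo the skew to get the array index
                a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


if __name__ == "__main__":
    data = [float((x * x) % 7) for x in range(20)]
    print(stencil_skewed_tiled(data, 5) == stencil_reference(data, 5))
```

Because the skewed, tiled order preserves every dependence, each update reads exactly the values it reads in the untiled loop, so the two functions return identical lists. Each tile touches only about `tile + skew * steps` distinct array elements across all time steps, which is one way to see why a smaller skew factor means a smaller working set.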

      Published In

      ACM Transactions on Programming Languages and Systems, Volume 26, Issue 6
      November 2004
      142 pages
      ISSN:0164-0925
      EISSN:1558-4593
      DOI:10.1145/1034774

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. caches
      2. loop transformations
      3. optimizing compilers
