Increasing temporal locality with skewing and recursive blocking

Authors:

John Mellor-Crummey and Robert Fowler

SC '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing

November 2001

Page 43

https://doi.org/10.1145/582034.582077

Published: 10 November 2001

Abstract

We present a strategy, called recursive prismatic time skewing, that increases temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions the iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include: multi-dimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bi-directional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works inter-procedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to original codes and to methods described previously in the literature. With an inter-procedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.
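
To make the idea of skewing a time-step loop concrete, the following is a minimal sketch of plain one-dimensional time skewing applied to an in-place 3-point relaxation. It is not the recursive prismatic method of the paper, which skews in multiple dimensions and blocks recursively; the function names `relax_naive` and `relax_time_skewed`, the stencil, and the `block` parameter are all assumptions chosen for illustration. Skewing the spatial index by the time step (`j = i + t`) makes every dependence point forward in both loops, so blocks of `j` can each run all time steps before moving on, reusing a small window of the array while it is presumably still in cache.

```python
def relax_naive(a, steps):
    """Reference version: one full in-place sweep per time step."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def relax_time_skewed(a, steps, block=4):
    """Illustrative time-skewed version (hypothetical helper, not the paper's code).

    The skewed index j = i + t turns the dependence distances
    (0, 1), (1, 0), (1, -1) in (t, i) into (0, 1), (1, 1), (1, 0) in (t, j).
    All components are non-negative, so blocking the j loop is legal and the
    in-place updates need no extra storage.
    """
    a = list(a)
    n = len(a)
    j_lo = 1                          # smallest skewed index (i = 1, t = 0)
    j_hi = (n - 2) + (steps - 1)      # largest skewed index (i = n - 2, last t)
    for jj in range(j_lo, j_hi + 1, block):
        # Run every time step on this block before advancing to the next one.
        for t in range(steps):
            start = max(jj, 1 + t)                  # clip to i >= 1
            stop = min(jj + block - 1, n - 2 + t)   # clip to i <= n - 2
            for j in range(start, stop + 1):
                i = j - t                           # undo the skew
                a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


if __name__ == "__main__":
    data = [float((7 * k) % 11) for k in range(40)]
    # Both orderings perform the same operations on the same operands.
    print(relax_time_skewed(data, 5, block=4) == relax_naive(data, 5))
```

Each block touches roughly `block + steps` consecutive array elements across all time steps, which is the source of the temporal reuse in this simplified setting. Choosing `block` for a real cache would depend on the machine and is left open here.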

Index Terms

Increasing temporal locality with skewing and recursive blocking
1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory
2. Information systems
  1. Information storage systems
    1. Record storage systems
      1. Record storage alternatives
        Hashed file organization
        Indexed file organization

Published In

SC '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing

November 2001

756 pages

ISBN:158113293X

DOI:10.1145/582034

Conference Chair:
Charles Slocomb
Los Alamos National Laboratory

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SC '01: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2001

Denver, Colorado

Acceptance Rates

SC '01 Paper Acceptance Rate 60 of 240 submissions, 25%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 21
Total Downloads: 313
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 1

Cited By

Li K, Yuan L, Zhang Y, Yue Y, Cao H (2022). An Efficient Vectorization Scheme for Stencil Computation. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 650-660. Online publication date: May-2022.
https://doi.org/10.1109/IPDPS53621.2022.00069
Sai R, Mellor-Crummey J, Meng X, Araya-Polo M, Meng J (2021). Using the Semi-Stencil Algorithm to Accelerate High-Order Stencils on GPUs. 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 63-68. Online publication date: Nov-2021.
https://doi.org/10.1109/PMBS54543.2021.00012
Bisbas G, Luporini F, Louboutin M, Nelson R, Gorman G, Kelly P (2021). Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 497-506. Online publication date: May-2021.
https://doi.org/10.1109/IPDPS49936.2021.00058
Sai R, Mellor-Crummey J, Meng X, Zhou K, Araya-Polo M, Meng J (2021). Accelerating high-order stencils on GPUs. Concurrency and Computation: Practice and Experience 34:20. Online publication date: 22-Aug-2021.
https://doi.org/10.1002/cpe.6467
Yuan L, Huang S, Zhang Y, Cao H (2019). Tessellating Star Stencils. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. Online publication date: 5-Aug-2019.
https://dl.acm.org/doi/10.1145/3337821.3337835
Yuan L, Zhang Y, Guo P, Huang S, Mohr B, Raghavan P (2017). Tessellating stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3126908.3126920
Midkiff S (2012). Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture 7:1, pp. 1-169. Online publication date: 28-Jan-2012.
https://doi.org/10.2200/S00340ED1V01Y201201CAC019
Qasem A, Guo M, Huang Z (2012). Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizations. Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 27-35. Online publication date: 26-Feb-2012.
https://dl.acm.org/doi/10.1145/2141702.2141706
Yang Y, Cui H, Feng X, Xue J (2012). A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs. Journal of Computer Science and Technology 27:1, pp. 57-74. Online publication date: 9-Jan-2012.
https://doi.org/10.1007/s11390-012-1206-3
Beg M, van Beek P (2010). A graph theoretic approach to cache-conscious placement of data for direct mapped caches. ACM SIGPLAN Notices 45:8, pp. 113-120. Online publication date: 5-Jun-2010.
https://dl.acm.org/doi/10.1145/1837855.1806670
