DOI:10.1145/582034.582077

Increasing temporal locality with skewing and recursive blocking

Published: 10 November 2001

    Abstract

    We present a strategy, called recursive prismatic time skewing, that increases temporal reuse at all levels of the memory hierarchy, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions the iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include multi-dimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bi-directional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works inter-procedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels of a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to the original codes and to methods described previously in the literature. With an inter-procedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.
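
    To make the idea of time skewing concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a one-dimensional, in-place three-point relaxation with fixed boundary values, skews the space index by the time step (j = i + t), and then blocks the skewed iteration space in both time and space. The function names and block sizes are hypothetical. Multi-dimensional prisms, bi-directional skewing for periodic boundaries, and recursive blocking are not shown.

```python
def sweep_reference(values, steps):
    """Plain in-place 3-point relaxation: `steps` full sweeps over the interior."""
    a = list(values)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def sweep_time_skewed(values, steps, space_block=8, time_block=4):
    """Same computation, executed block by block in the skewed (t, j = i + t) space.

    In (t, i) coordinates the in-place update has a dependence of distance
    (1, -1), which forbids rectangular blocking across time steps. After the
    skew j = i + t every dependence distance is non-negative in both
    coordinates, so rectangular blocks in (t, j) may run one after another.
    No extra copy of the array is needed.
    """
    a = list(values)
    n = len(a)
    for tt in range(0, steps, time_block):            # blocks of time steps
        t_end = min(tt + time_block, steps)
        # Within this time block the skewed index j runs over
        # [1 + tt, n - 2 + (t_end - 1)].
        for jj in range(1 + tt, n - 2 + t_end, space_block):
            for t in range(tt, t_end):
                j_lo = max(jj, 1 + t)                    # interior starts at i = 1
                j_hi = min(jj + space_block, n - 1 + t)  # interior ends at i = n - 2
                for j in range(j_lo, j_hi):
                    i = j - t                            # undo the skew
                    a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


if __name__ == "__main__":
    start = [float((7 * k) % 11) for k in range(37)]
    plain = sweep_reference(start, 9)
    skewed = sweep_time_skewed(start, 9, space_block=5, time_block=4)
    print(max(abs(x - y) for x, y in zip(plain, skewed)))
```

    Each block touches a window of roughly space_block + time_block array elements while performing time_block updates per element, which is where the additional temporal reuse comes from in this simplified setting.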


    Published In

    SC '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing
    November 2001
    756 pages
    ISBN:158113293X
    DOI:10.1145/582034
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Article

    Conference

    SC '01

    Acceptance Rates

    SC '01 Paper Acceptance Rate 60 of 240 submissions, 25%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2022) An Efficient Vectorization Scheme for Stencil Computation. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 650-660. DOI:10.1109/IPDPS53621.2022.00069. Online publication date: May-2022
    • (2021) Using the Semi-Stencil Algorithm to Accelerate High-Order Stencils on GPUs. 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 63-68. DOI:10.1109/PMBS54543.2021.00012. Online publication date: Nov-2021
    • (2021) Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 497-506. DOI:10.1109/IPDPS49936.2021.00058. Online publication date: May-2021
    • (2021) Accelerating high-order stencils on GPUs. Concurrency and Computation: Practice and Experience, 34(20). DOI:10.1002/cpe.6467. Online publication date: 22-Aug-2021
    • (2019) Tessellating Star Stencils. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI:10.1145/3337821.3337835. Online publication date: 5-Aug-2019
    • (2017) Tessellating stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI:10.1145/3126908.3126920. Online publication date: 12-Nov-2017
    • (2012) Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, 7(1):1-169. DOI:10.2200/S00340ED1V01Y201201CAC019. Online publication date: 28-Jan-2012
    • (2012) Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizations. Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 27-35. DOI:10.1145/2141702.2141706. Online publication date: 26-Feb-2012
    • (2012) A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs. Journal of Computer Science and Technology, 27(1):57-74. DOI:10.1007/s11390-012-1206-3. Online publication date: 9-Jan-2012
    • (2010) A graph theoretic approach to cache-conscious placement of data for direct mapped caches. ACM SIGPLAN Notices, 45(8):113-120. DOI:10.1145/1837855.1806670. Online publication date: 5-Jun-2010
