The memory behavior of cache oblivious stencil computations

Authors:

Volker Strumpen

The Journal of Supercomputing, Volume 39, Issue 2

Pages 93 - 112

https://doi.org/10.1007/s11227-007-0111-y

Published: 01 February 2007

Abstract

We present and evaluate a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy. We evaluate our algorithm in terms of the number of cache misses, and demonstrate that the memory behavior agrees with our theoretical predictions. Our experimental evaluation is based on a finite-difference solution of a heat diffusion problem, as well as a Gauss-Seidel iteration and a 2-dimensional LBMHD program, both reformulated as cache oblivious stencil computations.
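To make the contrast in the abstract concrete, the following is a minimal sketch for the one-dimensional case (n = 1), not the code evaluated in the article. It assumes a 3-point explicit heat-diffusion update with fixed boundary values. The recursion is modeled on the one-dimensional trapezoid walk described in the companion paper by Frigo and Strumpen (ICS '05). The names `naive_heat`, `oblivious_heat`, `kernel`, `walk`, `alpha`, and `slope` are illustrative choices and do not come from the article.

```python
def naive_heat(u0, steps, alpha=0.25):
    """Naive order: sweep the whole grid once per time step."""
    cur, nxt = list(u0), list(u0)  # boundary cells stay fixed in both arrays
    for _ in range(steps):
        for x in range(1, len(cur) - 1):
            nxt[x] = cur[x] + alpha * (cur[x - 1] - 2.0 * cur[x] + cur[x + 1])
        cur, nxt = nxt, cur
    return cur


def oblivious_heat(u0, steps, alpha=0.25, slope=1):
    """Same update, visited by recursive cuts of a space-time trapezoid."""
    buf = [list(u0), list(u0)]  # buf[t % 2] holds time level t

    def kernel(t, x):
        # Compute point x at time level t + 1 from time level t.
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        dst[x] = src[x] + alpha * (src[x - 1] - 2.0 * src[x] + src[x + 1])

    def walk(t0, t1, x0, dx0, x1, dx1):
        # Trapezoid: at time t the region spans
        # [x0 + dx0 * (t - t0), x1 + dx1 * (t - t0)).
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * slope * dt:
                # Wide trapezoid: space cut along a line of slope -slope.
                # The left piece never depends on the right one, so it goes first.
                xm = (2 * (x0 + x1) + (2 * slope + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -slope)
                walk(t0, t1, xm, -slope, x1, dx1)
            else:
                # Tall trapezoid: time cut, lower half before upper half.
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    # Interior points 1 .. len(u0) - 2 for every time step, vertical edges.
    walk(0, steps, 1, 0, len(u0) - 1, 0)
    return buf[steps % 2]
```

Both functions evaluate the same expression at every space-time point, so they should return identical arrays; only the visiting order differs. The recursive version never mentions a cache size: the cuts keep shrinking the trapezoid until, at some recursion level, the working set fits in whatever cache is present. That is the sense in which the approach is cache oblivious. In Python this sketch shows only the traversal order, not the performance effect.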

Published In

The Journal of Supercomputing, Volume 39, Issue 2, February 2007, 154 pages

ISSN: 0920-8542

Copyright © 2007 Springer Science+Business Media, LLC.

Publisher

Kluwer Academic Publishers, United States
