Article

Impact of modern memory subsystems on cache optimizations for stencil computations

Authors:

Katherine Yelick

MSP '05: Proceedings of the 2005 workshop on Memory system performance

Pages 36 - 43

https://doi.org/10.1145/1111583.1111589

Published: 12 June 2005

Abstract

In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory, on 3D stencil computations. These calculations form the basis for a wide range of scientific applications, from simple Jacobi iterations to complex multigrid and block-structured adaptive PDE solvers. First, we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next, we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting stencil-computation performance. Overall, the results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.
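For readers unfamiliar with the terms in the abstract, the sketch below shows what a 3D stencil sweep and a cache-blocked version of it can look like. It is an illustration only, not the authors' benchmark, probe, or cost model: the 7-point operator, the row-major layout, the function names, and the tile sizes `bj` and `bk` are all assumptions made for this example. Python is used purely to show the loop structure; production stencil codes are normally written in C or Fortran, where the traversal order actually affects cache behavior.

```python
# Illustrative sketch only: a generic 7-point stencil, not the paper's probe.

def idx(i, j, k, ny, nz):
    """Flatten (i, j, k) into a row-major offset; k is the unit-stride axis."""
    return (i * ny + j) * nz + k


def update(src, dst, c, ny, nz):
    """Apply an assumed 7-point operator (neighbor sum minus 6x center) at offset c."""
    dst[c] = (src[c - ny * nz] + src[c + ny * nz]   # i-1, i+1 neighbors
              + src[c - nz] + src[c + nz]           # j-1, j+1 neighbors
              + src[c - 1] + src[c + 1]             # k-1, k+1 neighbors
              - 6.0 * src[c])


def sweep_naive(src, dst, nx, ny, nz):
    """One Jacobi-style sweep over the grid interior in plain i, j, k order."""
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                update(src, dst, idx(i, j, k, ny, nz), ny, nz)


def sweep_blocked(src, dst, nx, ny, nz, bj, bk):
    """Same sweep, tiled in j and k while streaming through i.

    The hypothetical tile sizes bj and bk would be chosen so that the
    planes touched by one tile stay resident in cache.
    """
    for jj in range(1, ny - 1, bj):
        j_end = min(jj + bj, ny - 1)
        for kk in range(1, nz - 1, bk):
            k_end = min(kk + bk, nz - 1)
            for i in range(1, nx - 1):
                for j in range(jj, j_end):
                    for k in range(kk, k_end):
                        update(src, dst, idx(i, j, k, ny, nz), ny, nz)
```

Both functions visit the same interior points and compute identical values; only the traversal order differs. The abstract's conclusion concerns how much that reordering still helps on processors with large caches and hardware prefetch.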

Published In

MSP '05: Proceedings of the 2005 workshop on Memory system performance

June 2005

74 pages

ISBN:1595931473

DOI:10.1145/1111583

General Chair: Brad Calder (UC San Diego)
Program Chair: Ben Zorn (Microsoft)

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MSP05

MSP05: Memory Systems Performance Workshop

June 12, 2005

Chicago, Illinois

Acceptance Rates

Overall Acceptance Rate 6 of 20 submissions, 30%
