
Exact analysis of the cache behavior of nested loops

Authors: Siddhartha Chatterjee, Philip J. Hanlon, Alvin R. Lebeck

ACM SIGPLAN Notices, Volume 36, Issue 5

Pages 286 - 297

https://doi.org/10.1145/381694.378859

Published: 01 May 2001

Abstract

We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kinds of misses as well as the state of the cache at the end of the loop nest. We use existing tools to simplify these formulas and to count cache misses. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. We also indicate how to handle modest levels of associativity, and how to perform limited symbolic analysis of cache behavior. The complexity of the formulas relates to the static structure of the loop nest rather than to its dynamic trip count, allowing our model to gain efficiency in counting cache misses by exploiting repetitive patterns of cache behavior. Validation against cache simulation confirms the exactness of our formulation. Our method can serve as the basis for a static performance predictor to guide program and data transformations to improve performance.
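For small loop nests, miss counts of the kind the abstract describes can be cross-checked by brute force. The sketch below is not the paper's Presburger-based model; it is a minimal trace-driven simulation of a direct-mapped cache, in the spirit of the cache simulation the abstract says was used for validation. All names (`count_misses`, `loop_nest_trace`, `row_major`, `morton`) and the cache parameters (4 sets, 32-byte lines, 8-byte elements) are assumptions chosen for illustration, and the bit order used in `morton` is one common bit-interleaving convention, not necessarily the layout studied by the authors.

```python
def count_misses(trace, num_sets=4, line_size=32):
    """Count misses of a direct-mapped cache over a sequence of byte addresses.

    num_sets and line_size are illustrative values, not taken from the paper.
    """
    resident = [None] * num_sets      # memory block currently held by each set
    misses = 0
    for addr in trace:
        block = addr // line_size     # memory block containing this byte
        s = block % num_sets          # direct-mapped: one candidate set per block
        if resident[s] != block:
            resident[s] = block       # fetch the block, evicting the old one
            misses += 1
    return misses


def row_major(i, j, n):
    """Canonical row-major element offset of A[i][j] in an n x n array."""
    return i * n + j


def morton(i, j, n):
    """Bit-interleaved (Morton) element offset of A[i][j].

    n is unused; it is kept so both layout functions share one signature.
    Assumed convention: bits of j go to even positions, bits of i to odd ones.
    """
    offset, bit = 0, 0
    while (i >> bit) or (j >> bit):
        offset |= ((j >> bit) & 1) << (2 * bit)
        offset |= ((i >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return offset


def loop_nest_trace(n, layout, column_order=False, elem_size=8):
    """Byte addresses touched by a doubly nested loop reading each A[i][j] once."""
    for outer in range(n):
        for inner in range(n):
            i, j = (inner, outer) if column_order else (outer, inner)
            yield layout(i, j, n) * elem_size


if __name__ == "__main__":
    n = 8
    print(count_misses(loop_nest_trace(n, row_major)))                     # 16
    print(count_misses(loop_nest_trace(n, row_major, column_order=True)))  # 64
    print(count_misses(loop_nest_trace(n, morton, column_order=True)))
```

The cost of this simulation grows with the trip count (n² accesses here), which is the dependence on dynamic behavior that the abstract's formula-based approach is meant to avoid.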


Index Terms

Exact analysis of the cache behavior of nested loops
1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Concurrent programming structures



Published In


ACM SIGPLAN Notices Volume 36, Issue 5

May 2001

330 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/381694


PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
June 2001
331 pages
ISBN:1581134142
DOI:10.1145/378795
Chairmen: Michael Burke (IBM T.J. Watson Research Center), Mary Lou Soffa (Univ. of Pittsburgh)

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Article
