Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article

Exact analysis of the cache behavior of nested loops

Published: 01 May 2001 Publication History

Abstract

We develop from first principles an exact model of the behavior of loop nests executing in a memory hicrarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kinds of misses as well as the state of the cache at the end of the loop nest. We use existing tools to simplify these formulas and to count cache misses. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. We also indicate how to handle modest levels of associativity, and how to perform limited symbolic analysis of cache behavior. The complexity of the formulas relates to the static structure of the loop nest rather than to its dynamic trip count, allowing our model to gain efficiency in counting cache misses by exploiting repetitive patterns of cache behavior. Validation against cache simulation confirms the exactness of our formulation. Our method can serve as the basis for a static performance predictor to guide program and data transformations to improve performance.

References

[1]
A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM Trans. Comput. Syst., 7(2):184-215, May 1989.]]
[2]
N. Ahmed. Locality Enhancement of Imperfectly-nested Loop Nests. PhD thesis, Department of Computer Science, Cornell University, Aug. 2000.]]
[3]
N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. Technical Report TR2000-1782, Cornell University, 2000.]]
[4]
N. Ahmed and K. Pingali. Automatic generation of block-recursive codes. In Proceedings of Europar 2000, pages 125-134, 2000.]]
[5]
M. Alt, C. Ferdinand, F. Martin, and R. Wilhelm. Cache behavior prediction by abstract interpretation. In R. Cousot and D. A. Schmidt, editors, SAS'96, Static Analysis Symposium, volume 1145 of Lecture Notes in Computer Science, pages 51-66. Springer, September 1996.]]
[6]
T. Amon, G. Borriello, T. Hu, and J. Liu. Symbolic timing verification of timing diagrams using Presburger formulas. In Proceedings of DAC 97, pages 226-231, Anaheim, CA, June 1997.]]
[7]
T. Amon, G. Borriello, and J. Liu. Making complex timing relationships readable: Presburger formula simplification using don't cares. In Proceedings of DAC 98, pages 586-590, San Francisco, CA, June 1998.]]
[8]
B. Boigelot and P. Wolper. An automata-theoretic approach to Presburger arithmetic. In A. Mycroft, editor, Proceedings of the Second International Symposium on Static Analysis (SAS '95), volume 983 of Lecture Notes in Computer Science, pages 1-18. Springer Verlag, Sept. 1995.]]
[9]
A. Boudet and H. Comon. Diophantine equations, Presburger arithmetic and finite automata. In H. Kirchner, editor, Proc. Coll. on Trees in Algebra and Programming (CAAP'96), volume 1059 of Lecture Notes in Computer Science, pages 30-43. Springer Verlag, 1996.]]
[10]
P. Boulet and X. Redon. SPPoC: fonctionnemen et applications. Research Report 00-04, LIFL (Laboratoire de Recherche en Informatique de l'Universite des Sciences et Technologies de Lille), 2000. In French. Also see http://www.lifl.fr/west/sppoc/.]]
[11]
M. Brehob and R. Enbody. A mathematical model of locality and caching. Technical Report TR-MSU-CPS-96-TBD, Michigan State University, Nov. 1996.]]
[12]
G. C. Cascaval. Compile-Time Performance Prediction of Scientific Programs. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2000.]]
[13]
S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252-262, San Jose, CA, Oct. 1994.]]
[14]
S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the 1999 ACM International Conference on Supercomputing, pages 444-453, Rhodes, Greece, June 1999.]]
[15]
S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222-231, Saint-Malo, France, June 1999.]]
[16]
S. Chatterjee and S. Sen. Cache-efficient matrix transposition. In Proceedings of HPCA-6, pages 195-205, Toulouse, France, Jan. 2000.]]
[17]
M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 205-217, La Jolla, CA, June 1995.]]
[18]
P. Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In Proceedings of International Conference on Supercomputing, pages 278-285, May 1996.]]
[19]
S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 279-290, La Jolla, CA, June 1995.]]
[20]
P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23-54, 1991.]]
[21]
J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance with source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 206-216, Las Vegas, NV, June 1997.]]
[22]
C. Fricker, O. Temam, and W. Jalby. Influence of cross-interference on blocked loops: A case study with matrix-vector multiply. ACM Trans. Prog. Lang. Syst., 17(4):561-575, July 1995.]]
[23]
S. Ghosh. Cache Miss Equations: Compiler analysis framework for tuning memory behavior. PhD thesis, Department of Electrical Engineering, Princeton University, Nov. 1999.]]
[24]
S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 International Conference on Supercomputing, pages 317-324, Vienna, Austria, July 1997.]]
[25]
S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 228-239, San Jose, CA, Oct. 1998.]]
[26]
S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Prog. Lang. Syst., 21(4):703-746, July 1999.]]
[27]
P. J. Hanlon, D. Chung, S. Chatterjee, D. Genius, A. R. Lebeck, and E. Parker. The combinatorics of cache misses during matrix multiplication. J. Comput. Syst. Sci., 2000. To appear.]]
[28]
J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput., 48(10):1009-1024, Oct. 1999.]]
[29]
M. D. Hill, J. R. Larus, A. R. Lebeck, M. Talluri, and D. A. Wood. Wisconsin architectural research tool set. Computer Architecture News, 21(4):8-10, August 1993.]]
[30]
M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., C-38(12):1612-1630, Dec. 1989.]]
[31]
J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, 1979.]]
[32]
M. T. Kandemir, A. N. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A linear algebra framework for automatic determination of optimal data layouts. IEEE Transactions on Parallel and Distributed Systems, 10(2):115-135, Feb. 1999.]]
[33]
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Calculator and Library, version 1.1.0, Nov. 1996.]]
[34]
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library Version 1.1.0 Interface Guide, Nov. 1996.]]
[35]
W. Kelly and W. Pugh. A framework for unifying reordering transformations. Technical Report CS-TR-3193, Department of Compute Science, University of Maryland, College Park, MD, Apr. 1993.]]
[36]
W. Kelly and W. Pugh. Finding legal reordering transformations using mappings. Technical Report CS-TR-3297, Department of Compute Science, University of Maryland, College Park, MD, June 1994.]]
[37]
R. E. Kessler and M. D. Hill. Page placement algorithms for large real-index caches. ACM Trans. Comput. Syst., 10(4):338-359, 1992.]]
[38]
M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Apr. 1991.]]
[39]
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15-26, Oct. 1994.]]
[40]
A. R. Lebeck and D. A. Wood. Active memory: A new abstraction for memory system simulation. ACM Transactions on Modeling and Computer Simulation, 7(1):42-77, Jan. 1997.]]
[41]
A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languagess, pages 201-214, Paris, France, Jan. 1997.]]
[42]
V. Loechner. PolyLib: A Library for Manipulating Parameterized Polyhedra, Mar. 1999.]]
[43]
M. Martonosi, A. Gupta, and T. Anderson. Memspy: Analyzing memory system bottlenecks in programs. In SIGMETRICS92, pages 1-12, June 1992.]]
[44]
K. S. McKinley and O. Temam. Quantifying loop nest locality using SPEC'95 and the Perfect benchmarks. ACM Trans. Comput. Syst., 17(4):288-336, Nov. 1999.]]
[45]
N. Mitchell, L. Carter, J. Ferrante, and K. H ogstedt. Quantifying the multi-level nature of tiling interactions. In Languages and Compilers for Parallel Computing: 10th Annual Workshop, LCPC'97, number 1366 in Lecture Notes in Computer Science, pages 1-15. Springer, 1998.]]
[46]
D. C. Oppen. A 2 2 2 pn upper bound on the complexity of Presburger arithmetic. J. Comput. Syst. Sci., 16(3):323-332, July 1978.]]
[47]
Y. Paek, J. Hoeflinger, and D. Padua. Simplification of array access patterns for compiler optimizations. In Proceedings of ACM PLDI, volume 33, pages 60-71, May 1998.]]
[48]
A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, Houston, TX, May 1989. Available as technical report CRPC-TR89009.]]
[49]
W. Pugh. Counting solutions to Presburger formulas: How and why. In Proceedings of the ACM SIGPLAN'94 Conference on Programming Language Design and Implementation, pages 121-134, Orlando, FL, June 1994.]]
[50]
G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proceedings of the ACM SIGPLAN'98 Conference on Programming Language Design and Implementation, pages 38-49, Montreal, Canada, June 1998.]]
[51]
G. Rivera and C.-W. Tseng. Eliminating conflict misses for high performance architectures. In Proceedings of the 1998 International Conference on Supercomputing, pages 353-360, Melbourne, Australia, July 1998.]]
[52]
U. Sch~ning. Complexity of Presburger arithmetic with fixed quantifier dimension. Theory of Computing Systems, 30:423-428, 1997.]]
[53]
N. Shibata, K. Okana, T. Higashino, and K. Taniguchi. A decision algorithm dor prenex form rational Presburger sentences based on combinatorial geometry. In Proceedings of the 2nd International Conference on Discrete Mathematics and Theoretical Computer Science and the 5th Australasian Theory Symposium (DMTCS'99+CATS'99), pages 344-359, Jan. 1999.]]
[54]
A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN'94 Conference on Programming Language Design and Implementation, pages 196-205, June 1994.]]
[55]
The Stanford Compiler Group. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. http://suif.stanford.edu.]]
[56]
R. A. Sugumar and S. G. Abraham. Efficient simulation of multiple cache configurations using binomial trees. Technical Report CSE-TR-111-91, 1991.]]
[57]
D. Thiebaut and H. Stone. Footprints in the cache. ACM Trans. Comput. Syst., 5(4):305-329, Nov. 1987.]]
[58]
D. A. B. Weikle, S. A. McKee, and W. A. Wulf. Caches as filters: A new approach to cache analysis. In MASCOTS'98, Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, July 1998.]]
[59]
D. A. B. Weikle, K. Skadron, S. A. McKee, and W. A. Wulf. Caches as filters: A unifying model for memory hierarchy analysis. Technical Report CS-2000-16, University of Virginia, June 2000.]]
[60]
V. Weispfenning. Complexity and uniformity of elimination in Presburger arithmetic. In Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, pages 48-53, Kihei, Maui, HI, July 1997.]]
[61]
M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 30-44, Toronto, Canada, June 1991.]]
[62]
M. J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing'89, pages 655-664, Reno, NV, Nov. 1989.]]
[63]
D. A. Wood, M. D. Hill, and R. E. Kessler. A model for estimating trace-sample miss ratios. In Proceedings of ACM SIGMETRICS, May 1991.]]
[64]
H. Zhang and M. Martonosi. Mathematical cache miss analysis for pointer data structures. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing, Portsmouth, VA, Mar. 2001. CD-ROM.]]

Cited By

View all
  • (2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
  • (2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 1-Aug-2022
  • (2022)LLC Access Behavior Prediction Method for Multi-core ProcessorsAdvances in Natural Computation, Fuzzy Systems and Knowledge Discovery10.1007/978-3-030-89698-0_142(1387-1394)Online publication date: 4-Jan-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 36, Issue 5
May 2001
330 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/381694
Issue’s Table of Contents
  • cover image ACM Conferences
    PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
    June 2001
    331 pages
    ISBN:1581134142
    DOI:10.1145/378795
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2001
Published in SIGPLAN Volume 36, Issue 5

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)24
  • Downloads (Last 6 weeks)0
Reflects downloads up to 06 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Falcon: A Scalable Analytical Cache ModelProceedings of the ACM on Programming Languages10.1145/36564528:PLDI(1854-1878)Online publication date: 20-Jun-2024
  • (2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 1-Aug-2022
  • (2022)LLC Access Behavior Prediction Method for Multi-core ProcessorsAdvances in Natural Computation, Fuzzy Systems and Knowledge Discovery10.1007/978-3-030-89698-0_142(1387-1394)Online publication date: 4-Jan-2022
  • (2021)Intelligent Resource Provisioning for Scientific Workflows and HPC2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS)10.1109/WORKS54523.2021.00007(9-16)Online publication date: Nov-2021
  • (2021)An Analytical Model for Loop Tiling TransformationEmbedded Computer Systems: Architectures, Modeling, and Simulation10.1007/978-3-031-04580-6_7(95-107)Online publication date: 4-Jul-2021
  • (2019)Efficient hierarchical online-autotuningProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3330377(354-366)Online publication date: 26-Jun-2019
  • (2019)A fast analytical model of fully associative cachesProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314606(816-829)Online publication date: 8-Jun-2019
  • (2018)Lightweight detection of cache conflictsProceedings of the 2018 International Symposium on Code Generation and Optimization - CGO 201810.1145/3179541.3168819(200-213)Online publication date: 2018
  • (2018)Lightweight detection of cache conflictsProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168819(200-213)Online publication date: 24-Feb-2018
  • (2018)A Performance Prediction Framework for Irregular Applications2018 IEEE 25th International Conference on High Performance Computing (HiPC)10.1109/HiPC.2018.00042(304-313)Online publication date: Dec-2018
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media