Open access

Cache miss equations: a compiler framework for analyzing and tuning memory behavior

Published: 01 July 1999

Abstract

With the ever-widening performance gap between processors and main memory, the cache memories used to bridge this gap are becoming increasingly important. Caches work well for programs that exhibit sufficient locality. Other programs, however, have reference patterns that fail to exploit the cache and therefore suffer heavily from high memory latency. Achieving high cache efficiency and good program performance requires efficient memory access behavior. In fact, for many programs, program transformations or source-code changes can radically alter memory access patterns, significantly improving cache performance. Both hand-tuning and compiler optimization are often used to transform codes to improve cache utilization. Unfortunately, cache conflicts are difficult to predict and estimate, so effective transformations require detailed knowledge about the frequency and causes of cache misses in the code. This article describes methods for generating and solving Cache Miss Equations (CMEs), which give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general difficult, we show that it is also unnecessary: mathematical techniques for manipulating Diophantine equations allow us to compute and/or reduce the number of possible solutions relatively easily, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding, and their generality allows us to reason about interactions between transformations applied in concert. The article also gives examples of their use to determine array padding and offset amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that offers the generality and precision needed for detailed compiler optimizations.
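
To give a rough sense of what such an equation looks like, here is a schematic form in our own notation (an illustration under simplifying assumptions, not the paper's exact formulation). Assume a direct-mapped cache of size $C_s$ with line size $L_s$, and two references $R_A$ and $R_B$ whose addresses $\mathrm{Mem}_A$ and $\mathrm{Mem}_B$ are affine functions of the loop indices:

\[
  \mathrm{Mem}_B(\vec{\jmath}) - \mathrm{Mem}_A(\vec{\imath}) \;=\; n\,C_s + b,
  \qquad n \in \mathbb{Z}, \quad 0 \le b < L_s .
\]

Each integer solution $(\vec{\imath}, \vec{\jmath}, n, b)$ lying within the loop bounds identifies an intervening access that maps to the same cache line as a reuse, i.e., a potential conflict miss; counting or bounding the solutions therefore counts or bounds the misses.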




Reviews

Max Hailperin

Ghosh et al. present a general technique for modeling the cache misses that occur when executing simple loop nests, such as occur in many numerical programs, that generate regular memory reference sequences. The array subscripting and loop indexing are combined with the layout of arrays in memory and with the cache parameters (size and associativity) into a single system of linear Diophantine inequalities. The number of solutions to the system corresponds to the number of cache misses, and, as the authors repeatedly emphasize, this number can typically be calculated efficiently. Minimizing this calculated number of misses can guide the choice of parameters such as padding and tile size. In principle, the restrictive assumptions on the program are substantial, but the authors show that, in practice, 70 percent of loops in the SPECfp benchmark suite can be analyzed, at least once the problem of variable loop bounds is overcome. They describe some promising directions for addressing the loop bounds problem, though it is clear that more work remains here. Although this paper is not light reading, the authors have taken great care to provide concrete illustrations and to introduce concepts in simplified settings, in order to prepare the reader for the full generality that follows. Thanks to these techniques, the paper should be accessible to the many compiler researchers and advanced compiler developers who will want to read it.
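
To make the flavor of this concrete, the following deliberately naive sketch (ours, not the authors') does by brute-force cache simulation what CMEs do analytically: it counts misses for a simple two-array loop nest on an assumed direct-mapped cache and sweeps the inter-array padding to find the amount that minimizes them. The cache parameters, array sizes, and loop nest are all assumptions chosen purely for illustration.

# Illustrative brute-force miss counter (assumed parameters, not from the paper).
# Loop nest simulated:  for i: for j: ... A[i][j] ... B[i][j] ...
CACHE_BYTES = 4096          # assumed direct-mapped cache capacity
LINE_BYTES = 32             # assumed cache line size
ELEM_BYTES = 8              # double-precision array elements
N = 64                      # assumed N x N arrays, row-major layout

def count_misses(pad_elems):
    """Count cache misses when array B starts pad_elems elements after A ends."""
    n_lines = CACHE_BYTES // LINE_BYTES
    resident = [None] * n_lines                 # tag of the line held by each cache set
    base_a = 0
    base_b = (N * N + pad_elems) * ELEM_BYTES   # B laid out after A plus padding
    misses = 0
    for i in range(N):
        for j in range(N):
            for addr in (base_a + (i * N + j) * ELEM_BYTES,
                         base_b + (i * N + j) * ELEM_BYTES):
                line = addr // LINE_BYTES
                idx = line % n_lines            # direct-mapped placement
                if resident[idx] != line:       # miss: line not present in its set
                    resident[idx] = line
                    misses += 1
    return misses

# Sweep candidate paddings and keep the one with the fewest misses.
best = min(range(64), key=count_misses)
print("best padding (elements):", best, "misses:", count_misses(best))

The CME framework reaches this kind of minimum without simulating each candidate, by counting solutions of the Diophantine system directly, which is what makes the analysis usable inside a compiler.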


Information

Published In

ACM Transactions on Programming Languages and Systems, Volume 21, Issue 4
July 1999
192 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/325478
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 1999
Published in TOPLAS Volume 21, Issue 4


Author Tags

  1. cache memories
  2. compilation
  3. optimization
  4. program transformation

Qualifiers

  • Article


Cited By

  • (2024) Parallel Loop Locality Analysis for Symbolic Thread Counts. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 219-232. https://doi.org/10.1145/3656019.3676948
  • (2023) Leveraging LLVM's ScalarEvolution for Symbolic Data Cache Analysis. 2023 IEEE Real-Time Systems Symposium (RTSS), 237-250. https://doi.org/10.1109/RTSS59052.2023.00029
  • (2022) BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation. ACM Transactions on Architecture and Code Optimization 20, 1, 1-28. https://doi.org/10.1145/3558003
  • (2022) Warping Cache Simulation of Polyhedral Programs. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 316-331. https://doi.org/10.1145/3519939.3523714
  • (2022) CARL: Compiler Assigned Reference Leasing. ACM Transactions on Architecture and Code Optimization 19, 1, 1-28. https://doi.org/10.1145/3498730
  • (2021) IOOpt: Automatic Derivation of I/O Complexity Bounds for Affine Programs. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 1187-1202. https://doi.org/10.1145/3453483.3454103
  • (2021) Compiler Support for Near Data Computing. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 90-104. https://doi.org/10.1145/3437801.3441600
  • (2021) Intelligent Resource Provisioning for Scientific Workflows and HPC. 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 9-16. https://doi.org/10.1109/WORKS54523.2021.00007
  • (2021) A Generic Framework to Integrate Data Caches in the WCET Analysis of Real-Time Systems. Journal of Systems Architecture 120, C, 102304. https://doi.org/10.1016/j.sysarc.2021.102304
  • (2020) Scope-Aware Useful Cache Block Calculation for Cache-Related Pre-Emption Delay Analysis With Set-Associative Data Caches. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 10, 2333-2346. https://doi.org/10.1109/TCAD.2019.2937807
