New tiling techniques to improve cache temporal locality

Published: 01 May 1999

Abstract

Tiling is a well-known loop transformation to improve temporal locality of nested loops. Current compiler algorithms for tiling are limited to loops which are perfectly nested or can be transformed, in trivial ways, into a perfect nest. This paper presents a number of program transformations to enable tiling for a class of nontrivial imperfectly-nested loops such that cache locality is improved. We define a program model for such loops and develop compiler algorithms for their tiling. We propose to adopt odd-even variable duplication to break anti- and output dependences without unduly increasing the working-set size, and to adopt speculative execution to enable tiling of loops which may terminate prematurely due to, e.g., convergence tests in iterative algorithms. We have implemented these techniques in a research compiler, Panorama. Initial experiments with several benchmark programs were performed on SGI workstations based on MIPS R5K and R10K processors. Overall, the transformed programs run faster by 9% to 164%.
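
To make the odd-even variable duplication mentioned in the abstract more concrete, the following is a minimal sketch, not code from the paper or from Panorama. It assumes a one-dimensional Jacobi-style relaxation as the example loop; the function names `jacobi_copy_back` and `jacobi_odd_even` are hypothetical. The first version writes into a temporary array and copies it back in every time step. The second keeps two copies of the array and alternates between them by the parity of the time step, so a write in one step never overwrites a value that the same step still has to read, and only one extra copy of the array is needed regardless of the number of time steps.

```python
def jacobi_copy_back(a, steps):
    """Reference version: compute into a temporary, then copy back each step."""
    a = list(a)
    n = len(a)
    tmp = list(a)
    for _ in range(steps):
        for i in range(1, n - 1):
            tmp[i] = (a[i - 1] + a[i + 1]) / 2.0
        # Copy-back loop: this second inner loop makes the nest imperfect.
        for i in range(1, n - 1):
            a[i] = tmp[i]
    return a


def jacobi_odd_even(a, steps):
    """Odd-even duplication (illustrative): two copies chosen by step parity."""
    n = len(a)
    # buf[0] is read on even steps, buf[1] on odd steps.
    # Boundary elements are never written, so both copies keep them unchanged.
    buf = [list(a), list(a)]
    for t in range(steps):
        src = buf[t % 2]
        dst = buf[(t + 1) % 2]
        for i in range(1, n - 1):
            dst[i] = (src[i - 1] + src[i + 1]) / 2.0
    # The copy written by the last step holds the result.
    return buf[steps % 2]


if __name__ == "__main__":
    x = [0.0, 0.0, 0.0, 0.0, 8.0]
    print(jacobi_copy_back(x, 3))
    print(jacobi_odd_even(x, 3))
```

Both functions perform the same arithmetic in the same order, so they return identical results. The sketch shows only the duplication step; the tiling and speculative-execution transformations described in the abstract are not reproduced here.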



Published In

ACM SIGPLAN Notices, Volume 34, Issue 5
May 1999
304 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/301631

PLDI '99: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
May 1999
304 pages
ISBN: 1581130945
DOI: 10.1145/301618
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. caches
  2. loop transformations
  3. optimizing compilers

Qualifiers

  • Article
