Compiler optimizations for improving data locality

Authors:

Kathryn S. McKinley,

Chau-Wen Tseng

ACM SIGOPS Operating Systems Review, Volume 28, Issue 5

Pages 252 - 262

https://doi.org/10.1145/381792.195557

Published: 01 November 1994

Abstract

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.
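The idea of ranking loop organizations by cache-line reuse can be illustrated with a small sketch. It is a simplified stand-in, not the paper's cost model: it assumes row-major arrays, one trip count `trip` shared by every loop, subscripts that are plain loop variables, and a cache line that holds `line_elems` array elements. The names `ref_cost`, `loop_cost`, and `rank_inner_loops` are hypothetical.

```python
from math import ceil


def ref_cost(subscripts, inner, trip, line_elems):
    """Estimate cache lines touched by one reference while loop `inner` runs."""
    if inner not in subscripts:
        # The reference does not change with the inner loop: temporal reuse.
        return 1
    if subscripts[-1] == inner and inner not in subscripts[:-1]:
        # Inner loop walks the contiguous (last) dimension: spatial reuse.
        return ceil(trip / line_elems)
    # Inner loop strides across rows: assume a new line on every iteration.
    return trip


def loop_cost(refs, loops, inner, trip, line_elems):
    """Estimated lines touched by the whole nest with `inner` innermost."""
    outer_iterations = trip ** (len(loops) - 1)
    per_sweep = sum(ref_cost(r, inner, trip, line_elems) for r in refs)
    return outer_iterations * per_sweep


def rank_inner_loops(refs, loops, trip, line_elems):
    """Return (loop, cost) pairs, cheapest innermost candidate first."""
    costs = {l: loop_cost(refs, loops, l, trip, line_elems) for l in loops}
    return sorted(costs.items(), key=lambda item: item[1])


# Matrix multiplication: C[i][j] += A[i][k] * B[k][j]
MATMUL_REFS = [("i", "j"), ("i", "k"), ("k", "j")]
MATMUL_LOOPS = ["i", "j", "k"]

if __name__ == "__main__":
    for loop, cost in rank_inner_loops(MATMUL_REFS, MATMUL_LOOPS, 512, 8):
        print(loop, cost)
```

Under these assumptions, matrix multiplication ranks `j` as the cheapest innermost loop: `C` and `B` are then accessed with unit stride and `A` is invariant in the inner loop. A loop permutation that moves `j` innermost is the kind of organization such a model would favor.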

To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
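Comparing simulated hit rates for an original and a transformed loop nest can be sketched as follows. This is a toy under stated assumptions, not the simulator used in the experiments: it models a direct-mapped cache with hypothetical sizes (`cache_bytes`, `line_bytes`) and compares row-wise against column-wise traversal of a row-major `n` by `n` array of 8-byte elements. The function names are illustrative.

```python
def hit_rate(addresses, cache_bytes=1024, line_bytes=32):
    """Fraction of accesses that hit in a direct-mapped cache."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # block currently held by each cache line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this byte
        index = block % num_lines      # the one cache line it may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the block, evict the old one
        total += 1
    return hits / total if total else 0.0


def traversal(n, elem_bytes=8, row_wise=True):
    """Yield byte addresses for visiting every element of a row-major array."""
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if row_wise else (inner, outer)
            yield (i * n + j) * elem_bytes


if __name__ == "__main__":
    print("row-wise   :", hit_rate(traversal(64, row_wise=True)))
    print("column-wise:", hit_rate(traversal(64, row_wise=False)))
```

With these default sizes and `n = 64`, the row-wise order reuses each fetched line for the next three elements, while the column-wise order keeps mapping successive accesses onto the same two cache lines and evicts them before any reuse. Real caches and real programs are less extreme, which is consistent with the observation that many benchmarks already have high hit rates.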


Published In

ACM SIGOPS Operating Systems Review Volume 28, Issue 5

Dec. 1994

323 pages

ISSN:0163-5980

DOI:10.1145/381792

Chairman: Henry M. Levy (Univ. of Washington, Seattle)

ASPLOS VI: Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
November 1994
341 pages
ISBN:0897916603
DOI:10.1145/195473
Chairmen: Forest Baskett (Silicon Graphics), Douglas Clark (Princeton Univ.)

Copyright © 1994 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 November 1994

Published in SIGOPS Volume 28, Issue 5
