
Improving register allocation for subscripted variables

Authors:

David Callahan,

Ken Kennedy

ACM SIGPLAN Notices, Volume 39, Issue 4

Pages 328 - 342

https://doi.org/10.1145/989393.989428

Published: 01 April 2004

Abstract

Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables. In this paper, we present a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The objective is to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers. In addition, we present transformations to improve the overall effectiveness of scalar replacement and show how these transformations can be applied in a variety of loop nest types. Finally, we present experimental results showing that these techniques are extremely effective, capable of achieving integer-factor speedups over code generated by good optimizing compilers of conventional design.
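To make the idea of scalar replacement concrete, here is a small hand-written sketch in C. It is an illustration under assumptions, not the transformation as the paper specifies it: the function names `recurrence_original` and `recurrence_scalar_replaced`, and the loop itself, are hypothetical examples chosen for this page. In the first version, the value read as `a[i - 1]` is the one stored to `a[i]` on the previous iteration; the second version carries that value in a scalar temporary `t`, which a coloring-based register allocator can more readily keep in a register.

```c
#include <stddef.h>

/* Original loop (hypothetical example): each iteration reloads a[i - 1],
 * even though that value was just stored on the previous iteration. */
void recurrence_original(double *a, const double *b, size_t n) {
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + b[i];
    }
}

/* After a manual scalar replacement: the reused array element lives in the
 * scalar temporary t, so the loop body has no load of a[i - 1]. */
void recurrence_scalar_replaced(double *a, const double *b, size_t n) {
    if (n == 0) {
        return; /* nothing to read or write */
    }
    double t = a[0]; /* value the first iteration would have read as a[i - 1] */
    for (size_t i = 1; i < n; i++) {
        t = t + b[i]; /* same arithmetic as the original loop body */
        a[i] = t;     /* the store to the array is kept */
    }
}
```

Both functions should leave the same values in `a` for the same inputs; the rewritten loop trades one memory load per iteration for a scalar that stays live across iterations.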

Improving register allocation for subscripted variables
1. Software and its engineering
  1. Software notations and tools

Recommendations

Improving register allocation for subscripted variables
PLDI '90: Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation

Differential register allocation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation

Micro-architecture designers are very cautious about expanding the number of architected registers (also the register field), because increasing the register field adds to the code size, raises I-cache and memory pressure, complicates processor ...


Information & Contributors

Information

Published In


ACM SIGPLAN Notices Volume 39, Issue 4

20 Years of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1979-1999: A Selection

April 2004

673 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/989393

Editor:
Kathryn S. McKinley
The University of Texas at Austin, USA


Copyright © 2004 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 April 2004

Published in SIGPLAN Volume 39, Issue 4


Qualifiers

Article


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 5
Total Downloads: 467
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 12 Nov 2024


Citations

Cited By

Cherian A, Zhou K, Grubisic D, Meng X, Mellor-Crummey J (2021). Measurement and Analysis of GPU-Accelerated OpenCL Computations on Intel GPUs. 2021 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), pages 26-35. Online publication date: Nov-2021.
https://doi.org/10.1109/ProTools54808.2021.00009
Rostrup S, De Sterck H (2010). Parallel hyperbolic PDE simulation on clusters: Cell versus GPU. Computer Physics Communications 181:12, pages 2164-2179. Online publication date: Dec-2010.
https://doi.org/10.1016/j.cpc.2010.07.049
Philippidis C, Shang W (2010). On minimizing register usage of linearly scheduled algorithms with uniform dependencies. Computer Languages, Systems and Structures 36:3, pages 250-267. Online publication date: 1-Oct-2010.
https://dl.acm.org/doi/10.1016/j.cl.2009.12.001
Hwu W, Rodrigues C, Ryoo S, Stratton J (2009). Compute Unified Device Architecture Application Suitability. Computing in Science and Engineering 11:3, pages 16-26. Online publication date: 1-May-2009.
https://dl.acm.org/doi/10.1109/MCSE.2009.48
Ryoo S, Rodrigues C, Baghsorkhi S, Stone S, Kirk D, Hwu W, Chatterjee S, Scott M (2008). Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 73-82. Online publication date: 20-Feb-2008.
https://dl.acm.org/doi/10.1145/1345206.1345220
