Article

Fast and efficient partial code reordering: taking advantage of dynamic recompilation

Authors:

Xianglong Huang,

Stephen M. Blackburn,

Kathryn S. McKinley

ISMM '06: Proceedings of the 5th international symposium on Memory management

Pages 184 - 192

https://doi.org/10.1145/1133956.1133980

Published: 10 June 2006

Abstract

Poor instruction cache locality can degrade performance on modern architectures. For example, our simulation results show that eliminating all instruction cache misses improves performance by as much as 16% for a modestly sized instruction cache. In this paper, we show how to take advantage of dynamic code generation in a Java Virtual Machine (VM) to improve instruction locality at run-time. We develop a dynamic code reordering (DCR) system, a low-overhead, online approach for improving instruction locality. DCR has three optimizations: (1) interprocedural method separation; (2) intraprocedural code splitting; and (3) code padding. DCR uses the dynamic call graph and an edge profile that most VMs already collect to separate hot/cold methods and hot/cold code within a method. It also puts padding between methods to minimize conflict misses between frequent caller/callee pairs. It performs these optimizations incrementally, only when the VM is optimizing a method at a higher level. We implement DCR in Jikes RVM and show that its overhead is negligible. Extensive simulation and run-time experiments show that a simple code space improves average performance on a Pentium 4 by around 6% on SPEC and DaCapo Java benchmarks. These programs, however, have very small instruction cache footprints that limit opportunities for DCR to improve performance. Consequently, DCR optimizations on average show little effect, sometimes degrading performance and occasionally improving performance by up to 5%. Our work shows that the VM has the potential to dynamically improve instruction locality incrementally by simply piggybacking on hotspot recompilation.
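The abstract names hot/cold method separation and padding between frequent caller/callee pairs but gives no algorithmic detail. The following is a minimal sketch of both ideas, not the authors' implementation. The cache geometry (`LINE_SIZE`, `NUM_SETS`), the invocation-count threshold, and all function names are assumptions made up for illustration.

```python
# Illustrative sketch only: cache geometry, threshold, and names are assumed,
# not taken from the paper or from Jikes RVM.
LINE_SIZE = 64    # bytes per instruction cache line (assumed)
NUM_SETS = 128    # number of cache sets (assumed)


def split_hot_cold(call_counts, hot_threshold):
    """Partition methods into hot and cold lists by invocation count."""
    hot = [m for m, c in call_counts.items() if c >= hot_threshold]
    cold = [m for m, c in call_counts.items() if c < hot_threshold]
    return hot, cold


def cache_sets(start, size):
    """Return the set indices touched by code occupying [start, start + size)."""
    first_line = start // LINE_SIZE
    last_line = (start + size - 1) // LINE_SIZE
    return {line % NUM_SETS for line in range(first_line, last_line + 1)}


def padding_for(caller_start, caller_size, callee_size, next_free):
    """Smallest padding (a multiple of LINE_SIZE) to add at next_free so the
    callee maps to cache sets disjoint from the caller's.
    Returns None when no such padding exists (e.g. the callee spans every set)."""
    caller = cache_sets(caller_start, caller_size)
    for pad in range(0, NUM_SETS * LINE_SIZE, LINE_SIZE):
        if not (caller & cache_sets(next_free + pad, callee_size)):
            return pad
    return None
```

For example, with these assumed parameters a 256-byte caller at address 0 occupies sets 0 through 3. A 128-byte callee placed at address 8192 would map onto sets 0 and 1 and conflict with the caller, so the sketch skips ahead until the callee lands on sets 4 and 5. A real VM would also weigh the code space wasted by padding against the expected reduction in misses.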

Cited By

Gordon-Ross A, Vahid F, Dutt N (2013) Combining code reordering and cache configuration. ACM Transactions on Embedded Computing Systems 11:4 (1-20). Online publication date: 1-Jan-2013
https://dl.acm.org/doi/10.1145/2362336.2399177
McDaniel M, Hazelwood K (2012) Runtime adaptation. Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (1-11). Online publication date: 3-Mar-2012
https://dl.acm.org/doi/10.1145/2185475.2185476
Merrill D, Hazelwood K, Gregg D, Adve V, Bershad B (2008) Trace fragment selection within method-based JVMs. Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments (41-50). Online publication date: 5-Mar-2008
https://dl.acm.org/doi/10.1145/1346256.1346263

Index Terms

Fast and efficient partial code reordering: taking advantage of dynamic recompilation
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Incremental compilers

Published In

ISMM '06: Proceedings of the 5th international symposium on Memory management

June 2006

202 pages

ISBN:1595932216

DOI:10.1145/1133956

General Chair: Erez Petrank, Technion - Israel Institute of Technology

Program Chair: Eliot Moss, University of Massachusetts, Amherst

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ISMM06

ISMM06: The 2006 International Symposium on Memory Management

June 10 - 11, 2006

Ottawa, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 72 of 156 submissions, 46%

Bibliometrics

Total Citations: 3

Total Downloads: 346

Downloads (Last 12 months): 2

Downloads (Last 6 weeks): 1

Reflects downloads up to 14 Oct 2024
