Efficient and General On-Stack Replacement For Aggressive Program Specialization
E-mail: {sunils,ckrintz}@cs.ucsb.edu
VARMAP shows significant improvement in application performance – by 9% on average across all benchmarks, and by over 10% across the SpecJVM benchmarks. jess and mtrt show the most benefit, 31% and 20% respectively. For these benchmarks, the original implementation severely inhibits optimization, particularly due to increased register pressure caused by artificially extending the live ranges of variables past their last actual use. This results in a large number of variable spills to memory. With our implementation, we do not need to maintain conservative liveness estimates, since we track liveness information accurately.

Other benchmarks show benefits of 5% or less. For these benchmarks, improved code quality does not impact overall performance significantly. Since these programs are short running, they are not heavily optimized. In addition, OSR VARMAP does impose some GC overhead, since we maintain information for each possible OSR point – which for short-running codes is not fully amortized, especially for small heap sizes.

Figure 3 details the compilation-time and space overhead of our system. Columns 2 and 3 show the compilation time for the reference system and our VARMAP, respectively. Column 4 shows the percentage degradation in compilation time imposed by VARMAP. Columns 5 and 6 show the space overhead introduced by VARMAP during compilation (collectable) and at runtime (permanent), respectively. On average, our system increases compilation time by just over 100ms, and adds space overhead of 133KB that is collected, and 29KB that is not collectable.

                 Compilation Time (ms)    Space Added (KB)
  Benchmark        Clean     VARMAP     Compile Time   Runtime
  compress            68         79          14.52        3.16
  db                  91        117          24.57        5.26
  jack               445        543         139.67       30.00
  javac             1962       2540         629.94      136.98
  jess               504        656         136.80       29.20
  mtrt               595        746         154.38       33.50
  MST                 50         66          17.03        3.73
  Perimeter           86         66          15.82        3.47
  Voronoi             96        129          62.06       13.49
  Avg.               433        549         132.75       28.75
  Avg. Spec98        611        780         183.31       39.68

Figure 3. Overhead of our OSR-VARMAP implementation in the JikesRVM reference system. Columns 2 and 3 indicate compilation times; columns 5 and 6 show space overhead.

Specialization for Generational Garbage Collection

We next present results for a novel OSR-based specialization: write-barrier removal for generational GCs. Prior to the initial GC, when there are no objects in the mature space, write barriers are not required and thus impose pure overhead. Our goal with this specialization is to reduce the overhead of write barriers for programs that do not require GC, and to improve the startup performance of those programs that do.

For this specialization, we employ the popular Generational Mark Sweep (GMS) collector. The compiler checks the maximum heap size to ensure that it is large enough to warrant specialization (>=500MB), and that heap residency (pages used/pages allocated) is low (<=60%). We identified both values empirically. If no GCs have occurred, the compiler elides write barriers. This specialization must be invalidated when objects are first promoted to the mature space. Our system only performs OSR for methods that require write barriers and that contain field assignments that will execute after the point at which execution has been suspended.

We present the performance of write-barrier specialization (referred to as WBSpec) in Figure 4. We used a heap size of 500MB for these experiments. Columns 2 and 3 show the execution time in seconds without and with WBSpec, respectively. Column 4 shows the percent improvement enabled by WBSpec. Column 5 shows the number of write barriers eliminated. The final two columns show the OSR overhead imposed.
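The compile-time test described above (heap size, residency, and GC count) can be sketched as follows. This is a minimal illustration with hypothetical names, not JikesRVM code; only the thresholds (>=500MB, <=60%) come from the paper.

```java
// Hypothetical sketch of the WBSpec decision: elide write barriers only
// while no GC has occurred and a GC is unlikely soon. Class and method
// names are illustrative, not JikesRVM APIs.
public class WBSpecPolicy {
    static final long MIN_HEAP_BYTES = 500L * 1024 * 1024; // paper's threshold
    static final double MAX_RESIDENCY = 0.60;              // paper's threshold

    /** Returns true if the compiler may elide write barriers for now. */
    static boolean canElideWriteBarriers(long maxHeapBytes,
                                         long pagesUsed,
                                         long pagesAllocated,
                                         int gcCount) {
        if (maxHeapBytes < MIN_HEAP_BYTES) return false;   // heap too small
        double residency = (double) pagesUsed / pagesAllocated;
        if (residency > MAX_RESIDENCY) return false;       // likely to GC soon
        // No GC yet means no objects in the mature space, so no barrier
        // is needed; the first promotion must invalidate this decision.
        return gcCount == 0;
    }

    public static void main(String[] args) {
        long halfGB = 500L * 1024 * 1024;
        System.out.println(canElideWriteBarriers(halfGB, 10, 100, 0)); // true
        System.out.println(canElideWriteBarriers(halfGB, 10, 100, 1)); // false: a GC occurred
    }
}
```

Once the first minor GC promotes objects, the decision flips, and methods compiled without barriers must be replaced via OSR as described above.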
[Figure 5: Performance improvement from using OSR VARMAP for speculative devirtualization (dynamic dispatch) of virtual methods. We omit benchmarks which showed no significant change.]

[Figure 6: Performance improvement from using OSR VARMAP for OSR within an automatic GC switching system.]
Column 6 is the number of OSRs that occur, and column 7 is the total time for all OSRs. For many of the benchmarks, no OSR is required, since a minor GC is not triggered for these. For those that require OSR, the overhead is very small.

On average, WBSpec improves performance by 6% across benchmarks. For the SpecJVM benchmarks (the first 6 in the table), WBSpec improves performance by 3% on average. For benchmarks that require GC, this benefit comes during program startup. The JOlden benchmarks require no GC when we use a heap size of 500MB. We believe that such applications are ideal candidates for the specialization presented. The average improvement in execution time for these benchmarks is 13%.

OSR VARMAP for Existing Specializations

For guard-free dynamic dispatch, we replace code patching and OSRPoints for deferred compilation with our VARMAP implementation to guard speculatively inlined virtual method calls. The compiler inlines calls that meet size constraints and for which a single object target can be predicted [2]. To preserve correctness, we employ OSR to replace code that does not have checks inserted when class loading invalidates assumptions made by the compiler. Upon recompilation, the compiler generates a new version of the method that implements dynamic dispatch at the previously inlined call site. We only perform OSR for methods for which the compiler cannot establish pre-existence guarantees (see Section 4).

Figure 5 shows the impact of using VARMAP over using the original OSRPoint and code patching. We only present results for the two benchmarks that implemented call sites for which the compiler was unable to make pre-existence guarantees. For jess, a single inlined method with an eliminated guard (_202_jess.jess.ValueVector.size()) constitutes 8% of the total number of method invocations. In the case of mtrt, 3 such methods constitute over 50% of the total invocations.

We also used our OSR implementation to improve the performance of a dynamic garbage-collection switching system within JikesRVM that we developed in prior work. This system enables performance gains (including all overheads) of 11-14% on average. However, the system incurs an overhead of 10% on average, even when no dynamic switching is triggered. This overhead, whose primary source is the need for OSR support, inhibits the full performance potential of the system. Our OSR implementation reduces the overhead of the system by almost half.

Figure 6 shows the benefits to total execution time due to the use of our OSR implementation in the GC switching system. The data is the average improvement across a range of heap sizes for each benchmark (from the minimum to 12x the minimum). VARMAP improves performance most significantly for jess (26%), mtrt (19%), and MST (22%). This is due to the number of OSR points in the hot methods for these benchmarks, and due to the fact that MST is very short running. Across benchmarks, VARMAP shows a 9% improvement. This result is similar to the performance improvement enabled by VARMAP over the extant state-of-the-art that we presented previously.

6 Conclusions and Future Work

We present a new implementation of on-stack replacement (OSR) that decouples dynamic state collection from compilation and optimization. Unlike existing approaches, our implementation does not inhibit compiler optimization, and it enables the compiler to produce higher-quality code. Our empirical measurements within JikesRVM show a performance improvement of 9% on average (from 1% to 31%) over a commonly used implementation. We implement a novel, OSR-based specialization for write-barrier removal in generational GC that improves performance by 6% on average. Moreover, we empirically confirm that our system is effective for extant specializations: dynamic dispatch of virtual methods, and automatic GC switching.

As part of future work, we plan to employ our OSR system for other aggressive specializations. These include removing infrequently executed instructions, e.g., exception handling code. In addition, OSR can be used to trigger dynamic software updates in highly available server systems. In such environments, control never leaves a particular method (typically, a program loops forever listening for service requests, and issues requested work to slave processes). Using OSR, we can therefore upgrade code without affecting service availability. Finally, we plan to investigate the use of OSR in aggressive incremental alias and escape analysis to perform speculative pointer-based optimizations, such as stack allocation of objects, memory layout optimizations, and synchronization removal.

Acknowledgements

We thank Kathryn McKinley and Steve Fink for their invaluable input and suggestions on this work. This work was funded in part by NSF grant Nos. CAREER-CNS-0546737, ST-HEC-0444412, and EHS-0209195, and by grants from Intel, Microsoft, and UCMicro.

References

[1] B. Alpern, C. R. Attanasio, J. J. Barton, M. G. Burke, P. Cheng, J.-D. Choi, A. Cocchi, S. J. Fink, D. Grove, M. Hind, S. F. Hummel, D. Lieber, V. Litvinov, M. F. Mergen, T. Ngo, J. R. Russell, V. Sarkar, M. J. Serrano, J. C. Shepherd, S. E. Smith, V. C. Sreedhar, H. Srinivasan, and J. Whaley. The Jalapeño Virtual Machine. IBM Systems Journal, 39(1):211–221, 2000.
[2] M. Arnold, S. Fink, D. Grove, M. Hind, and P. Sweeney. Adaptive Optimization in the Jalapeño JVM. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Oct. 2000.
[3] S. Blackburn and K. McKinley. In or Out? Putting Write Barriers in Their Place. In International Symposium on Memory Management (ISMM), June 2002.
[4] S. M. Blackburn and A. L. Hosking. Barriers: Friend or Foe? In International Symposium on Memory Management (ISMM), Oct. 2004.
[5] M. Burke, J. Choi, S. Fink, D. Grove, M. Hind, V. Sarkar, M. Serrano, V. Shreedhar, H. Srinivasan, and J. Whaley. The Jalapeño Dynamic Optimizing Compiler for Java. In ACM JavaGrande Conference, June 1999.
[6] B. Cahoon and K. McKinley. Data Flow Analysis for Software Prefetching Linked Data Structures in Java. In International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2001.
[7] C. Chambers and D. Ungar. Making Pure Object-Oriented Languages Practical. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Oct. 1991.
[8] D. Detlefs and O. Agesen. Inlining of Virtual Methods. In European Conference on Object-Oriented Programming (ECOOP), June 1999.
[9] A. Diwan, E. Moss, and R. Hudson. Compiler Support for Garbage Collection in a Statically Typed Language. In ACM Conference on Programming Language Design and Implementation, June 1992.
[10] S. Fink and F. Qian. Design, Implementation and Evaluation of Adaptive Recompilation with On-Stack Replacement. In International Symposium on Code Generation and Optimization (CGO), Mar. 2003.
[11] G. Aigner and U. Hölzle. Eliminating Virtual Function Calls in C++ Programs. In European Conference on Object-Oriented Programming (ECOOP), July 1996.
[12] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. Addison-Wesley, 1996.
[13] U. Hölzle. Optimizing Dynamically Dispatched Calls with Run-Time Type Feedback. In Conference on Programming Language Design and Implementation, June 1994.
[14] U. Hölzle, C. Chambers, and D. Ungar. Debugging Optimized Code with Dynamic Deoptimization. In ACM Conference on Programming Language Design and Implementation, June 1992.
[15] U. Hölzle and D. Ungar. A Third Generation Self Implementation: Reconciling Responsiveness With Performance. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Oct. 1994.
[16] A. Hosking, J. E. Moss, and D. Stefanović. A Comparative Performance Evaluation of Write Barrier Implementations. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Oct. 1992.
[17] K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani. A Study of Devirtualization Techniques for a Java Just-In-Time Compiler. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Oct. 2000.
[18] H. Lieberman and C. Hewitt. A Real-Time Garbage Collector Based on the Lifetimes of Objects. Communications of the ACM, 26(6):419–429, 1983.
[19] M. Paleczny, C. Vick, and C. Click. The Java HotSpot(TM) Server Compiler. In USENIX Java Virtual Machine Research and Technology Symposium (JVM'01), Apr. 2001.
[20] N. Sachindran and J. E. B. Moss. Mark-Copy: Fast Copying GC with Less Space Overhead. In Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Oct. 2003.
[21] S. Soman, C. Krintz, and D. F. Bacon. Dynamic Selection of Application-Specific Garbage Collectors. In International Symposium on Memory Management (ISMM), Oct. 2004.
[22] SpecJVM'98 Benchmarks. http://www.spec.org/osg/jvm98.
[23] T. Suganuma, T. Yasue, and T. Nakatani. A Region-Based Compilation Technique for a Java Just-In-Time Compiler. In Conference on Programming Language Design and Implementation, June 2003.
[24] D. Ungar. Generation Scavenging: A Non-Disruptive High Performance Storage Reclamation Algorithm. In Software Engineering Symposium on Practical Software Development Environments, Apr. 1984.
[25] P. Wilson and T. Moher. A Card-Marking Scheme for Controlling Intergenerational References in Generation-Based Garbage Collection on Stock Hardware. SIGPLAN Notices, 24(5):87–92, 1989.
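To make the guarded/guard-free distinction concrete, the following is a source-level sketch with hypothetical class names; it illustrates the idea, not the code the compiler actually generates.

```java
// A guarded inline protects the inlined body with a cheap class test;
// guard-free inlining omits the test and instead relies on OSR to rewrite
// running frames when class loading invalidates the single-target
// assumption. All names here are hypothetical.
abstract class Shape { abstract int area(); }

final class Square extends Shape {
    final int side;
    Square(int side) { this.side = side; }
    int area() { return side * side; }
}

final class Rect extends Shape {
    final int w, h;
    Rect(int w, int h) { this.w = w; this.h = h; }
    int area() { return w * h; }
}

public class DevirtDemo {
    // Guarded inlining: test the receiver class, fall back to dispatch.
    static int guardedArea(Shape s) {
        if (s.getClass() == Square.class) {
            Square sq = (Square) s;
            return sq.side * sq.side;   // inlined body of Square.area()
        }
        return s.area();                // slow path: virtual dispatch
    }

    // Guard-free inlining: correct only while the compiler may assume every
    // receiver is a Square. Loading a second subclass invalidates this
    // method; a real VM must OSR any active frames into code that performs
    // true dispatch (here, the cast would simply throw for a Rect).
    static int guardFreeArea(Shape s) {
        Square sq = (Square) s;
        return sq.side * sq.side;
    }

    public static void main(String[] args) {
        System.out.println(guardedArea(new Square(3)));   // 9, inlined body
        System.out.println(guardedArea(new Rect(2, 5)));  // 10, via fallback
        System.out.println(guardFreeArea(new Square(4))); // 16
    }
}
```

The guard-free version is faster on the hot path precisely because it omits the class test, which is why invalidation-by-OSR rather than a per-call guard is attractive when pre-existence cannot be established.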
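As a minimal illustration of such a server loop (hypothetical names, with the code upgrade modeled as a handler swap rather than a true OSR frame rewrite):

```java
// Control never leaves the serving method, so upgrading its code requires a
// transfer at a safe point. Here the "OSR point" is modeled as reading a
// replaceable handler once per request; a real OSR system would instead
// rewrite the running frame into newly compiled code. Names hypothetical.
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

public class UpdatableServer {
    final AtomicReference<Function<String, String>> handler =
        new AtomicReference<>(req -> "v1:" + req);

    // Install new code without stopping the loop (the "upgrade").
    void upgrade(Function<String, String> newHandler) {
        handler.set(newHandler);
    }

    // One iteration of the forever-loop: take a request, issue the work.
    String serveOne(String request) {
        return handler.get().apply(request);
    }

    public static void main(String[] args) {
        UpdatableServer s = new UpdatableServer();
        System.out.println(s.serveOne("ping")); // v1:ping
        s.upgrade(req -> "v2:" + req);          // service never paused
        System.out.println(s.serveOne("ping")); // v2:ping
    }
}
```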