This document summarizes a presentation on optimizing cache performance in Java applications. It covers key elements of cache performance such as hit ratio, expiration policies, and load spikes; metrics such as insert/read latencies and object sizes; and issues that hurt performance, including garbage collection pauses, per-object overhead, serialization costs, and thread contention. It surveys several cache implementations and recommends measuring performance to identify optimization opportunities.
How to Stop Worrying and Start Caching in Java
2. How to Stop Worrying and Start Caching in Java
SriSatish Ambati, Azul Systems
sris@azulsystems.com
Manik Surtani, RedHat Inc
msurtani@redhat.com
3. The Trail
Examples
Elements of Cache Performance
Theory
Metrics
200GB Cache Design
Whodunit
Overheads in Java – Objects, GC, Locks, Communication, Sockets, Queues (SEDA), Serialization
Measure
4. Wer die Wahl hat, hat die Qual!
He who has the choice has the agony!
5. Some example caches
Homegrown caches – surprisingly, they work well
(do it yourself! It’s a giant hash – see the sketch after this list)
Infinispan, Coherence, Gemstone, GigaSpaces,
EhCache, etc
NoSQL stores (Apache Cassandra)
Non-Java alternatives: Memcached & clones.
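The "giant hash" point can be made concrete with a minimal sketch: a thread-safe map plus hit/miss counters. The class and method names here (SimpleCache, hitRatio) are illustrative, not taken from any of the products listed above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// A minimal "giant hash" cache: a thread-safe map plus hit/miss counters.
public final class SimpleCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    public V get(K key) {
        V value = map.get(key);
        if (value != null) hits.increment(); else misses.increment();
        return value;
    }

    public void put(K key, V value) { map.put(key, value); }

    public double hitRatio() {
        long h = hits.sum(), m = misses.sum();
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
    }
}
```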
7. Elements of Cache Performance
Hot or Not: The 80/20 rule.
A small set of objects is very popular!
Hit or Miss: Hit Ratio
How effective is your cache?
Expiration policies: LRU, LFU, FIFO, LIRS, ... (see the sketch below)
Long-lived objects, better locality
Spikes happen
Cascading events: Node load, Node(s) dead.
Cache Thrash: Full Table scan.
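For the expiration policies above, a common single-JVM illustration of LRU eviction uses LinkedHashMap in access order. This is only a sketch of the idea, not how any of the distributed caches implement eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache: LinkedHashMap in access order evicts the eldest entry
// once maxEntries is exceeded. Single-threaded illustration only.
public final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder = true -> LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict the least-recently-used entry
    }
}
```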
8. Elements of Cache Performance: Metrics
Inserts: Puts/sec, Latencies
Reads: Gets/sec, Latencies, Indexing
Updates: mods/sec, latencies (Locate, Modify & Notify)
Replication, Consistency, Persistence
Size of Objects
Number of Objects
Size of Cache
# of cache server nodes (read-only, read-write)
# of clients
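As a rough way to collect the puts/sec and gets/sec figures above, a sketch against an in-process ConcurrentHashMap; numbers from a loop like this are only indicative, and real measurements need a proper benchmark harness and should report latency percentiles too.

```java
import java.util.concurrent.ConcurrentHashMap;

// Crude puts/sec and gets/sec measurement against an in-process map.
public final class CacheThroughput {
    public static void main(String[] args) {
        ConcurrentHashMap<Integer, String> cache = new ConcurrentHashMap<>();
        int n = 1_000_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) cache.put(i, "value-" + i);
        double putSecs = (System.nanoTime() - t0) / 1e9;

        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) cache.get(i);
        double getSecs = (System.nanoTime() - t1) / 1e9;

        System.out.printf("puts/sec: %.0f, gets/sec: %.0f%n", n / putSecs, n / getSecs);
    }
}
```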
9. Partitioning & Distributed Caches
Near Cache/L1 Cache
Bring data close to the Logic that is using it.
Birds of a feather flock together – related data lives closer
Read-only nodes, Read-Write nodes
Management nodes
Communication Costs
Balancing (buckets)
Serialization (more later)
10. I/O considerations
Asynchronous
Sockets
Queues & Threads serving the sockets
Bandwidth
Persistence – file, DB (CacheLoaders)
Write Behind
Data Access Patterns of Doom, e.g.:
“Death by a million cuts” – batch your reads (see the sketch below)
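“Death by a million cuts” means one remote round trip per get; the fix is a bulk read. A sketch, assuming a client that exposes a getAll-style call – the interface and method names here are hypothetical, though most of the products above offer some bulk-read API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative client interface; getAll is the hypothetical bulk-read call.
interface CacheClient<K, V> {
    V get(K key);
    Map<K, V> getAll(List<K> keys);
}

final class BatchedReads {
    // Anti-pattern: "death by a million cuts" -- one remote round trip per key.
    static <K, V> Map<K, V> fetchOneByOne(CacheClient<K, V> cache, List<K> keys) {
        Map<K, V> out = new HashMap<>();
        for (K key : keys) out.put(key, cache.get(key));
        return out;
    }

    // Batched alternative: one round trip for the whole key set.
    static <K, V> Map<K, V> fetchBatched(CacheClient<K, V> cache, List<K> keys) {
        return cache.getAll(keys);
    }
}
```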
11. Buckets – Partitions, Hashing function
Birthdays, Hashmaps & Prime Numbers
Collisions, Chaining
Unbalanced HashMap – behaves like a list, O(n) retrieval (demonstrated below)
Partition-aware Hashmaps
Non-blocking HashMaps (see: Locks)
Performance degrades at about 80% table density
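To see why an unbalanced HashMap behaves like a list, a small sketch with a deliberately constant hashCode: every key lands in the same bucket, so inserts and lookups degrade toward O(n). (Java 8+ eventually treeifies large bins, which softens but does not remove the penalty.)

```java
import java.util.HashMap;
import java.util.Map;

// A key with a constant hashCode: every entry collides into the same bucket.
final class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }

    @Override public int hashCode() { return 42; }   // all keys collide
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
}

final class CollisionDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        int n = 100_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) map.put(new BadKey(i), i);
        System.out.printf("inserts with constant hashCode: %.1f ms%n",
                (System.nanoTime() - t0) / 1e6);
        // With a well-distributed hashCode the same inserts are dramatically faster.
    }
}
```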
13. How many nodes to get a 200G cache?
Who needs a 200G cache?
Disk is the new Tape!
200 nodes @ 1 GB heap each, or
2 nodes @ 100 GB heap each (plus overhead)
14. SIDE ONE
Join together with the band
I don’t even know myself
SIDE TWO
Let’s see action
Relay
Don’t happen that way at all
The seeker
15. Java Limits: Objects are not cheap!
How many bytes for an 8-char String? (assume 32-bit)
How many objects in an idle Tomcat instance?
String object: JVM overhead 16 bytes + bookkeeping fields 12 bytes + pointer to char[] 4 bytes = 32 bytes
char[]: JVM overhead 16 bytes + data 16 bytes = 32 bytes
A. 64 bytes, 31% overhead
(Size of a String varies with JVM.)
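The exact numbers vary by JVM; one way to check them on your own JVM is java.lang.instrument. A minimal sketch, assuming the class is packaged as an agent with a Premain-Class manifest entry and started with -javaagent. It reports shallow sizes only, so the String and its character array are measured separately, as on the slide.

```java
import java.lang.instrument.Instrumentation;

// Premain agent that prints the shallow size of a String and of an 8-char array.
// Shallow sizes only: the String object and the char[] must be added by hand.
public final class SizeAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        String s = "12345678";              // the 8-char example from the slide
        char[] chars = s.toCharArray();     // stand-in for the internal array
        System.out.println("String shallow size:  " + inst.getObjectSize(s));
        System.out.println("char[8] shallow size: " + inst.getObjectSize(chars));
    }
}
```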
16. Picking the right collection: Mozart or Bach?
100 elements of:
TreeMap<Double, Double> – 82% overhead, 88 bytes constant cost per element
[pro: enables updates while maintaining order]
double[], double[] – 2% overhead, amortized
[con: load-then-use]
Sparse collections, empty collections, wrong collections.
TreeMap fixed overhead: 48 bytes; TreeMap$Entry per-entry overhead: 40 bytes
Double: JVM overhead 16 bytes + data 8 bytes
*From one 32-bit JVM. Varies with JVM architecture.
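A rough per-element accounting using the slide's own 32-bit figures shows where the percentages come from: each TreeMap entry costs 40 bytes plus two 24-byte boxed Doubles (the 88 bytes per element), while two double[] arrays pay only two 16-byte headers on top of the raw data. A sketch holding the same 100 pairs both ways:

```java
import java.util.TreeMap;

// Same 100 (key, value) pairs held two ways. Using the slide's 32-bit figures:
//   TreeMap:     48 + 100 * (40 /*Entry*/ + 2 * 24 /*boxed Double*/) = 8,848 bytes (~82% overhead)
//   double[] x2: 2 * (16 /*array header*/ + 100 * 8)                 = 1,632 bytes (~2% overhead)
// The arrays keep ordering only if loaded in order ("load-then-use").
final class CollectionFootprint {
    public static void main(String[] args) {
        TreeMap<Double, Double> map = new TreeMap<>();
        double[] keys = new double[100];
        double[] values = new double[100];
        for (int i = 0; i < 100; i++) {
            map.put((double) i, i * 2.0);
            keys[i] = i;
            values[i] = i * 2.0;
        }
        System.out.println("entries: " + map.size() + ", array length: " + keys.length);
    }
}
```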
18. Java Limits: Garbage Collection
GC defines cache configuration
Pause times: if stop_the_world_pause > time_to_live ⇒ node is declared dead
Allocation rate: write/insertion speed
Live objects (residency): if residency > 50%, GC overheads dominate
Increasing heap size only increases pause times
64-bit is not going to rescue us either:
increases object header, alignment & pointer overhead;
40-50% increase in heap sizes for the same workloads
Overheads – cycles spent in GC (vs. real work); space
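To quantify “cycles spent in GC vs. real work”, the standard GarbageCollectorMXBean counters are a simple starting point; this is only a sketch, and GC logs and pause-time histograms give far more detail.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Print cumulative collection counts and time per collector; sampling this
// periodically shows how much wall-clock time the cache node spends in GC.
final class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```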
19. Fragmentation, Generations
Fragmentation – compact often, uniform-sized objects
[Finding seats for a gang of four is easier in an empty theater!]
Face your fears, face them often!
Generational hypothesis:
long-lived objects promote often ⇒ inter-generational pointers, more old-gen collections
Entropy: how many flags does it take to tune your GC?
⇒ Avoid OOM; configure node death on OOM
⇒ Shameless plug: Azul’s Pauseless GC (now a software edition), Cooperative-Memory (swap space for your JVM under spike: no more OOM!)
20. Locks: Why Amdahl’s law trumps Moore’s!
Schemes: optimistic, pessimistic
Consistency: eventual vs. ACID
Contention, waits
java.util.concurrent, critical sections: use lock striping (see the sketch below)
MVCC, lock-free, wait-free data structures (NBHM)
Transactions are expensive ⇒ reduce JTA abuse, set the right isolation levels
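A minimal sketch of lock striping: an array of locks, each guarding its own segment of the data, indexed by the key's hash, so threads touching different stripes do not contend. ConcurrentHashMap and the caches above use far more sophisticated variants; this shows only the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Striped locking: one lock and one map segment per stripe, chosen by the
// key's hash, so writers to different stripes proceed in parallel.
final class StripedMap<K, V> {
    private static final int STRIPES = 16;
    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
    @SuppressWarnings("unchecked")
    private final Map<K, V>[] buckets = new Map[STRIPES];

    StripedMap() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantLock();
            buckets[i] = new HashMap<>();
        }
    }

    private int stripe(Object key) {
        return (key.hashCode() & 0x7fffffff) % STRIPES;
    }

    V put(K key, V value) {
        int s = stripe(key);
        locks[s].lock();
        try { return buckets[s].put(key, value); } finally { locks[s].unlock(); }
    }

    V get(K key) {
        int s = stripe(key);
        locks[s].lock();
        try { return buckets[s].get(key); } finally { locks[s].unlock(); }
    }
}
```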
21. Inter-node communication
• TCP for mgmt & data
– Infinispan
• TCP for mgmt, UDP for data
– Coherence, Infinispan
• UDP for mgmt, TCP for data
– Cassandra, Infinispan
• Instrumentation
– EHCache/Terracotta
• Bandwidth & Latency considerations
⇒ Ensure proper network configuration in the kernel
⇒ Run datagram tests (see the sketch below)
⇒ Limit the number of management nodes & nodes
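“Run datagram tests” can start as simply as timing a UDP send/receive on loopback before blaming the cache; a sketch only – real tests should use the cluster's actual interfaces, ports and payload sizes.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Minimal UDP sanity check: send a datagram to a local socket and time it.
// Replace loopback with the real cluster interfaces and payload sizes.
final class DatagramTest {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket receiver = new DatagramSocket(0);
             DatagramSocket sender = new DatagramSocket()) {
            byte[] payload = new byte[1024];
            InetAddress local = InetAddress.getLoopbackAddress();

            long t0 = System.nanoTime();
            sender.send(new DatagramPacket(payload, payload.length,
                    local, receiver.getLocalPort()));
            DatagramPacket in = new DatagramPacket(new byte[1024], 1024);
            receiver.receive(in);    // blocks until the packet arrives
            System.out.printf("loopback datagram: %.1f µs%n",
                    (System.nanoTime() - t0) / 1e3);
        }
    }
}
```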
22. Sockets, Queues, Threads, Stages
How many sockets?
GemStone (VMware): multi-socket implementation; Infinispan
alt: increase ports, nodes, clients
How many threads per socket? Mux
Asynchronous I/O & events (Apache MINA, JGroups)
The curious case of a single-threaded queue manager
Reduce context switching
SEDA
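A sketch of the SEDA idea: each stage is a queue plus a small thread pool, so socket threads hand work off instead of doing it inline. Real implementations (Apache MINA, JGroups) are far more involved; the stage sizes and names here are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Two SEDA-style stages: an I/O stage hands parsed requests to a processing
// stage through an executor queue, decoupling socket threads from work threads.
final class Stages {
    private final ExecutorService ioStage = Executors.newFixedThreadPool(2);
    private final ExecutorService workStage = Executors.newFixedThreadPool(8);

    void onBytesReceived(byte[] frame) {
        ioStage.submit(() -> {
            String request = new String(frame);       // "parse" on the I/O stage
            workStage.submit(() -> handle(request));  // hand off to the work stage
        });
    }

    private void handle(String request) {
        // cache lookup / business logic runs here, off the socket threads
        System.out.println("handled: " + request);
    }
}
```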
25. Count what is countable, measure what is measurable, and
what is not measurable, make measurable
-Galileo
26. Latency:
Where have all the millis gone?
Measure the 90th percentile; look for consistency (see the percentile sketch below).
⇒ JMX is great! JMX is also very slow.
A reduced number of nodes means fewer MBeans!
Monitor (network, memory, CPU): Ganglia
Know thyself: Application Footprint, Trend data.
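A sketch of “measure the 90th percentile”: record per-operation latencies and report percentiles rather than averages. The cache call inside the loop is a placeholder, and production systems usually use a histogram instead of sorting raw samples.

```java
import java.util.Arrays;

// Record latencies in nanoseconds, then report the 90th percentile
// (nearest-rank method). Sorting raw samples is fine for a test.
final class Percentiles {
    static long percentile(long[] samples, double pct) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        long[] latencies = new long[10_000];
        for (int i = 0; i < latencies.length; i++) {
            long t0 = System.nanoTime();
            // ... cache.get(key) would go here ...
            latencies[i] = System.nanoTime() - t0;
        }
        System.out.println("p90 (ns): " + percentile(latencies, 90.0));
    }
}
```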
27. Q&A
References:
Making Sense of Large Heaps, Nick Mitchell, IBM
Oracle Coherence 3.5, Aleksandar Seovic
Large Pages in Java http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html
Patterns of Doom http://3.latest.googtst23.appspot.com/
Infinispan Demos http://community.jboss.org/wiki/5minutetutorialonInfinispan
RTView, Tom Lubinski, http://www.sl.com/pdfs/SL-BACSIG-100429-final.pdf
Google Protocol Buffers, http://code.google.com/p/protobuf/
Azul’s Pauseless GC http://www.azulsystems.com/technology/zing-virtual-machine
Cliff Click’s Non-Blocking Hash Map http://sourceforge.net/projects/high-scale-lib/
JVM Serialization Benchmarks: http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
Description of Graph
Shows the average number of cache misses expected when inserting into a hash table with various collision resolution mechanisms; on modern machines, this is a good estimate of actual clock time required. This seems to confirm the common heuristic that performance begins to degrade at about 80% table density.
It is based on a simulated model of a hash table where the hash function chooses indexes for each insertion uniformly at random. The parameters of the model were:
You may be curious what happens in the case where no cache exists. In other words, how does the number of probes (number of reads, number of comparisons) rise as the table fills? The curve is similar in shape to the one above, but shifted left: it requires an average of 24 probes for an 80% full table, and you have to go down to a 50% full table for only 3 probes to be required on average. This suggests that in the absence of a cache, ideally your hash table should be about twice as large for probing as for chaining.
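The model described above is easy to reproduce for one collision-resolution scheme: insert keys at uniformly random slots with linear probing and track the average number of probes as the table fills. A rough sketch only – the original graph compared several schemes, and this covers linear probing alone.

```java
import java.util.Random;

// Simulate linear probing with a uniformly random hash: report the cumulative
// average probes per insertion as the table density grows.
final class ProbeSimulation {
    public static void main(String[] args) {
        int size = 1 << 16;
        boolean[] occupied = new boolean[size];
        Random rnd = new Random(42);
        long probes = 0;

        for (int inserted = 1; inserted <= (int) (0.9 * size); inserted++) {
            int slot = rnd.nextInt(size);  // uniformly random "hash"
            probes++;
            while (occupied[slot]) {       // linear probing on collision
                slot = (slot + 1) % size;
                probes++;
            }
            occupied[slot] = true;

            if (inserted % (size / 10) == 0) {
                System.out.printf("density %.0f%%: %.2f probes/insert (cumulative)%n",
                        100.0 * inserted / size, (double) probes / inserted);
            }
        }
    }
}
```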