
Lecture 16: Cache Optimizations III


Cache Optimizations III

Topics: critical word first, reads over writes, merging write buffer, non-blocking cache, stream buffer, and software prefetching.
Adapted from UC Berkeley CS252 S01

Improving Cache Performance

1. Reducing miss rates
   - Larger block size
   - Larger cache size
   - Higher associativity
   - Victim caches
   - Way prediction and pseudoassociativity
   - Compiler optimization
2. Reducing miss penalty
   - Multilevel caches
   - Critical word first
   - Read miss first
   - Merging write buffers
3. Reducing miss penalty or miss rate via parallelism
   - Non-blocking caches
   - Hardware prefetching
   - Compiler prefetching
4. Reducing cache hit time
   - Small and simple caches
   - Avoiding address translation
   - Pipelined cache access
   - Trace caches

Reducing Miss Penalty Summary

CPU time = IC × (CPI_Execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Four techniques:
- Multilevel caches
- Early restart and critical word first on a miss
- Read priority over write
- Merging write buffer

Can be applied recursively to multilevel caches
- Danger is that the time to DRAM will grow with multiple levels in between
- First attempts at L2 caches can make things worse, since the increased worst case is worse

Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first (see the sketch after this slide)
Generally useful only with large blocks (relative to the available bandwidth)
Good spatial locality may reduce the benefits of early restart, as the next sequential word may be needed anyway
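To make the fill order concrete, here is a minimal sketch (not from the slides; the 8-word block and the particular requested word are assumptions) that prints the wrapped-fetch order used by critical word first:

/* Sketch: wrapped-fetch (critical word first) order for one block.
 * Assumes an 8-word block; the requested word index is illustrative. */
#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int requested = 5;                       /* word the CPU actually missed on */
    printf("fill order:");
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        /* start at the missed word, then wrap around the block */
        int w = (requested + i) % WORDS_PER_BLOCK;
        printf(" %d", w);                    /* word 5 is sent to the CPU first */
    }
    printf("\n");                            /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}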

Read Priority over Write on Miss

Write-through caches with write buffers offer RAW conflicts with main memory reads on cache misses
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
- Check the write buffer contents before the read; if there are no conflicts, let the memory access continue (sketched below)
- Usually used with no-write allocate and a write buffer

Write-back caches also want a buffer to hold displaced blocks
- Read miss replacing a dirty block
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less since it restarts as soon as the read is done
- Usually used with write allocate and a writeback buffer

[Diagram: the CPU's reads and writes ("in"/"out") pass through a write buffer in front of DRAM (or lower-level memory)]
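As a rough illustration of that RAW check, here is a sketch assuming a small 4-entry write buffer of address/data pairs; the structure and the function name read_conflicts_with_write_buffer are invented for the example:

/* Sketch: check the write buffer for a RAW conflict before a read miss
 * proceeds to memory. Buffer layout and helper names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss: if the missed address is still sitting in the write
 * buffer, forward (or drain) it; otherwise the read may bypass the
 * buffered writes and go to memory immediately. */
bool read_conflicts_with_write_buffer(uint32_t addr, uint32_t *data_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;   /* forward the newest value */
            return true;                        /* RAW conflict found       */
        }
    }
    return false;  /* no conflict: the read can take priority over writes */
}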

Merging Write Buffer

- Write merging: newly written data whose addresses fall within a block already held in the write buffer are merged into that existing entry (sketched below)
- Reduces stalls caused by the write (writeback) buffer being full
- Improves memory efficiency

Improving Cache Performance (outline repeated); next topic: reducing miss penalty or miss rate via parallelism (non-blocking caches, hardware prefetching, compiler prefetching).
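A sketch of the merging idea, assuming 4-word blocks, a 4-entry buffer, and made-up field names; a store is coalesced into an existing entry when it falls in the same block:

/* Sketch: write merging in a write buffer. One entry holds one block
 * (4 words here) with per-word valid bits; sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES      4
#define WORDS_PER_BLOCK 4
#define BLOCK_MASK      (~(uint32_t)(WORDS_PER_BLOCK * 4 - 1))

struct wb_entry {
    bool     valid;
    uint32_t block_addr;                     /* address of the block       */
    uint32_t word[WORDS_PER_BLOCK];
    bool     word_valid[WORDS_PER_BLOCK];
};
static struct wb_entry wb[WB_ENTRIES];

/* Returns true if the store was merged or buffered; false if the buffer
 * is full and the CPU must stall. */
bool buffer_store(uint32_t addr, uint32_t data) {
    uint32_t block = addr & BLOCK_MASK;
    int w = (addr >> 2) & (WORDS_PER_BLOCK - 1);

    for (int i = 0; i < WB_ENTRIES; i++)     /* merge into an existing entry */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].word[w] = data;
            wb[i].word_valid[w] = true;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)     /* otherwise take a free entry  */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].word[w] = data;
            wb[i].word_valid[w] = true;
            return true;
        }
    return false;                            /* buffer full: stall           */
}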

Non-blocking Caches to Reduce Stalls on Misses

- A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - Usually works with out-of-order execution
- "Hit under miss" reduces the effective miss penalty by allowing one outstanding cache miss; the processor keeps running until another miss happens (a rough model follows this slide)
  - Sequential memory access is enough
  - Relatively simple implementation
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Implies the memory system supports concurrency (parallel or pipelined)
  - Significantly increases the complexity of the cache controller
  - Requires multiple memory banks (otherwise it cannot be supported)
  - Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC

[Chart: "hit under n misses" for n = 1, 2, and 64 outstanding misses vs. the blocking base case, across SPEC92 integer and floating-point benchmarks]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty
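A rough, illustrative model (the trace, the one-access-per-cycle timing, and the single-outstanding-miss rule are all assumptions, not from the slides) of how hit under one miss overlaps a miss with later hits:

/* Sketch: stall accounting for a blocking cache vs. "hit under one
 * miss". One access issues per cycle, 16-cycle miss penalty; the trace
 * and the timing model are illustrative only. */
#include <stdio.h>

#define PENALTY 16

int main(void) {
    int trace[] = {1,1,0,1,1,1,1,1,0,1,1,1};   /* 1 = hit, 0 = miss */
    int n = (int)(sizeof trace / sizeof trace[0]);

    int blocking = 0, hit_under_miss = 0;
    int ready = 0;                  /* cycle when the outstanding miss returns */
    int cycle = 0;

    for (int i = 0; i < n; i++) {
        if (!trace[i]) {
            blocking += PENALTY;    /* blocking cache stalls on every miss */
            if (ready > cycle) {    /* second miss: wait for the first     */
                hit_under_miss += ready - cycle;
                cycle = ready;
            }
            ready = cycle + PENALTY;  /* this miss overlaps with later hits */
        }
        cycle++;                      /* each access issues in one cycle    */
    }
    if (ready > cycle)                /* drain the last outstanding miss    */
        hit_under_miss += ready - cycle;

    printf("blocking stalls = %d, hit-under-miss stalls = %d\n",
           blocking, hit_under_miss); /* 32 vs. 22 for this trace */
    return 0;
}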

Reducing Misses by Hardware Prefetching of Instructions & Data

- E.g., instruction prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - The extra block is placed in a "stream buffer"
  - On a miss, check the stream buffer
- Works with data blocks too:
  - Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty

Stream Buffer Diagram

[Diagram: a direct-mapped cache (tags and data) sits between the processor and the next level of cache; beside it is a stream buffer, a FIFO whose entries each hold a tag, a comparator, and one cache block of data. On a miss the head entry is checked, and the "+1" logic prefetches the next sequential block into the tail. Source: Jouppi, ISCA 1990. Shown with a single stream buffer (way); multiple ways and a filter may be used. A sketch follows.]
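A sketch of the head check and the "+1" prefetch, assuming a single 4-entry stream buffer; the structure and function names are invented for the example:

/* Sketch: single 4-entry stream buffer (FIFO of prefetched blocks).
 * On a cache miss the head of the stream buffer is checked before
 * going to the next level; names and sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SB_ENTRIES  4
#define BLOCK_BYTES 32

struct sb_entry { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct sb_entry sb[SB_ENTRIES];      /* sb[0] is the head */

static void prefetch_block(uint32_t block_tag, struct sb_entry *e) {
    e->valid = true;                        /* fetch block_tag from next level */
    e->tag = block_tag;                     /* (memory access omitted here)    */
}

/* Called only when the direct-mapped cache misses on block_tag. */
bool stream_buffer_lookup(uint32_t block_tag, uint8_t block_out[BLOCK_BYTES]) {
    if (sb[0].valid && sb[0].tag == block_tag) {
        memcpy(block_out, sb[0].data, BLOCK_BYTES);      /* move head into cache */
        memmove(&sb[0], &sb[1], (SB_ENTRIES - 1) * sizeof sb[0]);
        prefetch_block(block_tag + SB_ENTRIES, &sb[SB_ENTRIES - 1]); /* "+1" block */
        return true;                         /* miss serviced by the stream buffer */
    }
    for (int i = 0; i < SB_ENTRIES; i++) {   /* miss in the stream buffer too:     */
        sb[i].valid = false;                 /* flush and restart the stream       */
        prefetch_block(block_tag + 1 + i, &sb[i]);
    }
    return false;                            /* block itself comes from next level */
}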

Victim Buffer Diagram

[Diagram: a direct-mapped cache (tags and data) sits between the processor and the next level of cache, with a small fully associative victim cache beside it; each victim-cache entry holds a tag, a comparator, and one cache block of data. Proposed in the same paper: Jouppi, ISCA 1990. A sketch follows.]

Improving Cache Performance (outline repeated); next topic: reducing cache hit time (small and simple caches, avoiding address translation, pipelined cache access, trace caches).
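A sketch of the victim-cache check on a direct-mapped miss, assuming a 4-entry fully associative buffer with FIFO replacement; the names are invented for the example:

/* Sketch: 4-entry fully associative victim cache consulted on a
 * direct-mapped cache miss; block size and names are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VC_ENTRIES  4
#define BLOCK_BYTES 32

struct vc_entry { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct vc_entry vc[VC_ENTRIES];
static int vc_next = 0;                      /* simple FIFO replacement */

/* Called when the direct-mapped cache misses on block_tag while evicting
 * the block (cache_block, evicted_tag). A victim-cache hit swaps the two
 * blocks and costs only a short extra hit time instead of a trip to the
 * next level. */
bool victim_lookup(uint32_t block_tag,
                   uint8_t cache_block[BLOCK_BYTES],
                   uint32_t evicted_tag) {
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].tag == block_tag) {
            uint8_t tmp[BLOCK_BYTES];
            memcpy(tmp, vc[i].data, BLOCK_BYTES);        /* swap the two blocks */
            memcpy(vc[i].data, cache_block, BLOCK_BYTES);
            vc[i].tag = evicted_tag;
            memcpy(cache_block, tmp, BLOCK_BYTES);
            return true;                                  /* hit in victim cache */
        }
    }
    vc[vc_next].valid = true;                /* miss: victim cache captures the */
    vc[vc_next].tag = evicted_tag;           /* block evicted from the cache    */
    memcpy(vc[vc_next].data, cache_block, BLOCK_BYTES);
    vc_next = (vc_next + 1) % VC_ENTRIES;
    return false;                            /* fetch block_tag from next level */
}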

Fast Hits by Avoiding Address Translation

[Diagram: three organizations. Conventional organization: the CPU's virtual address (VA) goes through the TLB, and the physical address (PA) indexes the cache and then memory. Virtually addressed cache: the VA indexes the cache directly and is translated only on a miss, which creates the synonym problem. Overlapped organization: cache access proceeds in parallel with translation, with a virtual index and physical tags backed by a physically addressed L2; this requires the cache index to remain invariant across translation.]

Fast Hits by Avoiding Address Translation

Send the virtual address to the cache? Called a virtually addressed cache, or just a virtual cache, vs. a physical cache
- Every time the process is switched, the cache logically must be flushed; otherwise we get false hits
  - The cost is the time to flush plus the "compulsory" misses from an empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O uses physical addresses and must interact with the cache, so it needs virtual addresses

Antialiasing solutions
- HW guarantee: as long as the shared address bits cover the index field and the cache is direct mapped, aliased blocks must be unique; called page coloring

Solution to cache flushes
- Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the process is wrong

Fast Cache Hits by Avoiding Translation: Process ID Impact

[Chart: miss rate (y-axis, up to 20%) vs. cache size from 2 KB to 1024 KB. Black: uniprocess. Light gray: multiprocess when the cache is flushed on a switch. Dark gray: multiprocess when a process ID tag is used.]

Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address

If a direct-mapped cache is no larger than a page, then the index is a physical part of the address
- Can start the tag access in parallel with translation, so the tag can be compared against the physical address

  Page address (bits 31..12) | Page offset (bits 11..0)
  Address tag                | Index | Block offset

This limits the cache to the page size: what if we want bigger caches and the same trick? (A worked example follows.)
- Higher associativity moves the barrier to the right
- Page coloring
Compared with a virtual cache used with page coloring?
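A small worked example of this limit, assuming 4 KB pages and 32-byte blocks (both are assumptions for illustration):

/* Sketch: largest cache whose index + block offset fit inside the page
 * offset; 4 KB pages and 32 B blocks are assumed for illustration. */
#include <stdio.h>

int main(void) {
    unsigned page_bytes  = 4096;                /* page offset = bits 11..0 */
    unsigned block_bytes = 32;                  /* block offset = bits 4..0 */
    unsigned sets = page_bytes / block_bytes;   /* at most 128 sets         */

    for (unsigned assoc = 1; assoc <= 8; assoc *= 2) {
        unsigned capacity = sets * assoc * block_bytes;    /* bytes */
        printf("%u-way: up to %u KB can be physically indexed\n",
               assoc, capacity / 1024);         /* 4, 8, 16, 32 KB */
    }
    return 0;
}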

Pipelined Cache Access

Alpha 21264 data cache design
- The cache is 64 KB, 2-way set associative; it cannot be accessed within one cycle
- One cycle is used for address transfer and data transfer, pipelined with the data-array access
- The cache clock frequency doubles the processor frequency; wave pipelining is used to achieve this speed

Trace Cache

- Trace: a dynamic sequence of instructions, including taken branches
- Traces are constructed dynamically by the processor hardware, and frequently used traces are stored in the trace cache
- Example: the Intel P4 (Pentium 4) processor, which stores about 12K micro-ops

What is the Impact of What We've Learned About Caches?

[Chart: processor vs. DRAM performance, 1980-2000, log scale from 1 to 1000; CPU performance climbs steeply while DRAM improves slowly, with "Pipelined Execution & Fast Clock Rate", "Out-of-Order execution", and "Superscalar Instruction Issue" marking the processor advances along the way]
- 1960-1985: Speed = f(no. of operations)
- 1998: Speed = f(non-cached memory accesses)
- What does this mean for compilers? Operating systems? Algorithms? Data structures?

Cache Optimization Summary

Technique              MP   MR   HT   Complexity
Multilevel cache       +              2
Critical word first    +              2
Read first             +              1
Merging write buffer   +              1
Victim caches          +    +         2
Larger block           -    +         0
Larger cache                +    -    1
Higher associativity         +   -    1
Way prediction               +        2
Pseudoassociative            +        2
Compiler techniques          +        0

(MP = miss penalty, MR = miss rate, HT = hit time; "+" means the technique improves that factor, "-" means it hurts it.)

