Lecture 16
Adapted from UC Berkeley CS252 S01
Read Priority over Write on Miss
- Write-through with write buffers offers RAW conflicts with main-memory reads on cache misses
  - If we simply wait for the write buffer to empty, we might increase the read-miss penalty (by 50% on the old MIPS 1000)
  - Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue (the check is sketched below)
  - Usually used with no-write allocate and a write buffer
- Write-back also wants a buffer to hold displaced blocks
  - Read miss replacing a dirty block
  - Normal: write the dirty block to memory, and then do the read
  - Instead: copy the dirty block to a write buffer, then do the read, and then do the write
[Figure: CPU connected to the cache, with a write buffer on the in/out path to lower-level memory]
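The check-before-read policy can be pictured as a small simulator fragment. Below is a minimal C sketch; the buffer depth, block size, and names (wb_entry_t, read_miss_conflicts_with_buffer) are illustrative assumptions, not from the lecture or any particular machine.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES  4               /* assumed write-buffer depth */
#define BLOCK_BYTES 64              /* assumed cache-block size */

typedef struct {
    bool     valid;
    uint64_t block_addr;            /* block-aligned address of the buffered write */
    uint8_t  data[BLOCK_BYTES];
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* On a read miss, scan the buffer instead of draining it.  If no entry
 * matches the missing block, the memory read can start immediately and
 * the buffered writes drain later, off the critical path. */
bool read_miss_conflicts_with_buffer(uint64_t miss_addr)
{
    uint64_t block = miss_addr & ~(uint64_t)(BLOCK_BYTES - 1);
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block)
            return true;            /* RAW hazard: drain (or forward) first */
    return false;                   /* no conflict: the read may proceed now */
}

Only the conflicting case has to wait for the buffer, instead of every read miss.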
Merging Write Buffer
- Write merging: new written data whose block already has a valid entry in the write buffer is merged into that existing entry instead of occupying a new one (a sketch follows the outline below)

Improving Cache Performance
1. Reducing miss rates
   - Larger block size
   - Larger cache size
   - Higher associativity
   - Victim caches
   - Way prediction and pseudoassociativity
   - Compiler optimization
2. Reducing miss penalty
   - Multilevel caches
   - Critical word first
   - Read miss first
3. Reducing miss penalty or miss rate via parallelism
   - Non-blocking caches
   - Hardware prefetching
   - Compiler prefetching
4. Reducing cache hit time
   - Small and simple caches
   - Avoiding address translation
   - Pipelined cache access
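Write merging can be sketched the same way as the read-priority check: a store first probes the buffer for an entry holding its block and, on a match, folds its bytes into that entry rather than taking a new slot. A minimal C sketch under assumed sizes (4 entries, 64-byte blocks, stores of at most 8 bytes that do not cross a block boundary); all names are illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES  4
#define BLOCK_BYTES 64

typedef struct {
    bool     valid;
    uint64_t block_addr;            /* block-aligned address */
    uint8_t  data[BLOCK_BYTES];
    uint64_t byte_mask;             /* which bytes of the block are present */
} wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];

/* Returns false only if the buffer is full and nothing can be merged,
 * i.e. the store would have to stall until an entry drains to memory. */
bool buffer_store(uint64_t addr, const uint8_t *bytes, unsigned len /* <= 8 */)
{
    uint64_t block = addr & ~(uint64_t)(BLOCK_BYTES - 1);
    unsigned off   = (unsigned)(addr & (BLOCK_BYTES - 1));
    uint64_t mask  = ((1ULL << len) - 1) << off;

    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].block_addr == block) {   /* write merging */
            memcpy(&wb[i].data[off], bytes, len);
            wb[i].byte_mask |= mask;
            return true;
        }

    for (int i = 0; i < WB_ENTRIES; i++)
        if (!wb[i].valid) {                               /* allocate a new entry */
            wb[i].valid = true;
            wb[i].block_addr = block;
            memcpy(&wb[i].data[off], bytes, len);
            wb[i].byte_mask = mask;
            return true;
        }

    return false;                                         /* buffer full: stall */
}

Without merging, each narrow store takes its own slot, so the buffer fills and stalls the processor sooner.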
Non-blocking Caches to reduce stalls on misses
- "Hit under miss": the cache keeps supplying hits while one miss is outstanding
  - Sequential memory access is enough
  - Relatively simple implementation
- "Hit under multiple miss" or "miss under miss" may overlap multiple misses (one common implementation is sketched below)

Value of Hit Under Miss for SPEC
[Figure: "Hit under n Misses" across SPEC92 benchmarks (doduc, nasa7, espresso, ear, ora, wave5, eqntott, compress, fpppp, tomcatv, hydro2d, spice2g6, su2cor, alvinn, xlisp, swm256, mdljdp2, mdljsp2); series include "Base" and hit-under-miss configurations up to "2->64"]
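The slides do not spell out how hit-under-miss is built; one common organization (not specific to any machine named here) tracks each outstanding miss in a miss status holding register (MSHR). A minimal C sketch assuming 4 MSHRs and omitting the per-word destination lists; the names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4     /* assumed number of outstanding misses supported */

typedef struct {
    bool     valid;
    uint64_t block_addr;            /* block whose fill is in flight */
    /* per-word target list (destination register, offset, ...) omitted */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

typedef enum { ACCESS_HIT, ACCESS_MISS_QUEUED, ACCESS_STALL } access_result_t;

/* On a miss the cache keeps servicing later hits; a miss only stalls the
 * processor when it needs a new MSHR and none is free. */
access_result_t lookup(uint64_t block_addr, bool cache_hit)
{
    if (cache_hit)
        return ACCESS_HIT;                    /* hit under miss: no stall */

    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return ACCESS_MISS_QUEUED;        /* merge with an outstanding miss */

    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i].valid = true;            /* allocate: miss under miss */
            mshrs[i].block_addr = block_addr;
            return ACCESS_MISS_QUEUED;
        }

    return ACCESS_STALL;                      /* all MSHRs busy: must block */
}

With a single MSHR this behaves as "hit under one miss"; adding MSHRs gives "hit under multiple misses" and "miss under miss".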
Victim Buffer Diagram
[Figure: a direct-mapped cache (Tags | Data) with paths to and from the processor; beneath it, a small fully associative victim cache whose entries each pair a tag comparator with one cache block of data, connected to the next level of cache]
- Victim cache: fully associative (a lookup sketch follows the outline below)
- Proposed in the same paper: Jouppi, ISCA '90

Improving Cache Performance
1. Reducing miss rates
   - Larger block size
   - Larger cache size
   - Higher associativity
   - Victim caches
   - Way prediction and pseudoassociativity
   - Compiler optimization
2. Reducing miss penalty
   - Multilevel caches
   - Critical word first
   - Read miss first
3. Reducing miss penalty or miss rate via parallelism
   - Non-blocking caches
   - Hardware prefetching
   - Compiler prefetching
4. Reducing cache hit time
   - Small and simple caches
   - Avoiding address translation
   - Pipelined cache access
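The victim-buffer diagram can be read as the lookup procedure below: probe the direct-mapped cache, and on a miss probe the small fully associative victim cache; a victim hit swaps the two blocks. A minimal C sketch assuming a 1024-set L1 and 4 victim entries; data payloads and the victim replacement policy are omitted, and all names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS    1024   /* assumed direct-mapped L1 geometry */
#define VC_ENTRIES 4      /* a few fully associative victim blocks */

typedef struct { bool valid; uint64_t tag; /* data omitted */ } line_t;

static line_t l1[L1_SETS];           /* direct-mapped cache: tag per set */
static line_t victim[VC_ENTRIES];    /* victim cache: full block address as tag */

/* Returns true on an L1 or victim-cache hit. */
bool access_block(uint64_t block_addr)
{
    uint64_t set = block_addr % L1_SETS;
    uint64_t tag = block_addr / L1_SETS;

    if (l1[set].valid && l1[set].tag == tag)
        return true;                                   /* L1 hit */

    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].tag == block_addr) {
            line_t evicted = l1[set];                  /* swap L1 line and victim */
            l1[set].valid = true;
            l1[set].tag   = tag;
            victim[i].valid = evicted.valid;
            victim[i].tag   = evicted.valid ? (evicted.tag * L1_SETS + set) : 0;
            return true;                               /* victim-cache hit */
        }
    }

    /* Miss everywhere: fetch from the next level; the block displaced from
     * L1 would then be placed into the victim cache (omitted here). */
    return false;
}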
[Figure: results plotted against cache size, x-axis from 2 KB to 1024 KB]
- Indexing with the physical (page-offset) part of the address limits the cache to the page size: what if we want bigger caches and still use the same trick? (a quick check of the constraint is sketched below)
  - Higher associativity (moves the barrier to the right)
  - Page coloring
    - Compared with a virtual cache used with page coloring?
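The "limits cache to page size" point is just an inequality between the cache index and the page offset. A minimal C check, assuming all sizes are powers of two; the function name is illustrative.

#include <stdbool.h>
#include <stddef.h>

/* A virtually indexed, physically tagged cache avoids aliasing problems
 * "for free" only if the index (set bits + block-offset bits) fits inside
 * the page offset, i.e. cache_size / associativity <= page_size.
 * Bigger caches need higher associativity or OS page coloring. */
bool index_fits_in_page(size_t cache_size, size_t assoc, size_t page_size)
{
    return cache_size / assoc <= page_size;
}

/* Examples with 4 KB pages:
 *   index_fits_in_page(8 * 1024, 2, 4096)  -> true   (8 KB, 2-way)
 *   index_fits_in_page(64 * 1024, 2, 4096) -> false  (needs 16-way or page coloring)
 */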
Pipelined Cache Access
- Alpha 21264 data cache design
  - The cache is 64 KB, 2-way set associative; it cannot be accessed within one cycle
  - One cycle is used for address transfer and data transfer, pipelined with the data array access
  - The cache clock frequency doubles the processor frequency; wave pipelining is used to achieve this speed

Trace Cache
- Trace: a dynamic sequence of instructions, including taken branches
- Traces are constructed dynamically by processor hardware, and frequently used traces are stored in the trace cache
- Example: the Intel Pentium 4 trace cache stores about 12K μops
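One way to picture a trace cache line: it is tagged not just by the starting fetch PC but also by the branch outcomes the trace was built under, so a lookup must match both. The C sketch below is only an illustration under assumed sizes (6 μops per trace); the field names are assumptions, not the Pentium 4 design.

#include <stdbool.h>
#include <stdint.h>

#define TRACE_MAX_UOPS 6             /* assumed trace length */

typedef struct {
    bool     valid;
    uint64_t start_pc;               /* first instruction of the trace */
    uint8_t  branch_count;           /* conditional branches inside the trace */
    uint8_t  branch_outcomes;        /* taken/not-taken bits the trace assumes */
    uint32_t uops[TRACE_MAX_UOPS];   /* decoded micro-ops, stored contiguously */
    uint8_t  uop_count;
} trace_entry_t;

/* A trace hit requires both the fetch PC and the current branch prediction
 * to match what the trace was built with; otherwise fetch falls back to the
 * ordinary instruction cache and a new trace may be constructed. */
bool trace_hit(const trace_entry_t *e, uint64_t fetch_pc, uint8_t predicted_outcomes)
{
    uint8_t mask = (uint8_t)((1u << e->branch_count) - 1);
    return e->valid &&
           e->start_pc == fetch_pc &&
           (predicted_outcomes & mask) == (e->branch_outcomes & mask);
}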
[Figure: processor vs. DRAM performance, 1981-2000, on a log scale; pipelined execution & fast clock rate, out-of-order execution, and superscalar issue widen the processor-memory gap]
- What does this mean for data structures?
Cache optimization summary:

Technique                  Miss penalty   Miss rate   Hit time   HW complexity
Read priority over writes       +                                      1
Merging write buffer            +                                      1
Victim caches                   +              +                       2
Larger block size               -              +                       0
Larger cache size                              +           -           1
Higher associativity                           +           -           1
Way prediction                                 +                       2
Pseudoassociativity                            +                       2
Compiler techniques                            +                       0

(+ = improves the factor, - = hurts it; the last column is relative hardware complexity, 0 = easiest)