
Lecture 16: Cache Optimizations III


Cache Optimizations III

Topics: critical word first, reads over writes, merging write buffer, non-blocking cache, stream buffer, and software prefetching.
Adapted from UC Berkeley CS252 S01

Improving Cache Performance

1. Reducing miss rates
   - Larger block size
   - Larger cache size
   - Higher associativity
   - Victim caches
   - Way prediction and pseudoassociativity
   - Compiler optimization
2. Reducing miss penalty
   - Multilevel caches
   - Critical word first
   - Read miss first
   - Merging write buffers
3. Reducing miss penalty or miss rate via parallelism
   - Non-blocking caches
   - Hardware prefetching
   - Compiler prefetching
4. Reducing cache hit time
   - Small and simple caches
   - Avoiding address translation
   - Pipelined cache access
   - Trace caches

Reducing Miss Penalty Summary

CPU time = IC × (CPI_Execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Four techniques:
- Multilevel caches
- Early restart and critical word first on a miss
- Read priority over write
- Merging write buffer

Can be applied recursively to multilevel caches
- Danger is that the time to DRAM will grow with multiple levels in between
- First attempts at L2 caches can make things worse, since the increased worst case is worse

Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first (see the sketch after this slide)
Generally useful only with large blocks (relative to the available bandwidth)
Good spatial locality may reduce the benefits of early restart, as the next sequential word may be needed anyway
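To make the fill order concrete, here is a minimal sketch (not from the slides; the 8-word block and the particular requested word are assumptions) that prints the wrapped-fetch order used by critical word first:

/* Sketch: wrapped-fetch (critical word first) order for one block.
 * Assumes an 8-word block; the requested word index is illustrative. */
#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int requested = 5;                       /* word the CPU actually missed on */
    printf("fill order:");
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        /* start at the missed word, then wrap around the block */
        int w = (requested + i) % WORDS_PER_BLOCK;
        printf(" %d", w);                    /* word 5 is sent to the CPU first */
    }
    printf("\n");                            /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}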

Read Priority over Write on Miss

Write-through caches with write buffers offer RAW conflicts with main memory reads on cache misses
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
- Check the write buffer contents before the read; if there are no conflicts, let the memory access continue (sketched below)
- Usually used with no-write allocate and a write buffer

Write-back caches also want a buffer to hold displaced blocks
- Read miss replacing a dirty block
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less since it restarts as soon as the read is done
- Usually used with write allocate and a writeback buffer

[Diagram: the CPU's reads and writes ("in"/"out") pass through a write buffer in front of DRAM (or lower-level memory)]
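As a rough illustration of that RAW check, here is a sketch assuming a small 4-entry write buffer of address/data pairs; the structure and the function name read_conflicts_with_write_buffer are invented for the example:

/* Sketch: check the write buffer for a RAW conflict before a read miss
 * proceeds to memory. Buffer layout and helper names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss: if the missed address is still sitting in the write
 * buffer, forward (or drain) it; otherwise the read may bypass the
 * buffered writes and go to memory immediately. */
bool read_conflicts_with_write_buffer(uint32_t addr, uint32_t *data_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;   /* forward the newest value */
            return true;                        /* RAW conflict found       */
        }
    }
    return false;  /* no conflict: the read can take priority over writes */
}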

Merging Write Buffer

- Write merging: newly written data whose addresses fall within a block already held in the write buffer are merged into that existing entry (sketched below)
- Reduces stalls caused by the write (writeback) buffer being full
- Improves memory efficiency

Improving Cache Performance (outline repeated); next topic: reducing miss penalty or miss rate via parallelism (non-blocking caches, hardware prefetching, compiler prefetching).
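A sketch of the merging idea, assuming 4-word blocks, a 4-entry buffer, and made-up field names; a store is coalesced into an existing entry when it falls in the same block:

/* Sketch: write merging in a write buffer. One entry holds one block
 * (4 words here) with per-word valid bits; sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES      4
#define WORDS_PER_BLOCK 4
#define BLOCK_MASK      (~(uint32_t)(WORDS_PER_BLOCK * 4 - 1))

struct wb_entry {
    bool     valid;
    uint32_t block_addr;                     /* address of the block       */
    uint32_t word[WORDS_PER_BLOCK];
    bool     word_valid[WORDS_PER_BLOCK];
};
static struct wb_entry wb[WB_ENTRIES];

/* Returns true if the store was merged or buffered; false if the buffer
 * is full and the CPU must stall. */
bool buffer_store(uint32_t addr, uint32_t data) {
    uint32_t block = addr & BLOCK_MASK;
    int w = (addr >> 2) & (WORDS_PER_BLOCK - 1);

    for (int i = 0; i < WB_ENTRIES; i++)     /* merge into an existing entry */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].word[w] = data;
            wb[i].word_valid[w] = true;
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)     /* otherwise take a free entry  */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].word[w] = data;
            wb[i].word_valid[w] = true;
            return true;
        }
    return false;                            /* buffer full: stall           */
}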

Non-blocking Caches to Reduce Stalls on Misses

- A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - Usually works with out-of-order execution
- "Hit under miss" reduces the effective miss penalty by allowing one outstanding cache miss; the processor keeps running until another miss happens (a rough model follows this slide)
  - Sequential memory access is enough
  - Relatively simple implementation
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Implies the memory system supports concurrency (parallel or pipelined)
  - Significantly increases the complexity of the cache controller
  - Requires multiple memory banks (otherwise it cannot be supported)
  - Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC

[Chart: "hit under n misses" for n = 1, 2, and 64 outstanding misses vs. the blocking base case, across SPEC92 integer and floating-point benchmarks]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty
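A rough, illustrative model (the trace, the one-access-per-cycle timing, and the single-outstanding-miss rule are all assumptions, not from the slides) of how hit under one miss overlaps a miss with later hits:

/* Sketch: stall accounting for a blocking cache vs. "hit under one
 * miss". One access issues per cycle, 16-cycle miss penalty; the trace
 * and the timing model are illustrative only. */
#include <stdio.h>

#define PENALTY 16

int main(void) {
    int trace[] = {1,1,0,1,1,1,1,1,0,1,1,1};   /* 1 = hit, 0 = miss */
    int n = (int)(sizeof trace / sizeof trace[0]);

    int blocking = 0, hit_under_miss = 0;
    int ready = 0;                  /* cycle when the outstanding miss returns */
    int cycle = 0;

    for (int i = 0; i < n; i++) {
        if (!trace[i]) {
            blocking += PENALTY;    /* blocking cache stalls on every miss */
            if (ready > cycle) {    /* second miss: wait for the first     */
                hit_under_miss += ready - cycle;
                cycle = ready;
            }
            ready = cycle + PENALTY;  /* this miss overlaps with later hits */
        }
        cycle++;                      /* each access issues in one cycle    */
    }
    if (ready > cycle)                /* drain the last outstanding miss    */
        hit_under_miss += ready - cycle;

    printf("blocking stalls = %d, hit-under-miss stalls = %d\n",
           blocking, hit_under_miss); /* 32 vs. 22 for this trace */
    return 0;
}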

Reducing Misses by Hardware Prefetching of Instructions & Data

- E.g., instruction prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - The extra block is placed in a "stream buffer"
  - On a miss, check the stream buffer
- Works with data blocks too:
  - Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty

Stream Buffer Diagram

[Diagram: a direct-mapped cache (tags and data) sits between the processor and the next level of cache; beside it is a stream buffer, a FIFO whose entries each hold a tag, a comparator, and one cache block of data. On a miss the head entry is checked, and the "+1" logic prefetches the next sequential block into the tail. Source: Jouppi, ISCA 1990. Shown with a single stream buffer (way); multiple ways and a filter may be used. A sketch follows.]
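A sketch of the head check and the "+1" prefetch, assuming a single 4-entry stream buffer; the structure and function names are invented for the example:

/* Sketch: single 4-entry stream buffer (FIFO of prefetched blocks).
 * On a cache miss the head of the stream buffer is checked before
 * going to the next level; names and sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SB_ENTRIES  4
#define BLOCK_BYTES 32

struct sb_entry { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct sb_entry sb[SB_ENTRIES];      /* sb[0] is the head */

static void prefetch_block(uint32_t block_tag, struct sb_entry *e) {
    e->valid = true;                        /* fetch block_tag from next level */
    e->tag = block_tag;                     /* (memory access omitted here)    */
}

/* Called only when the direct-mapped cache misses on block_tag. */
bool stream_buffer_lookup(uint32_t block_tag, uint8_t block_out[BLOCK_BYTES]) {
    if (sb[0].valid && sb[0].tag == block_tag) {
        memcpy(block_out, sb[0].data, BLOCK_BYTES);      /* move head into cache */
        memmove(&sb[0], &sb[1], (SB_ENTRIES - 1) * sizeof sb[0]);
        prefetch_block(block_tag + SB_ENTRIES, &sb[SB_ENTRIES - 1]); /* "+1" block */
        return true;                         /* miss serviced by the stream buffer */
    }
    for (int i = 0; i < SB_ENTRIES; i++) {   /* miss in the stream buffer too:     */
        sb[i].valid = false;                 /* flush and restart the stream       */
        prefetch_block(block_tag + 1 + i, &sb[i]);
    }
    return false;                            /* block itself comes from next level */
}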

Victim Buffer Diagram

[Diagram: a direct-mapped cache (tags and data) sits between the processor and the next level of cache, with a small fully associative victim cache beside it; each victim-cache entry holds a tag, a comparator, and one cache block of data. Proposed in the same paper: Jouppi, ISCA 1990. A sketch follows.]

Improving Cache Performance (outline repeated); next topic: reducing cache hit time (small and simple caches, avoiding address translation, pipelined cache access, trace caches).
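A sketch of the victim-cache check on a direct-mapped miss, assuming a 4-entry fully associative buffer with FIFO replacement; the names are invented for the example:

/* Sketch: 4-entry fully associative victim cache consulted on a
 * direct-mapped cache miss; block size and names are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VC_ENTRIES  4
#define BLOCK_BYTES 32

struct vc_entry { bool valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct vc_entry vc[VC_ENTRIES];
static int vc_next = 0;                      /* simple FIFO replacement */

/* Called when the direct-mapped cache misses on block_tag while evicting
 * the block (cache_block, evicted_tag). A victim-cache hit swaps the two
 * blocks and costs only a short extra hit time instead of a trip to the
 * next level. */
bool victim_lookup(uint32_t block_tag,
                   uint8_t cache_block[BLOCK_BYTES],
                   uint32_t evicted_tag) {
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].tag == block_tag) {
            uint8_t tmp[BLOCK_BYTES];
            memcpy(tmp, vc[i].data, BLOCK_BYTES);        /* swap the two blocks */
            memcpy(vc[i].data, cache_block, BLOCK_BYTES);
            vc[i].tag = evicted_tag;
            memcpy(cache_block, tmp, BLOCK_BYTES);
            return true;                                  /* hit in victim cache */
        }
    }
    vc[vc_next].valid = true;                /* miss: victim cache captures the */
    vc[vc_next].tag = evicted_tag;           /* block evicted from the cache    */
    memcpy(vc[vc_next].data, cache_block, BLOCK_BYTES);
    vc_next = (vc_next + 1) % VC_ENTRIES;
    return false;                            /* fetch block_tag from next level */
}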

Fast Hits by Avoiding Address Translation

[Diagram: three organizations. Conventional organization: the CPU's virtual address (VA) goes through the TLB, and the physical address (PA) indexes the cache and then memory. Virtually addressed cache: the VA indexes the cache directly and is translated only on a miss, which creates the synonym problem. Overlapped organization: cache access proceeds in parallel with translation, with a virtual index and physical tags backed by a physically addressed L2; this requires the cache index to remain invariant across translation.]

Fast Hits by Avoiding Address Translation

Send the virtual address to the cache? Called a virtually addressed cache, or just a virtual cache, vs. a physical cache
- Every time the process is switched, the cache logically must be flushed; otherwise we get false hits
  - The cost is the time to flush plus the "compulsory" misses from an empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O uses physical addresses and must interact with the cache, so it needs virtual addresses

Antialiasing solutions
- HW guarantee: as long as the shared address bits cover the index field and the cache is direct mapped, aliased blocks must be unique; called page coloring

Solution to cache flushes
- Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the process is wrong

Fast Cache Hits by Avoiding Translation: Process ID Impact

[Chart: miss rate (y-axis, up to 20%) vs. cache size from 2 KB to 1024 KB. Black: uniprocess. Light gray: multiprocess when the cache is flushed on a switch. Dark gray: multiprocess when a process ID tag is used.]

Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address

If a direct-mapped cache is no larger than a page, then the index is a physical part of the address
- Can start the tag access in parallel with translation, so the tag can be compared against the physical address

  Page address (bits 31..12) | Page offset (bits 11..0)
  Address tag                | Index | Block offset

This limits the cache to the page size: what if we want bigger caches and the same trick? (A worked example follows.)
- Higher associativity moves the barrier to the right
- Page coloring
Compared with a virtual cache used with page coloring?
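A small worked example of this limit, assuming 4 KB pages and 32-byte blocks (both are assumptions for illustration):

/* Sketch: largest cache whose index + block offset fit inside the page
 * offset; 4 KB pages and 32 B blocks are assumed for illustration. */
#include <stdio.h>

int main(void) {
    unsigned page_bytes  = 4096;                /* page offset = bits 11..0 */
    unsigned block_bytes = 32;                  /* block offset = bits 4..0 */
    unsigned sets = page_bytes / block_bytes;   /* at most 128 sets         */

    for (unsigned assoc = 1; assoc <= 8; assoc *= 2) {
        unsigned capacity = sets * assoc * block_bytes;    /* bytes */
        printf("%u-way: up to %u KB can be physically indexed\n",
               assoc, capacity / 1024);         /* 4, 8, 16, 32 KB */
    }
    return 0;
}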

Pipelined Cache Access

Alpha 21264 data cache design
- The cache is 64 KB, 2-way set associative; it cannot be accessed within one cycle
- One cycle is used for address transfer and data transfer, pipelined with the data-array access
- The cache clock frequency doubles the processor frequency; wave pipelining is used to achieve this speed

Trace Cache

- Trace: a dynamic sequence of instructions, including taken branches
- Traces are constructed dynamically by the processor hardware, and frequently used traces are stored in the trace cache
- Example: the Intel P4 (Pentium 4) processor, which stores about 12K micro-ops

What is the Impact of What We've Learned About Caches?

[Chart: processor vs. DRAM performance, 1980-2000, log scale from 1 to 1000; CPU performance climbs steeply while DRAM improves slowly, with "Pipelined Execution & Fast Clock Rate", "Out-of-Order execution", and "Superscalar Instruction Issue" marking the processor advances along the way]
- 1960-1985: Speed = f(no. of operations)
- 1998: Speed = f(non-cached memory accesses)
- What does this mean for compilers? Operating systems? Algorithms? Data structures?

Cache Optimization Summary

Technique              MP   MR   HT   Complexity
Multilevel cache       +              2
Critical word first    +              2
Read first             +              1
Merging write buffer   +              1
Victim caches          +    +         2
Larger block           -    +         0
Larger cache                +    -    1
Higher associativity         +   -    1
Way prediction               +        2
Pseudoassociative            +        2
Compiler techniques          +        0

(MP = miss penalty, MR = miss rate, HT = hit time; "+" means the technique improves that factor, "-" means it hurts it.)

