
Cache Performance

Reading: Chap. 8.7


Recall: Cache Usage
• Cache Read (or Write) Hit/Miss: The read (or write) operation can/cannot be performed on the cache.
[Diagram: Processor ↔ Cache (unit: cache line) ↔ Main Memory (unit: word)]
• Cache Block / Line: The unit composed of multiple successive memory words (size: cache block > word).
– The contents of a cache block (of memory words) are loaded into or unloaded from the cache as a unit.
• Mapping Functions: Decide how the cache is organized and how main memory addresses are mapped to cache locations.
• Replacement Algorithms: Decide which block to unload from the cache when the cache is full.
CSCI2510 Lec08: Cache Performance 2
Outline
• Performance Evaluation
– Cache Hit/Miss Rate and Miss Penalty
– Average Memory Access Time

• Performance Enhancements
– Prefetch
– Memory Module Interleaving
– Load-Through

CSCI2510 Lec08: Cache Performance 3
Cache Hit/Miss Rate and Miss Penalty
• Cache Hit:
– The access can be done in the cache.
– Hit Rate: The ratio of the number of hits to all accesses.
• Hit rates over 0.9 are essential for high-performance PCs.

• Cache Miss:
– The access cannot be done in the cache.
– Miss Rate: The ratio of the number of misses to all accesses.
– When a cache miss occurs, extra time is needed to bring the block from the slower main memory to the faster cache.
• During that time, the processor is stalled.
– Miss Penalty: The total access time experienced (seen) by the processor when a cache miss occurs.
CSCI2510 Lec08: Cache Performance 4
Average Memory Access Time

AMAT = Hit time + Miss rate × Miss penalty

First, the miss penalty of each cache level can be defined recursively:
Miss penalty of L1 cache = Access time of L2 cache + Miss rate of L2 × Miss penalty of L2 cache
Miss penalty of L2 cache = Access time of main memory + Miss rate of main memory × Miss penalty of main memory (the last term vanishes when main memory always hits)

Equivalently, writing each level's hit ratio as the fraction of all accesses served by that level:
AMAT = (L1 cache hit time × L1 cache hit ratio) + (L2 cache hit time × L2 cache hit ratio) + (Main memory access time × Main memory hit ratio)
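To make the recursion concrete, here is a minimal Python sketch of the formula above; the numeric parameters (L1: 1 cycle / 5% misses, L2: 4 cycles / 10% misses, memory: 50 cycles) are hypothetical, not figures from the lecture:

def amat(levels, memory_time):
    # levels: list of (hit_time, miss_rate) pairs from L1 outward.
    # AMAT = hit_time + miss_rate * miss_penalty, where the miss
    # penalty of each level is the AMAT of the next level.
    if not levels:
        return memory_time          # a miss at the last cache goes to memory
    hit_time, miss_rate = levels[0]
    return hit_time + miss_rate * amat(levels[1:], memory_time)

print(amat([(1, 0.05), (4, 0.10)], 50))   # 1 + 0.05*(4 + 0.10*50) = 1.45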
An Example of Miss Penalty
• Miss Penalty: the total access time experienced (seen) by the processor when a cache miss occurs.
• Consider a system with only one level of cache with the following parameters:
– Word access time to the cache: 𝑡
– Word access time to the main memory: 10𝑡
[Diagram: CPU ↔ Cache (𝑡) ↔ Main Memory (10𝑡)]
– When a cache miss occurs, a cache block of 8 words will be transferred from the main memory to the cache.
• Time to transfer the first word: 10𝑡
• Time to transfer each of the remaining 7 words: 𝑡 (hardware support!)
• The miss penalty can be derived as:
𝑡 + 10𝑡 + 7 × 𝑡 + 𝑡 = 19𝑡
(the initial cache access that results in a miss) + (transfer of the first word) + (transfer of the remaining 7 words) + (CPU accesses the requested data in the cache)
CSCI2510 Lec08: Cache Performance 7
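As a sanity check, the slide's timing model can be written down directly; the function and parameter names are mine:

def miss_penalty(block_words=8, mem_time=10, cache_time=1):
    # Miss penalty in units of the cache word access time t.
    initial_access = cache_time                      # the cache access that misses
    first_word = mem_time                            # first word from main memory
    rest_words = (block_words - 1) * cache_time      # remaining words (hardware burst)
    final_access = cache_time                        # CPU reads the word from the cache
    return initial_access + first_word + rest_words + final_access

print(miss_penalty())   # 1 + 10 + 7 + 1 = 19, i.e. 19t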
Average Memory Access Time
• Consider a system with only one level of cache:
– ℎ: Cache Hit Rate
– 1 − ℎ: Cache Miss Rate
– 𝐶: Cache Access Time
– 𝑀: Miss Penalty
• It mainly consists of the time to access a block in the main memory.
(Expected value in probability: 𝐸[𝑋] = Σᵢ 𝑥ᵢ × 𝑓(𝑥ᵢ))

• The average memory access time can be defined as:
𝑡𝑎𝑣𝑔 = ℎ × 𝐶 + (1 − ℎ) × 𝑀
[Diagram: Processor → Cache (hit: ℎ) → Main Memory (miss: 1 − ℎ)]

– For example, given ℎ = 0.9, 𝐶 = 1 𝑐𝑦𝑐𝑙𝑒, 𝑀 = 19 𝑐𝑦𝑐𝑙𝑒𝑠:
 Avg. memory access time: 0.9 × 1 + 0.1 × 19 = 2.8 (𝑐𝑦𝑐𝑙𝑒𝑠)
CSCI2510 Lec08: Cache Performance 9
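The slide's example, restated as a short Python check (names are mine):

def t_avg(h, C, M):
    # t_avg = h*C + (1-h)*M, the expected memory access time
    return h * C + (1 - h) * M

print(t_avg(0.9, 1, 19))   # 0.9*1 + 0.1*19 = 2.8 cycles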
Real-life Example: Intel Core 2 Duo
• Number of Processors :1
• Number of Cores : 2 per processor
• Number of Threads : 2 per processor
• Name : Intel Core 2 Duo E6600
• Code Name : Conroe
• Specification : Intel(R) Core(TM)2 CPU 6600@2.40GHz
• Technology : 65 nm
• Core Speed : 2400 MHz
• Multiplier x Bus speed : 9.0 x 266.0 MHz = 2400 MHz
• Front-Side-Bus speed : 4 x 266.0MHz = 1066 MHz
• Instruction Sets : MMX, SSE, SSE2, SSE3, SSSE3, EM64T
• L1 Cache
– Data Cache : 2 x 32 KBytes, 8-way set associative, 64-byte line size
– Instruction Cache : 2 x 32 KBytes, 8-way set associative, 64-byte line size
• L2 Cache : 4096 KBytes, 16-way set associative, 64-byte line size
CSCI2510 Lec08: Cache Performance 10
Separate Instruction/Data Caches (1/2)
• Consider the system with only one level of cache:
– Word access time to the cache: 1 cycle
– Word access time to the main memory: 10 𝑐𝑦𝑐𝑙𝑒𝑠
– When a cache miss occurs, a cache block of 8 words will be
transferred from the main memory to the cache.
• Time to transfer the first word: 10 𝑐𝑦𝑐𝑙𝑒𝑠
• Time to transfer one word of the rest 7 words: 1 cycle
– Miss Penalty: 1 + 10 + 7 × 1 + 1 = 19 (𝑐𝑦𝑐𝑙𝑒𝑠)

• Assume there are 130 memory accesses in total:
– 100 memory accesses for instructions, with hit rate 0.95
– 30 memory accesses for data (operands), with hit rate 0.90

CSCI2510 Lec08: Cache Performance 12


Separate Instruction/Data Caches (2/2)
• Total execution cycles without cache:
𝑡𝑤𝑖𝑡ℎ𝑜𝑢𝑡 = 100 × 10 + 30 × 10 = 1300 𝑐𝑦𝑐𝑙𝑒𝑠
– Every memory access results in a read of a memory word (of latency 10 cycles).

• Total execution cycles with cache, where each access costs the average memory access time ℎ × 𝐶 + (1 − ℎ) × 𝑀 for instructions and for data, respectively:
𝑡𝑤𝑖𝑡ℎ = 100 × (0.95 × 1 + 0.05 × 19) + 30 × (0.9 × 1 + 0.1 × 19) = 274 𝑐𝑦𝑐𝑙𝑒𝑠

• The performance improvement:
𝑡𝑤𝑖𝑡ℎ𝑜𝑢𝑡 / 𝑡𝑤𝑖𝑡ℎ = 1300 / 274 ≈ 4.74 (𝑠𝑝𝑒𝑒𝑑 𝑢𝑝!)
CSCI2510 Lec08: Cache Performance 13
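The whole calculation on this slide can be reproduced in a few lines of Python (variable names are mine):

def t_avg(h, C, M):
    return h * C + (1 - h) * M   # average memory access time

insn_accesses, data_accesses = 100, 30
mem_time, miss_penalty = 10, 19

t_without = (insn_accesses + data_accesses) * mem_time
t_with = (insn_accesses * t_avg(0.95, 1, miss_penalty)
          + data_accesses * t_avg(0.90, 1, miss_penalty))

print(t_without, t_with, t_without / t_with)   # 1300 274.0 ~4.74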
Class Exercise 8.2
• Consider the same system with one level of cache.
– Word access time to the cache: 1 cycle
– Word access time to the main memory: 10 𝑐𝑦𝑐𝑙𝑒𝑠
– Miss Penalty: 1 + 10 + 7 × 1 + 1 = 19 (𝑐𝑦𝑐𝑙𝑒𝑠)
• What is the performance difference between this
cache and an ideal cache?
– Ideal Cache: All the accesses can be done in cache.

CSCI2510 Lec08: Cache Performance 14


Multi-Level Caches
• In high-performance processors, two levels of caches are normally used, L1 and L2.
– L1 Cache: Must be very fast, as it determines the memory access time seen by the processor.
– L2 Cache: Can be slower, but it should be much larger than the L1 cache to ensure a high hit rate.
• The avg. memory access time with two levels of caches:
𝑡𝑎𝑣𝑔 = ℎ1 × 𝐶1 + (1 − ℎ1) × (ℎ2 × 𝑀𝐿1 + (1 − ℎ2) × 𝑀𝐿2)
– ℎ1/ℎ2: hit rate of L1 cache / L2 cache
– 𝐶1/𝐶2/𝑀𝑒𝑚: access time to L1 cache / L2 cache / main memory
– 𝑀𝐿1: miss penalty of an L1 miss & L2 hit
• E.g., 𝑀𝐿1 = 𝐶1 + 𝐶2 + 𝐶1
– 𝑀𝐿2: miss penalty of an L1 miss & L2 miss
• E.g., 𝑀𝐿2 = 𝐶1 + 𝐶2 + 𝑀𝑒𝑚 + 𝐶2 + 𝐶1
(Compare with the avg. memory access time of one level of cache: 𝑡𝑎𝑣𝑔 = ℎ × 𝐶 + (1 − ℎ) × 𝑀)
CSCI2510 Lec08: Cache Performance 16
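A sketch of the two-level formula, using the slide's example expressions for 𝑀𝐿1 and 𝑀𝐿2; the timings C1 = 1, C2 = 4, Mem = 19 cycles are assumptions for illustration:

def t_avg_2level(h1, h2, C1, C2, Mem):
    M_L1 = C1 + C2 + C1              # L1 miss, L2 hit (slide's example)
    M_L2 = C1 + C2 + Mem + C2 + C1   # L1 miss, L2 miss (slide's example)
    return h1 * C1 + (1 - h1) * (h2 * M_L1 + (1 - h2) * M_L2)

print(t_avg_2level(0.9, 0.9, 1, 4, 19))
# 0.9*1 + 0.1*(0.9*6 + 0.1*29) = 1.73 cycles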
Class Exercise 8.3
• Given a system with one level of cache, and a system with two levels of caches.
• Assume the hit rates of L1 cache and L2 cache (if
any) are both 0.9.
• What are the probabilities that the miss penalty must be paid to read a block from memory, in each of the two systems?

CSCI2510 Lec08: Cache Performance 17


How to Improve the Performance?
• Recall the system with only one level of cache:
– ℎ: Cache Hit Rate
– 1 − ℎ: Cache Miss Rate
– 𝐶: Cache Access Time
– 𝑀: Miss Penalty
• It mainly consists of the time to access a block in the main memory.
• The average memory access time can be defined as:
𝑡𝑎𝑣𝑔 = ℎ × 𝐶 + (1 − ℎ) × 𝑀
• Possible ways to further reduce 𝑡𝑎𝑣𝑔 :
– Use a faster cache (i.e., 𝐶 ↓)? $$$...
– Improve the hit rate (i.e., ℎ ↑)?
– Reduce the miss penalty (i.e., 𝑀 ↓)?
CSCI2510 Lec08: Cache Performance 20
How to Improve Hit Rate?
• How about larger block size?
– Larger blocks take more advantage of the spatial locality.
• Spatial Locality: If all items in a larger block are needed in a
computation, it is better to load them into cache in a single miss.
– Larger blocks are effective only up to a certain size:
• Too many items will remain unused before the block is replaced.
• It takes longer time to transfer larger blocks, and may also increase
the miss penalty.
– Block sizes of 16 to 128 bytes are most popular.

[Diagram: Processor ↔ Cache ↔ Main Memory, comparing transfers with block size B vs. a larger block size B]
CSCI2510 Lec08: Cache Performance 22
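Under the lecture's earlier timing model (1 cycle per cache word, 10 cycles for the first word from memory), the miss penalty grows roughly linearly with block size; a quick sketch, where the linear burst model is an assumption:

def miss_penalty(B, mem_time=10, cache_time=1):
    # initial miss + first word + (B-1) burst words + final cache read
    return cache_time + mem_time + (B - 1) * cache_time + cache_time

for B in (4, 8, 16, 32):
    print(B, miss_penalty(B))   # 4->15, 8->19, 16->27, 32->43 cycles

Whether a larger B pays off depends on how much it improves the hit rate ℎ in 𝑡𝑎𝑣𝑔 = ℎ × 𝐶 + (1 − ℎ) × 𝑀.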


Prefetch: More rather than Larger
• Prefetch: Load more (rather than larger) blocks into the cache before they are needed, while the CPU is busy.
– Prefetch instructions can be inserted by the programmer or the compiler.

• Some prefetched data may be replaced from the cache before they are ever used.
– Nevertheless, the overall effect on performance is positive.
– Most processors support a prefetch instruction.

[Diagram: Processor ↔ Cache ← (prefetch) ← Main Memory]
CSCI2510 Lec08: Cache Performance 23
Outline
• Performance Evaluation
– Cache Hit/Miss Rate and Miss Penalty
– Average Memory Access Time

• Performance Enhancements
– Prefetch
– Memory Module Interleaving
– Load-Through

CSCI2510 Lec08: Cache Performance 24


Memory Module Interleaving (3/3)
• Which scheme below can be better interleaved?
– Scheme (a): Consecutive words in the same module.
– Scheme (b): Consecutive words in successive modules.
• Keeps multiple modules busy at any one time.
[Diagram: two ways to split a memory address into a module number (k bits, selecting one of 2^k modules) and an address within the module (m bits). In scheme (a), the high-order k bits select the module, so consecutive addresses fall in the same module. In scheme (b), the low-order k bits select the module, so consecutive addresses fall in successive modules. Each module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR).]
CSCI2510 Lec08: Cache Performance 27
Example of Memory Module Interleaving
• Consider a cache read miss, where we need to load a block of 8 words from main memory to the cache.
• Assume consecutive words are in successive modules for better interleaving (i.e., Scheme (b)).
• For every memory module:
– Address Buffer Register (ABR) & Data Buffer Register (DBR)
– Module operations:
• Send an address to the ABR: 𝟏 cycle
• Read the first word from the module into the DBR: 𝟔 cycles
• Read a subsequent word from the module into the DBR: 𝟒 cycles
• Read the data from the DBR: 𝟏 cycle
• Assume module reads can be performed in parallel with accesses to the ABR or DBR, but only one of a module's ABR or DBR can be accessed at a time.
CSCI2510 Lec08: Cache Performance 28
Without Interleaving (Single Module)
• Total cycles to read a single word from the module:
– 1 cycle to send the address to the ABR
– 6 cycles to read the first word into the DBR
– 1 cycle to read the data from the DBR
 1 + 6 + 1 = 8 𝑐𝑦𝑐𝑙𝑒𝑠
• Total cycles to read an 8-word block from the module:
– For each subsequent word, the 4-cycle module read proceeds in parallel with reading the previous word from the DBR (and with sending the next address).
[Timeline: cycles 1-36; 1st word: 1 (address) + 6 (read) + 1 (DBR); 2nd-8th words: 4 (read) + 1 (DBR) each, overlapped as above]
 1 + 6 + 4 × 7 + 1 = 36 𝑐𝑦𝑐𝑙𝑒𝑠
CSCI2510 Lec08: Cache Performance 29
With Interleaving
• Module operations (as before): send an address to the ABR: 𝟏 cycle; read the first word: 𝟔 cycles; read a subsequent word: 𝟒 cycles; read the data from the DBR: 𝟏 cycle.
• Memory addresses can only be sent to the modules one by one. Why? The bus is shared.
• Total cycles to read an 8-word block from four interleaved memory modules (#0-#3):
[Timeline: cycles 1-15; words 1-4 come from modules #0-#3, each taking 1 (address) + 6 (read) + 1 (DBR), with the address sends staggered by one cycle; words 5-8 come from modules #0-#3 again, each taking 1 + 4 + 1, with address sends and DBR reads overlapping the module reads]
 1 + 6 + 1 × 8 = 15 𝑐𝑦𝑐𝑙𝑒𝑠
CSCI2510 Lec08: Cache Performance 30
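The closed forms from the last two slides, side by side (constant names are mine):

ADDR, FIRST, NEXT, DBR = 1, 6, 4, 1   # module operation costs in cycles

def block_read_single_module(words=8):
    # 1 + 6 + 4*(words-1) + 1: subsequent module reads overlap the DBR transfers
    return ADDR + FIRST + (words - 1) * NEXT + DBR

def block_read_interleaved(words=8):
    # 1 + 6 + 1*words: one DBR transfer completes per cycle once the
    # staggered module reads start finishing
    return ADDR + FIRST + words * DBR

print(block_read_single_module())   # 36 cycles
print(block_read_interleaved())     # 15 cycles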
Load-Through
• Consider a read cache miss:
– Copy the block containing the requested word to the cache.
– The requested word is forwarded to the CPU only after the entire block has been loaded.
• Load-Through: Instead of waiting for the whole block to be transferred, send the requested word to the processor as soon as it is ready.
– Pros: Reduces the CPU's waiting time (i.e., the miss penalty).
– Cons: At the expense of more complex circuitry ($).
[Diagram: Main Memory copies a block to the Cache while the requested word is forwarded to the Processor. Load-through: forward the requested word to the processor as soon as it is read from the main memory!]
CSCI2510 Lec08: Cache Performance 34
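As a back-of-envelope illustration (my own model, not from the slides): if the requested word is the i-th word of the 8-word block (i = 0 for the first word), load-through lets the CPU resume as soon as that word arrives:

def penalty_load_through(i, mem_time=10, cache_time=1):
    # initial miss + wait until word i arrives from memory; assumes the
    # word is forwarded to the CPU at no extra cost (an assumption!)
    return cache_time + mem_time + i * cache_time

print(penalty_load_through(0))   # 11 cycles, vs. 19 without load-through
print(penalty_load_through(7))   # 18 cycles: the gain shrinks for later words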
Summary
• Performance Evaluation
– Cache Hit/Miss Rate and Miss Penalty
– Average Memory Access Time

• Performance Enhancements
– Prefetch
– Memory Module Interleaving
– Load-Through

CSCI2510 Lec08: Cache Performance 35
