
COMPUTER ARCHITECTURE

CACHE MEMORY ORGANIZATION

Falguni Sinhababu
Government College of Engineering and Leather Technology

INTRODUCTION
▪ A small, fast memory is introduced between the CPU and the
  slow main memory to improve the average access time.
▪ Improves memory system performance:
  ▪ By exploiting spatial and temporal locality.
▪ Basic issues:
  ▪ If the cache is large, the cost increases and the address
    decoder becomes much more complex.
  ▪ It should be large enough that, most of the time, the
    processor finds the required instruction or data in the cache.
  ▪ In practice, the optimum cache size is found to be in the
    range of 1 KB to 512 KB.

▪ Let us consider a single-level cache, and the part of the
  memory hierarchy consisting of the cache memory and the
  main memory.


▪ Cache memory is logically divided into blocks or lines, where
  every block or line typically contains 8 to 256 bytes.
▪ When the CPU wants to access a word in memory, special
  hardware first checks whether it is present in the cache.
  ▪ If so (a cache hit), the word is accessed directly from the cache.
  ▪ If not (a cache miss), the block containing the requested word is
    brought from main memory into the cache.
  ▪ For writes, the CPU can sometimes also write directly to main memory.
▪ The objective is to keep the commonly used blocks in the cache.
  ▪ This results in significantly improved performance due to the
    property of locality of reference.


CACHE MEMORY: BASIC PRINCIPLES
(Flowchart: servicing a cache access)
1. Start: receive the address (RA) from the CPU.
2. Is the block containing RA in the cache?
3. Yes (hit): fetch the word at RA from the cache and deliver it to the CPU. Done.
4. No (miss): access main memory for the block containing RA,
   allocate a cache slot for the main memory block,
   load the main memory block into the cache slot,
   and deliver the word at RA to the CPU. Done.


CACHE DESIGN
1. Block Placement
   ▪ Where can a main memory block be placed in the cache?
2. Block Identification
   ▪ How do we find whether a main memory block is present in the
     cache or not?
3. Block Replacement
   ▪ On a cache miss, how do we choose which entry to replace
     in the cache?
4. Write Strategy
   ▪ When a cache block is updated, how is the update propagated
     to main memory?


1. BLOCK PLACEMENT
▪ Determined by a mapping algorithm.
  ▪ Specifies which main memory blocks can reside in
    which cache memory blocks.
▪ At any given time, only a small subset of the main
  memory blocks can be held in cache memory.
▪ Three common block mapping techniques are used:
  ▪ Direct mapping
  ▪ Associative mapping
  ▪ (N-way) Set associative mapping

A 2-LEVEL MEMORY HIERARCHY EXAMPLE
▪ Consider a two-level hierarchy consisting of cache
  memory and main memory.
▪ The cache memory consists of 256 blocks (lines)
  of 32 words each.
  ▪ Total cache size is 256 × 32 = 8192 (8K) words.
  ▪ Since 32 = 2^5, 5 bits are needed to address a word within a block.
▪ Main memory is addressable by a 24-bit address.
  ▪ Total size of the main memory = 2^24 = 16M words.
  ▪ Number of 32-word blocks in main memory = 16M / 32 = 512K.


DIRECT MAPPING
▪ Each main memory block can be placed in exactly one
  block in the cache.
▪ The mapping function (which main memory block goes into
  which cache block) is:
  ▪ cache block = (main memory block) % (number of cache blocks)
▪ For the running example:
  ▪ cache block = (main memory block) % 256
▪ Some example mappings (see the sketch below):
  ▪ 0 → 0, 1 → 1, 255 → 255, 256 → 0, 257 → 1, 512 → 0, 513 → 1, etc.
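
A minimal Python sketch of this mapping function, using the 256-block cache of
the running example (the function name and block numbers are only illustrative):

# Direct mapping: each main-memory block maps to exactly one cache block.
NUM_CACHE_BLOCKS = 256

def direct_map(mm_block: int) -> int:
    """Cache block index that main-memory block 'mm_block' maps to."""
    return mm_block % NUM_CACHE_BLOCKS

# Reproduces the example mappings listed above.
for b in (0, 1, 255, 256, 257, 512, 513):
    print(f"MM block {b:>3} -> cache block {direct_map(b)}")
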
▪ In the 24-bit memory address:
  ▪ One cache block holds 32 = 2^5 words, so 5 bits select the word
    within a block.
  ▪ There are 256 = 2^8 cache blocks, so 8 bits select the cache block.
  ▪ The remaining 24 - 8 - 5 = 11 higher-order bits form the TAG.


▪ The block replacement algorithm is trivial, as there is no choice.
  ▪ More than one MM block is mapped onto the same cache block.
  ▪ May lead to contention even if the cache is not full.
  ▪ The new block simply replaces the old block.
  ▪ May lead to poor performance if both blocks are frequently used.
▪ The MM address is divided into three fields: TAG, BLOCK and WORD.
  ▪ When a new block is loaded into the cache, the 8-bit BLOCK field
    determines the cache block where it is to be stored.
  ▪ The higher-order 11 bits are stored in a TAG register associated
    with that cache block.
  ▪ When accessing a memory word, the corresponding TAG fields are
    compared; a match implies a HIT (see the sketch below).
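
A hedged Python sketch of a direct-mapped lookup for the running example,
assuming the 11/8/5-bit TAG/BLOCK/WORD split described above (the tags and
valid lists only stand in for the controller's TAG registers and valid bits;
they are not part of the slides):

# Direct-mapped lookup: 24-bit address = 11-bit TAG | 8-bit BLOCK | 5-bit WORD.
NUM_BLOCKS = 256
tags = [None] * NUM_BLOCKS       # TAG register per cache block
valid = [False] * NUM_BLOCKS     # valid bit per cache block

def split_address(addr: int):
    word = addr & 0x1F           # lowest 5 bits: word within the block
    block = (addr >> 5) & 0xFF   # next 8 bits: cache block index
    tag = addr >> 13             # highest 11 bits: TAG
    return tag, block, word

def is_hit(addr: int) -> bool:
    tag, block, _ = split_address(addr)
    return valid[block] and tags[block] == tag

def load_block(addr: int) -> None:
    """On a miss, the fetched block replaces whatever occupies its slot."""
    tag, block, _ = split_address(addr)
    tags[block], valid[block] = tag, True
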



ASSOCIATIVE MAPPING
▪ Here, a main memory (MM) block can potentially reside in any
  cache block position.
▪ The memory address is divided into two fields: TAG and WORD.
  ▪ When a block is loaded into the cache from MM, the higher-order 19 bits
    of the address are stored in the TAG register corresponding to that
    cache block.
  ▪ When accessing memory, the 19-bit TAG field of the address is compared
    with the TAG registers of all the cache blocks (see the sketch below).
  ▪ Requires associative memory for storing the TAG values.
    ▪ High cost / lack of scalability.
▪ Because of the complete freedom in block positioning, a wide range
  of replacement algorithms is possible.
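
A hedged Python sketch of a fully associative lookup for the running example
(24-bit address = 19-bit TAG | 5-bit WORD). In hardware all TAG comparisons
happen in parallel; the loop below only models the logic:

# Fully associative lookup: the block may be in any of the 256 cache blocks.
NUM_BLOCKS = 256
tags = [None] * NUM_BLOCKS
valid = [False] * NUM_BLOCKS

def lookup(addr: int):
    tag = addr >> 5                              # upper 19 bits; low 5 bits select the word
    for block in range(NUM_BLOCKS):              # done in parallel in hardware
        if valid[block] and tags[block] == tag:
            return block                         # hit: this block holds the word
    return None                                  # miss
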


N-WAY SET ASSOCIATIVE MAPPING
▪ A group of N consecutive blocks in the cache is called a set.
▪ This scheme is a compromise between direct mapping and
  associative mapping.
  ▪ Like direct mapping, an MM block is mapped to a set:
    ▪ Set number = (MM block number) % (number of sets in cache)
  ▪ The block can be placed anywhere within the set (there are N choices).
▪ The value of N is a design parameter:
  ▪ If N = 1, this is the same as direct mapping.
  ▪ If N = number of cache blocks, this is the same as associative mapping.
  ▪ Typical values of N used in practice are 2, 4 and 8.



4-WAY SET ASSOCIATIVE MAPPING
▪ In the 24-bit memory address:
  ▪ One cache block holds 32 = 2^5 words, so 5 bits select the word.
  ▪ There are 64 = 2^6 sets, so 6 bits select the set.
  ▪ The remaining 24 - 6 - 5 = 13 bits are required for the TAG.


▪ Illustration for N = 4 (see the sketch below):
  ▪ Number of sets in the cache memory = 64.
  ▪ Memory blocks are mapped to a set using a modulo-64 operation.
  ▪ Example: MM blocks 0, 64, 128, etc. are all mapped to set 0,
    where they can occupy any of the 4 available positions.
▪ The MM address is divided into 3 fields: TAG, SET and WORD.
  ▪ The TAG field of the address must be associatively compared
    to the TAG fields of the 4 blocks of the selected set.
  ▪ Thus, instead of requiring a single large associative memory, we
    need a number of very small associative memories, only one of
    which is used at a time.
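
A hedged Python sketch of the 4-way set-associative lookup for this example,
assuming the 13/6/5-bit TAG/SET/WORD split above (the data structures are
illustrative only):

# 4-way set-associative lookup: 64 sets of 4 blocks each.
NUM_SETS, WAYS = 64, 4
tags = [[None] * WAYS for _ in range(NUM_SETS)]
valid = [[False] * WAYS for _ in range(NUM_SETS)]

def split_address(addr: int):
    word = addr & 0x1F              # 5 bits: word within the block
    set_index = (addr >> 5) & 0x3F  # 6 bits: set index
    tag = addr >> 11                # 13 bits: TAG
    return tag, set_index, word

def lookup(addr: int):
    tag, s, _ = split_address(addr)
    for way in range(WAYS):         # only the 4 TAGs of set 's' are searched
        if valid[s][way] and tags[s][way] == tag:
            return s, way           # hit
    return None                     # miss: some block in set 's' must be replaced
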



2. BLOCK IDENTIFICATION
▪ Caches include a TAG associated with each cache block.
  ▪ The TAG of every cache block where the requested block may be
    present needs to be compared with the TAG field of the MM address.
  ▪ All the candidate TAGs are compared in parallel, as speed is important.
▪ Mapping algorithms:
  ▪ Direct mapping requires a single comparison.
  ▪ Associative mapping requires a fully associative search over the TAGs
    of all cache blocks.
  ▪ Set associative mapping requires a limited associative search over the
    TAGs of only the selected set.
▪ Use of a valid bit:
  ▪ There must be a way to know whether a cache block contains valid or
    garbage information.
  ▪ A valid bit is kept alongside the TAG, indicating whether the block
    contains valid data.
  ▪ If the valid bit is not set, there is no need to match the corresponding TAG.


3. BLOCK REPLACEMENT
▪ With fully associative or set associative mapping, there
  can be several blocks to choose from for replacement
  when a miss occurs.
▪ Two primary strategies are used:
  a) Random: the candidate block is selected randomly for
     replacement. This simple strategy tends to spread allocations
     uniformly.
  b) Least Recently Used (LRU): the block replaced is the one
     that has not been used for the longest period of time.
▪ LRU makes use of a corollary of temporal locality: if recently
  used blocks are likely to be used again, then the best candidate
  for replacement is the least recently used block.


WHICH BLOCK SHOULD BE REPLACED ON A CACHE MISS?
▪ To implement the LRU algorithm, the cache controller
  must track the LRU block as the computation proceeds.
▪ Example: consider a 4-way set associative cache.
  ▪ For tracking the LRU block within a set, we use a 2-bit counter
    with every block (see the sketch after this list).
  ▪ When a hit occurs:
    ▪ The counter of the referenced block is reset to 0.
    ▪ Counters with values originally lower than the referenced one are
      incremented by 1; all others remain unchanged.
  ▪ When a miss occurs:
    ▪ If the set is not full, the counter associated with the newly loaded
      block is set to 0, and all other counters are incremented by 1.
    ▪ If the set is full, the block with counter value 3 is removed, the new
      block is put in its place with its counter set to 0, and the other 3
      counters are incremented by 1.
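
A hedged Python sketch of this 2-bit-counter bookkeeping for one set of a
4-way cache (the blocks and counters lists are illustrative, not the
controller's actual structures):

# LRU bookkeeping with 2-bit counters for a single 4-way set.
WAYS = 4
blocks = [None] * WAYS      # TAG (or None) held in each way
counters = [0] * WAYS       # 2-bit LRU counter per way

def access(tag):
    if tag in blocks:                              # hit
        way = blocks.index(tag)
        ref = counters[way]
        for i in range(WAYS):                      # bump counters originally lower
            if blocks[i] is not None and counters[i] < ref:
                counters[i] += 1
        counters[way] = 0
    elif None in blocks:                           # miss, set not full
        way = blocks.index(None)
        for i in range(WAYS):
            if blocks[i] is not None:
                counters[i] += 1
        blocks[way], counters[way] = tag, 0
    else:                                          # miss, set full: evict counter == 3
        way = counters.index(3)
        for i in range(WAYS):
            if i != way:
                counters[i] += 1
        blocks[way], counters[way] = tag, 0
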



EXAMPLE
One set of a 4-way set-associative cache; each entry shows the block and its
2-bit counter (before → after).

If a hit occurs (reference to B3):
  B0: 01 → 10,  B1: 00 → 01,  B2: 11 → 11,  B3: 10 → 00

If a miss occurs and the set is not full (B6 loaded into the empty way):
  B0: 10 → 11,  B1: 00 → 01,  B2: 01 → 10,  B6: (new) → 00

If a miss occurs and the set is full (B0, with counter 11, is replaced by B4):
  B4: (new) → 00,  B1: 01 → 10,  B2: 10 → 11,  B6: 00 → 01




TYPES OF CACHE MISSES
1. Compulsory miss
▪ On the first access to a block, the block must be brought into the cache.
▪ Also known as cold start misses, or first reference misses.
▪ Can be reduced by increasing cache block size or prefetching cache
blocks
2. Capacity miss
▪ Blocks may be replaced from cache because the cache cannot hold all
the blocks needed by a program.
▪ Can be reduced by increasing the total cache size.
3. Conflict miss
▪ In case of direct mapping or N-way set associative mapping, several
blocks may be mapped to the same block or set in the cache.
▪ May result in block replacements and hence access misses, even though
all the cache blocks may not be occupied.
▪ Can be reduced by increasing the value of N (cache associativity)



4. WRITE STRATEGY
▪ Statistical data suggest that read operations
  (including instruction fetches) dominate
  processor cache accesses.
  ▪ All instruction fetch operations are reads.
  ▪ Most instructions do not write to memory.
▪ Making the common case fast:
  ▪ Optimize cache accesses for reads.
  ▪ But Amdahl's law reminds us that for high-performance
    designs we cannot ignore the speed of write operations.

CACHE WRITE HIT POLICY
▪ The common case (the read operation) is relatively easy to make
  faster:
  ▪ The candidate block(s) can be read at the same time that the TAG is
    compared with the address.
  ▪ If the read is a HIT, the data is passed to the CPU; if it is a miss, the
    data just read is simply ignored.
▪ Problems with write operations:
  ▪ The CPU specifies the size of the write (between 1 and 8 bytes), and only
    that portion of a block has to be changed.
    ▪ This implies a read-modify-write sequence of operations on the block.
    ▪ Also, the process of modifying the block cannot begin until the TAG is
      checked to see if it is a hit.
  ▪ Thus, cache write operations take more time than cache read operations.


CACHE WRITE STRATEGIES
▪ Cache designs can be classified based on the write and memory
  update strategy being used:
  1. Write through / store through
  2. Write back / copy back

  (Diagram: CPU ⇄ Cache Memory ⇄ Main Memory)


1. WRITE THROUGH STRATEGY
▪ Information is written to both the cache block and the main memory block.
▪ Features:
  ▪ Easy to implement.
  ▪ Read misses never result in writes to main memory, since cache blocks
    are never dirty (no write-back is needed on replacement).
  ▪ Main memory always has the most updated version of the data, which is
    important for I/O operations and multiprocessor systems.
  ▪ A write buffer is often used to reduce the CPU write stall time while
    data is written to main memory.

  (Diagram: CPU ⇄ Cache Memory ⇄ Main Memory, with a Write Buffer on the
   path to main memory)


▪ Perfect write buffer:
  ▪ All writes are handled by the write buffer; there is no stalling for
    write operations.
  ▪ For a unified L1 cache:
    Stall cycles per memory access = %reads × (1 - H_L1) × t_MM
▪ Realistic write buffer:
  ▪ A percentage of write stalls are not eliminated, because the write
    buffer is sometimes full.
  ▪ For a unified L1 cache:
    Stall cycles per memory access =
      (%reads × (1 - H_L1) + %write stalls not eliminated) × t_MM
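
A small worked example of these formulas; all of the numbers below (75% reads,
95% L1 hit rate, 100-cycle main-memory access, 5% of write stalls not
eliminated) are assumed for illustration only:

# Illustrative stall-cycle calculation for a write-through, unified L1 cache.
reads, h_l1, t_mm = 0.75, 0.95, 100
write_stalls_not_eliminated = 0.05

perfect = reads * (1 - h_l1) * t_mm
realistic = (reads * (1 - h_l1) + write_stalls_not_eliminated) * t_mm

print(f"Perfect write buffer:   {perfect:.2f} stall cycles per access")    # 3.75
print(f"Realistic write buffer: {realistic:.2f} stall cycles per access")  # 8.75
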



2. WRITE BACK STRATEGY
▪ Information is written only to the cache block.
▪ A modified cache block is written back to MM only when it is
  replaced (see the sketch below).
▪ Features:
  ▪ Writes occur at the speed of the cache memory.
  ▪ Multiple writes to a cache block require only one write to MM.
  ▪ Uses less memory bandwidth, making it attractive for multiprocessors.
  ▪ Write-back cache blocks can be clean or dirty.
    ▪ A status bit, called the dirty bit or modified bit, is associated with
      each cache block; it indicates whether the block was modified in the
      cache (0: clean, 1: dirty).
    ▪ If the status is clean, the block is not written back to MM when it
      is replaced.
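
A hedged Python sketch of the dirty-bit bookkeeping for a single cache line
(the CacheLine class and the mm_write callback are only illustrative):

# Write-back bookkeeping with a dirty bit for one cache line.
class CacheLine:
    def __init__(self):
        self.valid, self.dirty = False, False
        self.tag, self.data = None, None

def write_hit(line: CacheLine, data) -> None:
    line.data = data
    line.dirty = True                  # modified only in the cache

def replace(line: CacheLine, new_tag, new_data, mm_write) -> None:
    if line.valid and line.dirty:
        mm_write(line.tag, line.data)  # single write-back of a dirty block
    line.tag, line.data = new_tag, new_data
    line.valid, line.dirty = True, False
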



(Diagram: CPU ⇄ Cache Memory ⇄ Main Memory; several writes go to a cache
block, but only a single write to main memory occurs when the block is
replaced.)


CACHE WRITE MISS POLICY
▪ Since the written information is usually not needed immediately by the
  CPU, two options are possible on a cache write miss:
  a) Write allocate:
     ▪ The missed block is loaded into the cache on a write miss, followed by
       the write-hit actions.
     ▪ Requires a cache block to be allocated for the block being written:
       on a miss, the missed block is first read from MM and brought into the
       cache, and then the write is performed.
     ▪ It can work with both write through and write back, but it is mostly
       used with write back.
  b) No-write allocate:
     ▪ The block is modified only in MM and is not loaded into the cache.
     ▪ No cache block is allocated for the block being written.


▪ Typical usages:
  a) Write-back cache with write allocate
     ▪ In order to capture subsequent writes to the block in the cache.
  b) Write-through cache with no-write allocate
     ▪ Since subsequent writes still have to go to MM anyway.


ESTIMATION OF MISS PENALTIES
▪ Write-through cache
  ▪ Write hit operation:
    ▪ Without a write buffer, miss penalty = t_MM
    ▪ With a perfect write buffer, miss penalty = 0
▪ Write-back cache
  ▪ Write hit operation:
    ▪ Miss penalty = 0

▪ Write-back cache (with write allocate)
  ▪ Write hit operation:
    ▪ Miss penalty = 0
  ▪ Read or write miss operation (see the example below):
    ▪ If the block being replaced is clean, miss penalty = t_MM
      ▪ No need to write the block back to MM.
      ▪ The new block is brought in from MM (t_MM).
    ▪ If the block being replaced is dirty, miss penalty = 2 t_MM
      ▪ The replaced block is written back to MM (t_MM).
      ▪ The new block is brought in from MM (t_MM).
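
A small worked example combining the two miss cases; the dirty fraction (30%
of replaced blocks are dirty) and t_MM = 100 cycles are assumed for
illustration only:

# Illustrative average miss penalty for a write-back, write-allocate cache.
t_mm = 100               # main-memory access time (cycles), assumed
dirty_fraction = 0.30    # share of replaced blocks that are dirty, assumed

avg_miss_penalty = (1 - dirty_fraction) * t_mm + dirty_fraction * 2 * t_mm
print(f"Average miss penalty: {avg_miss_penalty:.0f} cycles")   # 130 cycles
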



CHOICE OF BLOCK SIZE IN CACHE
▪ Larger block sizes reduce compulsory misses.
▪ Larger block sizes also reduce the number of blocks in the cache,
  increasing conflict misses.
▪ Typical block size: 16 or 32 bytes.


INSTRUCTION-ONLY AND DATA-ONLY CACHES
▪ Caches are sometimes divided into an instruction-only and a
  data-only cache.
  ▪ The CPU knows whether it is issuing an instruction address or a
    data address.
  ▪ There are then 2 separate ports, thereby doubling the bandwidth
    between the CPU and the cache.
  ▪ Typically, the L1 cache is split into an L1 i-cache and an L1 d-cache.
▪ Separate caches also offer the opportunity to optimize each
  cache separately:
  ▪ Instruction and data reference patterns are different.
  ▪ Different capacities, block sizes and associativities (i.e. N) can be
    chosen for each.

INTEL CORE-I7 CACHE HIERARCHY
▪ L1 i-cache and d-cache:
  ▪ 32 KB, 8-way set associative
  ▪ Access: 4 cycles
▪ L2 unified cache:
  ▪ 256 KB, 8-way set associative
  ▪ Access: 11 cycles
▪ L3 unified cache:
  ▪ 8 MB, 16-way set associative
  ▪ Access: 30 - 40 cycles
▪ Block size: 64 bytes for all caches.
▪ For the Intel Core-i7 (Sandy Bridge): L1 and L2 are within each core,
  L3 is within the chip, and MM is outside the chip.

IMPROVING CACHE PERFORMANCE



AVERAGE MEMORY ACCESS TIME (AMAT)
▪ We shall discuss various techniques by which the
  performance of a cache memory can be improved.
▪ We consider the following expression for average
  memory access time (AMAT):
  ▪ AMAT = hit time + miss rate × miss penalty
▪ To improve the performance of a cache memory system, we can
  try to reduce one or more of the three parameters: hit time,
  miss rate and miss penalty (see the worked example below).
▪ Here, hit time is the time to access the cache on a hit, and miss
  penalty is the time to service a miss from main memory.
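
A small worked example of the AMAT formula; the hit time, miss rate and miss
penalty below are assumed values, not from the slides:

# Illustrative AMAT calculation.
hit_time, miss_rate, miss_penalty = 2, 0.05, 100   # cycles, fraction, cycles

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat} cycles")                     # 2 + 0.05 * 100 = 7 cycles
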
TERMINOLOGIES
▪ Hit time: how long it takes for data to be sent from the cache to
  the processor. This is usually fast, on the order of 1-3 clock cycles.
▪ Hit ratio: the number of cache hits divided by the total number of
  cache accesses (hits + misses).
▪ Miss rate: the fraction (percentage) of accesses that miss in the
  cache; miss rate = 1 - hit ratio.
▪ Miss penalty: the time to copy data from main memory to the cache.
  This often requires dozens of clock cycles (at least).



BASIC CACHE OPTIMIZATION TECHNIQUES
▪ The basic optimization techniques are as follows:
  ▪ Reducing the miss rate: use larger block sizes, a larger
    cache size, and higher associativity.
  ▪ Reducing the miss penalty: use multi-level caches and
    give priority to reads over writes.
  ▪ Reducing the hit time: avoid address translation when
    indexing the cache, so that the TAG can be extracted and
    the matching started earlier.

A) USING LARGER BLOCK SIZE
▪ Increasing the block size helps in reducing the miss rate.
  ▪ Larger blocks reduce compulsory misses, since they take better
    advantage of spatial locality.
▪ Drawbacks:
  ▪ The miss penalty increases, as larger blocks have to be transferred.
  ▪ Since the number of blocks decreases, the number of conflict misses
    and even capacity misses can increase.
  ▪ The overhead may outweigh the gain.

  (Figure from Hennessy & Patterson, "Computer Architecture:
   A Quantitative Approach", 4/e)



B) USING LARGER CACHE MEMORY
▪ Increasing the size of the cache is a straightforward way to
  reduce capacity misses.
▪ Drawbacks:
  ▪ Increases the hit time, since the number of TAGs to be
    searched in parallel can be large.
  ▪ Results in higher cost and power consumption.
▪ Traditionally popular for on-chip caches.


C) USING HIGHER ASSOCIATIVITY
▪ For an N-way set-associative cache, the miss rate reduces as we
  increase N.
  ▪ Reduces conflict misses, as there are more choices for placing a block
    in the cache.
▪ General rules of thumb:
  ▪ An 8-way set-associative cache is as effective as a fully associative
    cache in practical scenarios.
  ▪ A direct-mapped cache of size N has about the same miss rate as a 2-way
    set-associative cache of size N/2.
▪ Drawbacks:
  ▪ Increases the hit time, as a larger associative memory has to be searched.
  ▪ Increases power consumption due to the higher complexity of the
    associative memory.


D) USING MULTILEVEL CACHES
▪ Here we try to reduce the miss penalty, not the miss rate.
▪ The performance gap between processors and memory
  increases with time.
  ▪ Use a faster cache to keep pace with the speed of the processor.
  ▪ Make the cache larger to bridge the widening gap between the
    processor and MM.
▪ We can do both in a multi-level cache system (see the sketch below):
  ▪ The L1 cache can be small enough to match the clock cycle
    time of the fast processor.
  ▪ The L2 cache can be large enough to capture many accesses
    that would otherwise go to MM, thereby reducing the effective
    miss penalty.
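
With two levels, the L1 miss penalty becomes the L2 access time plus the
main-memory penalty for those accesses that also miss in L2. A small
illustrative calculation, with all numbers assumed:

# Illustrative two-level AMAT: L1 misses are served by L2, and only L2 misses
# pay the full main-memory penalty. All numbers below are assumed.
hit_time_l1, miss_rate_l1 = 2, 0.05    # cycles; fraction of all accesses
hit_time_l2, miss_rate_l2 = 11, 0.20   # cycles; fraction of L1 misses that also miss in L2
t_mm = 100                             # main-memory access time (cycles)

amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * t_mm)
print(f"AMAT = {amat:.2f} cycles")     # 2 + 0.05 * (11 + 0.20 * 100) = 3.55
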



MULTILEVEL INCLUSION VS MULTILEVEL EXCLUSION
▪ Multilevel inclusion: requires that all data present in L1 are
  also present in L2.
  ▪ Desirable because consistency between I/O and the caches
    can be determined just by checking the L2 cache.
▪ Multilevel exclusion: requires that data in L1 are never
  found in L2.
  ▪ Typically, a cache miss in L1 results in a swap of blocks
    between L1 and L2 rather than the replacement of an L1
    block with an L2 block.
  ▪ This policy prevents wasting space in the L2 cache.
  ▪ May make sense if the designer can only afford an L2
    cache that is slightly bigger than the L1 cache.

THANK YOU
