Memory Hierarchy
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
Processor-Memory Performance Gap
[Figure: processor vs. memory performance over time – the gap grew ~50% per year]
Why Memory Hierarchy?
 Applications want unlimited amounts of memory
with low latency
 Fast memory is more expensive per bit
 Solution
 Organize memory system into a hierarchy
 Entire addressable memory space available in largest,
slowest memory
 Incrementally smaller & faster memories
 Temporal & spatial locality ensure that nearly all
references can be found in the smaller memories
 Gives the illusion of a large, fast memory being presented
to the processor
Memory Hierarchy
Why Hierarchical Design?
 Becomes more crucial with multi-core
processors
 Aggregate peak bandwidth grows with number of cores
 Intel Core i7 can generate 2 references per core per clock
 With 4 cores & a 3.2 GHz clock:
25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references/second
= 409.6 GB/s
 DRAM bandwidth is only 6% of this (25 GB/s)
 Requires
 Multi-port, pipelined caches
 2 levels of cache per core
 Shared third-level cache on chip
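A quick back-of-the-envelope check of these figures, as a short Python sketch (the core count, clock rate, & reference widths are those stated above; variable names are only for illustration):

cores = 4
clock_hz = 3.2e9
data_refs = cores * clock_hz * 2      # 2 x 64-bit data references per core per clock
instr_refs = cores * clock_hz * 1     # 1 x 128-bit instruction reference per core per clock
peak_gb_per_s = (data_refs * 8 + instr_refs * 16) / 1e9
print(peak_gb_per_s)                  # 409.6 GB/s
print(25 / peak_gb_per_s)             # DRAM at 25 GB/s covers only ~6% of the demand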
Core i7 Die & Major Components
Source: Intel
Performance vs. Power
 High-end microprocessors have >10 MB on-chip
cache
 Consumes large amount of chip area & power budget
 Leakage (static) current – drawn even when the cache is not operating
 Active (dynamic) current – drawn when operating
 Major limiting factor for processors used in
mobile devices
Definitions – Blocks
 Multiple blocks are moved between levels in the
hierarchy
 Spatial locality improves efficiency
 Blocks are tagged with their memory address
 Tags are searched in parallel
Source: http://archive.arstechnica.com/paedia/c/caching/m-caching-5.html
Definitions – Associativity
 Defines where blocks can be placed in a cache
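To make the placement rule concrete, a small Python sketch of how a block address maps onto a set in a set-associative cache (the cache size, block size, & associativity below are made-up illustrative values):

CACHE_SIZE = 32 * 1024          # 32 KB cache (illustrative)
BLOCK_SIZE = 64                 # 64-byte blocks
ASSOCIATIVITY = 4               # 4-way set associative

num_sets = (CACHE_SIZE // BLOCK_SIZE) // ASSOCIATIVITY

def placement(address):
    block_addr = address // BLOCK_SIZE
    set_index = block_addr % num_sets    # the one set this block may occupy
    tag = block_addr // num_sets         # stored to identify the block within the set
    return set_index, tag

# Direct-mapped is the special case ASSOCIATIVITY = 1 (one candidate slot);
# fully associative means a block can go anywhere, so there is no set index.
print(placement(0x12345678))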
Pentium 4 vs. Opteron Memory Hierarchy

CPU               | Pentium 4 (3.2 GHz)                                   | Opteron (2.8 GHz)
Instruction Cache | Trace cache (8K micro-ops)                            | 2-way associative, 64 KB, 64B block
Data Cache        | 8-way associative, 16 KB, 64B block, inclusive in L2  | 2-way associative, 64 KB, 64B block, exclusive to L2
L2 Cache          | 8-way associative, 2 MB, 128B block                   | 16-way associative, 1 MB, 64B block
Prefetch          | 8 streams to L2                                       | 1 stream to L2
Memory            | 200 MHz x 64 bits                                     | 200 MHz x 128 bits
Definitions – Updating Cache
 Write-through
 Update cache block & all other levels below
 Use write buffers to speed up
 Write-back
 Update cache block
 Update lower level when replacing cached block
 Use write buffers to speed up
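A minimal sketch of when the lower level gets updated under each policy (the single cached block & the dict standing in for memory are purely illustrative):

memory = {0x40: 0}                                  # stand-in for the next lower level
cache = {"addr": 0x40, "data": 0, "dirty": False}   # one cached block

def write(addr, value, policy):
    cache["data"] = value
    if policy == "write-through":
        memory[addr] = value        # lower level updated on every write
    else:                           # write-back: mark dirty, update lower level later
        cache["dirty"] = True

def replace_block(policy):
    if policy == "write-back" and cache["dirty"]:
        memory[cache["addr"]] = cache["data"]   # write back only when the block is replaced

write(0x40, 7, "write-back")
print(memory[0x40])     # 0 – memory is stale until the block is replaced
replace_block("write-back")
print(memory[0x40])     # 7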
Definitions – Replacing Cached Blocks
 Cache replacement policies
 Random
 Least Recently Used (LRU)
 Need to track last access time
 Least Frequently Used (LFU)
 Need to track number of accesses
 First In First Out (FIFO)
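To show the bookkeeping LRU needs, a minimal sketch of one cache set in Python (the number of ways & the tag sequence are made up for illustration):

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()         # tag -> data, least recently used first

    def access(self, tag):
        if tag in self.blocks:              # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:   # miss in a full set: evict the LRU tag
            self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])   # ['miss', 'miss', 'hit', 'miss', 'miss']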
Definitions – Cache Misses
 When a required item is not found in the cache
 Miss rate – fraction of cache accesses that result
in a miss
 Types of misses
 Compulsory – 1st access to a block
 Capacity – limited cache capacity forces blocks to be
evicted from the cache & later retrieved
 Conflict – if the placement strategy is not fully associative
 Average memory access time
= Hit time + Miss rate x Miss penalty
Definitions – Cache Misses (Cont.)
 Memory stall cycles
= Instruction count x Memory accesses per instruction
x Miss rate x Miss penalty
 Memory accesses per instruction
= Instruction memory accesses per instruction + Data memory
accesses per instruction
 Example
 50% of instructions are loads & stores, the miss rate is 2%, &
the miss penalty is 25 clock cycles. Suppose CPI is 1. How much
faster would the machine be if all instructions were cache hits?
 Memory accesses per instruction = 1 + 0.5 = 1.5, so stall cycles
per instruction = 1.5 x 0.02 x 25 = 0.75
 Speedup = [IC x (1 + 0.75) x CC] / [IC x 1 x CC] = 1.75
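The same arithmetic as a short Python check (the numbers are those of the example; the 1-cycle hit time used for AMAT at the end is an added assumption):

cpi_base = 1.0
mem_accesses_per_instr = 1.0 + 0.5    # instruction fetch + 50% loads/stores
miss_rate = 0.02
miss_penalty = 25

stalls_per_instr = mem_accesses_per_instr * miss_rate * miss_penalty   # 0.75
print((cpi_base + stalls_per_instr) / cpi_base)                        # 1.75x speedup if all hits

hit_time = 1                                  # assumed 1-cycle hit time
print(hit_time + miss_rate * miss_penalty)    # AMAT = 1.5 cycles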
Cache Performance Metrics
 Hit time
 Miss rate
 Miss penalty
 Cache bandwidth
 Power consumption
6 Basic Cache Optimization Techniques
1. Larger block sizes
 Reduce compulsory misses
 Increase capacity & conflict misses
 Increase miss penalty
 Choosing a correct block size is challenging
2. Larger total cache capacity to reduce miss rate
 Reduce misses
 Increase hit time
 Increase power consumption & cost
3. Higher number of cache levels
 Reduce overall memory access time
6 Basic Cache Optimization Techniques
(Cont.)
4. Higher associativity
 Reduce conflict misses
 Increase hit time
 Increase power consumption
5. Giving priority to read misses over writes
 Allow reads to check write buffer
 Reduce miss penalty
6. Avoiding address translation in cache indexing
 Virtual to physical address mapping
 Reduce hit time
10 Advanced Cache Optimization
Techniques
 5 categories
1. Reducing hit time
2. Increasing cache bandwidth
3. Reducing miss penalty
4. Reducing miss rate
5. Reducing miss penalty or miss rate via parallelism
Advanced Optimizations 1
 Small & simple 1st level caches
 Recently, L1 cache sizes have increased only slightly
or not at all
 Critical timing path in a cache hit
 addressing the tag memory, then
 comparing tags, then
 selecting the correct data item (way)
 Direct-mapped caches can overlap tag compare &
transmission of data
 Improve hit time
 Lower associativity reduces power because fewer
cache lines are accessed
L1 Size & Associativity – Access Time
L1 Size & Associativity – Energy
Advanced Optimizations 2
 Way Prediction
 Keep extra bits in the cache to predict the way (block
within the set) that the next access will use
 Improve hit time
 Mis-prediction increases hit time
 Prediction accuracy
 > 90% for 2-way
 > 80% for 4-way
 Instruction cache has better accuracy than Data cache
 First used on MIPS R10000 in mid-90s
 Used on ARM Cortex-A8
Advanced Optimizations 3
 Pipeline cache access
 Enable L1 cache access to be multiple cycles
 Examples
 Pentium – 1 cycle
 Pentium Pro to Pentium III – 2 cycles
 Pentium 4 to Core i7 – 4 cycles
 Improve bandwidth
 Makes it easier to increase associativity
 Increase hit time
 Increases branch mis-prediction penalty
Advanced Optimizations 4
 Nonblocking Caches
 Allow hits before previous
misses complete
 “Hit under miss”
 “Hit under multiple miss”
 L2 must support this
 In general, processors can
hide L1 miss penalty but
not L2 miss penalty
 Increase bandwidth
Advanced Optimizations 5
 Multibanked Caches
 Organize cache as independent banks to support
simultaneous access
 Examples
 ARM Cortex-A8 supports 1-4 banks for L2
 Intel i7 supports 4 banks for L1 & 8 banks for L2
 Interleave banks according to block address
 Increase bandwidth
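Sequential interleaving simply spreads consecutive block addresses across the banks; a one-function sketch (the bank count & block size are illustrative):

NUM_BANKS = 4
BLOCK_SIZE = 64

def bank_of(address):
    return (address // BLOCK_SIZE) % NUM_BANKS

# Consecutive blocks land in different banks, so they can be accessed in parallel.
print([bank_of(a) for a in range(0, 8 * BLOCK_SIZE, BLOCK_SIZE)])   # [0, 1, 2, 3, 0, 1, 2, 3]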
Advanced Optimizations 6
 Critical Word First, Early Restart
 Critical word first
 Request missed word from memory first
 Send it to processor as soon as it arrives
 Early restart
 Request words in normal order
 Send missed word to processor as soon as it arrives
 Reduce miss penalty
 Effectiveness depends on block size & likelihood of
another access to portion of the block that has not yet
been fetched
Advanced Optimizations 7 - 10
 Merging Write Buffer
 Reduce miss penalty
 Compiler Optimizations
 Examples
 Loop interchange – swap nested loops to access memory in
sequential order (see the sketch below)
 Blocking – instead of accessing entire rows or columns,
subdivide matrices into blocks that fit in the cache
 Reduce miss rate
 Hardware Prefetching
 Fetch 2 blocks on a miss – the requested block & the next
consecutive block
 Reduce miss penalty or miss rate
 Compiler Prefetching
 Reduce miss penalty or miss rate
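A small illustration of loop interchange on a row-major 2-D array (N is arbitrary; shown in Python for brevity, though the payoff is most visible in a compiled language such as C):

N = 512
x = [[0] * N for _ in range(N)]

# Before: inner loop walks down a column, so successive accesses are N elements apart
for j in range(N):
    for i in range(N):
        x[i][j] = 2 * x[i][j]

# After interchange: inner loop walks along a row, so accesses are sequential in memory
for i in range(N):
    for j in range(N):
        x[i][j] = 2 * x[i][j]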
Summary of Techniques
Memory Technologies
 Performance metrics
 Latency is the concern of caches
 Bandwidth is the concern of multiprocessors & I/O
 Access time
 Time between read request & when desired word arrives
 Cycle time
 Minimum time between unrelated requests to memory
 DRAM used for main memory
 SRAM used for cache
Memory Technology (Cont.)
 Amdahl’s rule of thumb
 Memory capacity should grow linearly with processor
speed
 Unfortunately, memory capacity & speed haven’t kept
pace with processors
 Some optimizations
 Multiple accesses to same row
 Synchronous DRAM (SDRAM)
 Added clock to DRAM interface
 Burst mode with critical word first
 Wider interfaces
 Double data rate (DDR)
 Multiple banks on each DRAM device
DRAM Optimizations
Peak MB/sec = Clock rate (MHz) x 2 (transfers per clock with DDR) x 8 bytes (64-bit bus)
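For example, a DDR3-1600 DIMM has an 800 MHz bus clock, which gives the familiar module rating (a small check, assuming the usual 64-bit DIMM interface):

clock_mhz = 800                      # DDR3-1600: 800 MHz bus clock, 2 transfers per clock
mb_per_sec = clock_mhz * 2 * 8       # x 8 bytes for the 64-bit interface
print(mb_per_sec)                    # 12800 MB/s, i.e. the PC3-12800 rating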
DRAM Power Consumption
 Reducing power in DRAMs
 Lower voltage
 Low power mode (ignores clock, continues to refresh)
Flash Memory
 Type of EEPROM
 Must be erased (in blocks) before being
overwritten
 Non volatile
 Limited number of write cycles
 Cheaper than DRAM, more expensive than disk
 Slower than SRAM, faster than disk
Modern Memory Hierarchy
Source: http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
Intel Optane Non-volatile Memory
Source: www.forbes.com/sites/tomcoughlin/2018/06/11/intel-optane-finally-on-dimms/#5792e114190b
Intel Optane (Cont.)
Source: www.anandtech.com/show/9541/intel-announces-optane-storage-brand-for-3d-xpoint-products
Virtual Memory
 Each process has its own address space
 Protection via virtual memory
 Keeps processes in their own memory space
 Role of architecture
 Provide user mode & supervisor mode
 Protect certain aspects of CPU state
 Provide mechanisms for switching between user
mode & supervisor mode
 Provide mechanisms to limit memory accesses
 Provide TLB to translate addresses
Paging Hardware With TLB
 Parallel search on the TLB
 Address translation of (p, d)
 If p is in the TLB (associative registers),
get the frame # out
 Otherwise get the frame # from the
page table in memory
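A minimal sketch of that lookup order in Python (the page size, TLB contents, & page table below are made up for illustration):

PAGE_SIZE = 4096
tlb = {2: 7}                     # page number -> frame number (small associative memory)
page_table = {1: 3, 2: 7, 5: 9}  # stand-in for the page table kept in memory

def translate(virtual_addr):
    p, d = divmod(virtual_addr, PAGE_SIZE)   # split into page number p & offset d
    if p in tlb:                             # TLB hit: fast path
        frame = tlb[p]
    else:                                    # TLB miss: consult the page table
        frame = page_table[p]
        tlb[p] = frame                       # cache the translation for next time
    return frame * PAGE_SIZE + d

print(hex(translate(2 * PAGE_SIZE + 0x10)))  # TLB hit  -> 0x7010
print(hex(translate(5 * PAGE_SIZE + 0x20)))  # TLB miss -> 0x9020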
Summary
 Caching techniques are continuing to evolve
 Multiple techniques are often combined
 Cache sizes are unlikely to increase significantly
 Better performance when programs are
optimized based on the cache architecture