Memory Hierarchy
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
Processor-Memory Performance Gap
[Figure: processor vs. memory performance over time – the gap grew ~50% per year]
Why Memory Hierarchy?
 Applications want unlimited amounts of memory
with low latency
 Fast memory is more expensive per bit
 Solution
 Organize memory system into a hierarchy
 Entire addressable memory space available in largest,
slowest memory
 Incrementally smaller & faster memories
 Temporal & spatial locality ensure that nearly all
references can be found in the smaller memories
 Gives the illusion of a large, fast memory being presented
to the processor
Memory Hierarchy
Why Hierarchical Design?
 Becomes more crucial with multi-core
processors
 Aggregate peak bandwidth grows with number of cores
 Intel Core i7 can generate 2 references per core per clock
 With 4 cores & a 3.2 GHz clock:
25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references/second
= 409.6 GB/s
 DRAM bandwidth is only 6% of this (25 GB/s)
 Requires
 Multi-port, pipelined caches
 2 levels of cache per core
 Shared third-level cache on chip
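A quick back-of-the-envelope check of these figures, as a short Python sketch (the core count, clock rate, & reference widths are those stated above; variable names are only for illustration):

cores = 4
clock_hz = 3.2e9
data_refs = cores * clock_hz * 2      # 2 x 64-bit data references per core per clock
instr_refs = cores * clock_hz * 1     # 1 x 128-bit instruction reference per core per clock
peak_gb_per_s = (data_refs * 8 + instr_refs * 16) / 1e9
print(peak_gb_per_s)                  # 409.6 GB/s
print(25 / peak_gb_per_s)             # DRAM at 25 GB/s covers only ~6% of the demand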
Core i7 Die & Major Components
Source: Intel
Performance vs. Power
 High-end microprocessors have >10 MB on-chip
cache
 Consumes large amount of chip area & power budget
 Leakage (static) current – drawn even when the cache is not operating
 Active (dynamic) current – drawn when operating
 Major limiting factor for processors used in
mobile devices
Definitions – Blocks
 Multiple blocks are moved between levels in the
hierarchy
 Spatial locality improves efficiency
 Blocks are tagged with their memory address
 Tags are searched in parallel
Source: http://archive.arstechnica.com/paedia/c/caching/m-caching-5.html
Definitions – Associativity
 Defines where blocks can be placed in a cache
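To make the placement rule concrete, a small Python sketch of how a block address maps onto a set in a set-associative cache (the cache size, block size, & associativity below are made-up illustrative values):

CACHE_SIZE = 32 * 1024          # 32 KB cache (illustrative)
BLOCK_SIZE = 64                 # 64-byte blocks
ASSOCIATIVITY = 4               # 4-way set associative

num_sets = (CACHE_SIZE // BLOCK_SIZE) // ASSOCIATIVITY

def placement(address):
    block_addr = address // BLOCK_SIZE
    set_index = block_addr % num_sets    # the one set this block may occupy
    tag = block_addr // num_sets         # stored to identify the block within the set
    return set_index, tag

# Direct-mapped is the special case ASSOCIATIVITY = 1 (one candidate slot);
# fully associative means a block can go anywhere, so there is no set index.
print(placement(0x12345678))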
Pentium 4 vs. Opteron Memory Hierarchy

CPU               | Pentium 4 (3.2 GHz)                                   | Opteron (2.8 GHz)
Instruction Cache | Trace cache (8K micro-ops)                            | 2-way associative, 64 KB, 64B block
Data Cache        | 8-way associative, 16 KB, 64B block, inclusive in L2  | 2-way associative, 64 KB, 64B block, exclusive to L2
L2 Cache          | 8-way associative, 2 MB, 128B block                   | 16-way associative, 1 MB, 64B block
Prefetch          | 8 streams to L2                                       | 1 stream to L2
Memory            | 200 MHz x 64 bits                                     | 200 MHz x 128 bits
Definitions – Updating Cache
 Write-through
 Update cache block & all other levels below
 Use write buffers to speed up
 Write-back
 Update cache block
 Update lower level when replacing cached block
 Use write buffers to speed up
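A minimal sketch of when the lower level gets updated under each policy (the single cached block & the dict standing in for memory are purely illustrative):

memory = {0x40: 0}                                  # stand-in for the next lower level
cache = {"addr": 0x40, "data": 0, "dirty": False}   # one cached block

def write(addr, value, policy):
    cache["data"] = value
    if policy == "write-through":
        memory[addr] = value        # lower level updated on every write
    else:                           # write-back: mark dirty, update lower level later
        cache["dirty"] = True

def replace_block(policy):
    if policy == "write-back" and cache["dirty"]:
        memory[cache["addr"]] = cache["data"]   # write back only when the block is replaced

write(0x40, 7, "write-back")
print(memory[0x40])     # 0 – memory is stale until the block is replaced
replace_block("write-back")
print(memory[0x40])     # 7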
Definitions – Replacing Cached Blocks
 Cache replacement policies
 Random
 Least Recently Used (LRU)
 Need to track last access time
 Least Frequently Used (LFU)
 Need to track number of accesses
 First In First Out (FIFO)
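To show the bookkeeping LRU needs, a minimal sketch of one cache set in Python (the number of ways & the tag sequence are made up for illustration):

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()         # tag -> data, least recently used first

    def access(self, tag):
        if tag in self.blocks:              # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:   # miss in a full set: evict the LRU tag
            self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])   # ['miss', 'miss', 'hit', 'miss', 'miss']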
Definitions – Cache Misses
 When a required item is not found in the cache
 Miss rate – fraction of cache accesses that result
in a miss
 Types of misses
 Compulsory – 1st access to a block
 Capacity – limited cache capacity forces blocks to be
evicted from the cache & later retrieved
 Conflict – if the placement strategy is not fully associative
 Average memory access time
= Hit time + Miss rate x Miss penalty
Definitions – Cache Misses (Cont.)
 Memory stall cycles
= Instruction count x Memory accesses per instruction
x Miss rate x Miss penalty
 Memory accesses per instruction
= Instruction memory accesses per instruction + Data memory
accesses per instruction
 Example
 50% of instructions are loads & stores, the miss rate is 2%, &
the miss penalty is 25 clock cycles. Suppose CPI is 1. How much
faster would the machine be if all instructions were cache hits?
 Memory accesses per instruction = 1 + 0.5 = 1.5, so stall cycles
per instruction = 1.5 x 0.02 x 25 = 0.75
 Speedup = [IC x (1 + 0.75) x CC] / [IC x 1 x CC] = 1.75
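The same arithmetic as a short Python check (the numbers are those of the example; the 1-cycle hit time used for AMAT at the end is an added assumption):

cpi_base = 1.0
mem_accesses_per_instr = 1.0 + 0.5    # instruction fetch + 50% loads/stores
miss_rate = 0.02
miss_penalty = 25

stalls_per_instr = mem_accesses_per_instr * miss_rate * miss_penalty   # 0.75
print((cpi_base + stalls_per_instr) / cpi_base)                        # 1.75x speedup if all hits

hit_time = 1                                  # assumed 1-cycle hit time
print(hit_time + miss_rate * miss_penalty)    # AMAT = 1.5 cycles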
Cache Performance Metrics
 Hit time
 Miss rate
 Miss penalty
 Cache bandwidth
 Power consumption
6 Basic Cache Optimization Techniques
1. Larger block sizes
 Reduce compulsory misses
 Increase capacity & conflict misses
 Increase miss penalty
 Choosing a correct block size is challenging
2. Larger total cache capacity to reduce miss rate
 Reduce misses
 Increase hit time
 Increase power consumption & cost
3. Higher number of cache levels
 Reduce overall memory access time
6 Basic Cache Optimization Techniques
(Cont.)
4. Higher associativity
 Reduce conflict misses
 Increase hit time
 Increase power consumption
5. Giving priority to read misses over writes
 Allow reads to check write buffer
 Reduce miss penalty
6. Avoiding address translation in cache indexing
 Virtual to physical address mapping
 Reduce hit time
10 Advanced Cache Optimization
Techniques
 5 categories
1. Reducing hit time
2. Increasing cache bandwidth
3. Reducing miss penalty
4. Reducing miss rate
5. Reducing miss penalty or miss rate via parallelism
Advanced Optimizations 1
 Small & simple 1st level caches
 Recently, L1 cache sizes have increased only slightly
or not at all
 Critical timing path in a cache hit
 addressing the tag memory, then
 comparing tags, then
 selecting the correct data item (way)
 Direct-mapped caches can overlap tag compare &
transmission of data
 Improve hit time
 Lower associativity reduces power because fewer
cache lines are accessed
L1 Size & Associativity – Access Time
L1 Size & Associativity – Energy
Advanced Optimizations 2
 Way Prediction
 Keep extra bits in the cache to predict the way (block
within the set) that the next access will use
 Improve hit time
 Mis-prediction increases hit time
 Prediction accuracy
 > 90% for 2-way
 > 80% for 4-way
 Instruction cache has better accuracy than Data cache
 First used on MIPS R10000 in mid-90s
 Used on ARM Cortex-A8
Advanced Optimizations 3
 Pipeline cache access
 Enable L1 cache access to be multiple cycles
 Examples
 Pentium – 1 cycle
 Pentium Pro to Pentium III – 2 cycles
 Pentium 4 to Core i7 – 4 cycles
 Improve bandwidth
 Makes it easier to increase associativity
 Increase hit time
 Increases branch mis-prediction penalty
Advanced Optimizations 4
 Nonblocking Caches
 Allow hits before previous
misses complete
 “Hit under miss”
 “Hit under multiple miss”
 L2 must support this
 In general, processors can
hide L1 miss penalty but
not L2 miss penalty
 Increase bandwidth
Advanced Optimizations 5
 Multibanked Caches
 Organize cache as independent banks to support
simultaneous access
 Examples
 ARM Cortex-A8 supports 1-4 banks for L2
 Intel i7 supports 4 banks for L1 & 8 banks for L2
 Interleave banks according to block address
 Increase bandwidth
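Sequential interleaving simply spreads consecutive block addresses across the banks; a one-function sketch (the bank count & block size are illustrative):

NUM_BANKS = 4
BLOCK_SIZE = 64

def bank_of(address):
    return (address // BLOCK_SIZE) % NUM_BANKS

# Consecutive blocks land in different banks, so they can be accessed in parallel.
print([bank_of(a) for a in range(0, 8 * BLOCK_SIZE, BLOCK_SIZE)])   # [0, 1, 2, 3, 0, 1, 2, 3]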
Advanced Optimizations 6
 Critical Word First, Early Restart
 Critical word first
 Request missed word from memory first
 Send it to processor as soon as it arrives
 Early restart
 Request words in normal order
 Send missed word to processor as soon as it arrives
 Reduce miss penalty
 Effectiveness depends on block size & likelihood of
another access to portion of the block that has not yet
been fetched
Advanced Optimizations 7 - 10
 Merging Write Buffer
 Reduce miss penalty
 Compiler Optimizations
 Examples
 Loop interchange – swap nested loops to access memory in
sequential order (see the sketch below)
 Blocking – instead of accessing entire rows or columns,
subdivide matrices into blocks that fit in the cache
 Reduce miss rate
 Hardware Prefetching
 Fetch 2 blocks on a miss – the requested block & the next
consecutive block
 Reduce miss penalty or miss rate
 Compiler Prefetching
 Reduce miss penalty or miss rate
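A small illustration of loop interchange on a row-major 2-D array (N is arbitrary; shown in Python for brevity, though the payoff is most visible in a compiled language such as C):

N = 512
x = [[0] * N for _ in range(N)]

# Before: inner loop walks down a column, so successive accesses are N elements apart
for j in range(N):
    for i in range(N):
        x[i][j] = 2 * x[i][j]

# After interchange: inner loop walks along a row, so accesses are sequential in memory
for i in range(N):
    for j in range(N):
        x[i][j] = 2 * x[i][j]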
Summary of Techniques
Memory Technologies
 Performance metrics
 Latency is the concern of caches
 Bandwidth is the concern of multiprocessors & I/O
 Access time
 Time between read request & when desired word arrives
 Cycle time
 Minimum time between unrelated requests to memory
 DRAM used for main memory
 SRAM used for cache
Memory Technology (Cont.)
 Amdahl’s rule of thumb
 Memory capacity should grow linearly with processor
speed
 Unfortunately, memory capacity & speed haven’t kept
pace with processors
 Some optimizations
 Multiple accesses to same row
 Synchronous DRAM (SDRAM)
 Added clock to DRAM interface
 Burst mode with critical word first
 Wider interfaces
 Double data rate (DDR)
 Multiple banks on each DRAM device
DRAM Optimizations
Peak MB/sec = Clock rate (MHz) x 2 (transfers per clock with DDR) x 8 bytes (64-bit bus)
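For example, a DDR3-1600 DIMM has an 800 MHz bus clock, which gives the familiar module rating (a small check, assuming the usual 64-bit DIMM interface):

clock_mhz = 800                      # DDR3-1600: 800 MHz bus clock, 2 transfers per clock
mb_per_sec = clock_mhz * 2 * 8       # x 8 bytes for the 64-bit interface
print(mb_per_sec)                    # 12800 MB/s, i.e. the PC3-12800 rating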
DRAM Power Consumption
 Reducing power in DRAMs
 Lower voltage
 Low power mode (ignores clock, continues to refresh)
Flash Memory
 Type of EEPROM
 Must be erased (in blocks) before being
overwritten
 Non volatile
 Limited number of write cycles
 Cheaper than DRAM, more expensive than disk
 Slower than SRAM, faster than disk
Modern Memory Hierarchy
Source: http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
Intel Optane Non-volatile Memory
Source: www.forbes.com/sites/tomcoughlin/2018/06/11/intel-optane-finally-on-dimms/#5792e114190b
Intel Optane (Cont.)
Source: www.anandtech.com/show/9541/intel-announces-optane-storage-brand-for-3d-xpoint-products
Virtual Memory
 Each process has its own address space
 Protection via virtual memory
 Keeps processes in their own memory space
 Role of architecture
 Provide user mode & supervisor mode
 Protect certain aspects of CPU state
 Provide mechanisms for switching between user
mode & supervisor mode
 Provide mechanisms to limit memory accesses
 Provide TLB to translate addresses
Paging Hardware With TLB
 Parallel search on the TLB
 Address translation of (p, d)
 If p is in the TLB (associative registers),
get the frame # out
 Otherwise get the frame # from the
page table in memory
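A minimal sketch of that lookup order in Python (the page size, TLB contents, & page table below are made up for illustration):

PAGE_SIZE = 4096
tlb = {2: 7}                     # page number -> frame number (small associative memory)
page_table = {1: 3, 2: 7, 5: 9}  # stand-in for the page table kept in memory

def translate(virtual_addr):
    p, d = divmod(virtual_addr, PAGE_SIZE)   # split into page number p & offset d
    if p in tlb:                             # TLB hit: fast path
        frame = tlb[p]
    else:                                    # TLB miss: consult the page table
        frame = page_table[p]
        tlb[p] = frame                       # cache the translation for next time
    return frame * PAGE_SIZE + d

print(hex(translate(2 * PAGE_SIZE + 0x10)))  # TLB hit  -> 0x7010
print(hex(translate(5 * PAGE_SIZE + 0x20)))  # TLB miss -> 0x9020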
Summary
 Caching techniques are continuing to evolve
 Multiple techniques are often combined
 Cache sizes are unlikely to increase significantly
 Better performance when programs are
optimized based on the cache architecture