Lec3 - Cache and Memory System
Sang-Woo Jun
Spring, 2019
Watch Video!
CPU Caches and Why You Care – Scott Meyers
o His books are great!
Some History
80386 (1985):
Last Intel desktop CPU with no on-chip cache
(Optional on-board cache chip though!)
Source: Extreme tech, “How L1 and L2 CPU Caches Work, and Why They’re an Essential Part of Modern Chips,” 2018
Memory System Architecture
[Diagram: two CPU packages, each core with private L1 I$ and L1 D$ and a private L2 $, a shared L3 $ per package, DRAM attached to each package, and a QPI/UPI link between the packages]
Memory System Bandwidth Snapshot
o Cache bandwidth estimate: 64 bytes/cycle ~= 200 GB/s per core
o DDR4 2666 MHz DRAM: 128 GB/s
o QPI/UPI interconnect between packages
Memory/PCIe controller used to be on a separate “North bridge” chip, now integrated on-die
All sorts of things are now on-die! Even network controllers! (Specialization!)
Cache Architecture Details
Numbers from modern Xeon processors (Broadwell – Kaby lake)
Cache Level | Size            | Latency (cycles)
L1          | 64 KiB          | < 5
L2          | 256 KiB         | < 20
L3          | ~2 MiB per core | < 50
* DRAM subsystems are complicated entities themselves, and latency/bandwidth of the same module varies by situation…
Cache Lines Recap
Caches are managed in Cache Line granularity
o Typically 64 Bytes for modern CPUs
o 64 Bytes == 16 4-byte integers
Reading/Writing happens in cache line granularity
o Read one byte not in cache -> Read all 64 bytes from memory
o Write one byte -> Eventually write all 64 bytes to memory
o Inefficient cache access patterns really hurt performance!
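As an illustrative sketch (function names and the 1024 x 1024 size are assumptions, not from the lecture), the two loops below compute the same sum but use cache lines very differently: the row-major walk gets 16 integers per 64-byte line fetch, while the column-major walk touches a new line on every access.

```c
#include <stddef.h>

#define N 1024
static int m[N][N];

/* Row-major traversal: consecutive accesses fall in the same
   64-byte cache line, so one line fetch serves 16 ints. */
long sum_row_major(void) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: each access lands N*4 bytes away from
   the previous one, touching a different cache line every time. */
long sum_col_major(void) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Both return the same value; only the memory access pattern (and therefore the cache behavior) differs.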
Cache Line Effects Example
Multiplying two 2048 x 2048 matrices
o 16 MiB, doesn’t fit in any cache
Machine: Intel i5-7400 @ 3.00GHz
Time to transpose B is also counted
[Diagram: A × B (column-wise access of B) vs A × BT (B transposed first, row-wise access)]
A × B: 63.19 seconds vs A × BT: 10.39 seconds
(6x performance!)
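A minimal sketch of the two variants being compared (function names and the use of double are assumptions, not the lecture's code). The naive version walks B down a column in its inner loop; the second version pays for one transpose so that both inner-loop operands are read sequentially.

```c
#include <stdlib.h>

/* Naive C = A * B (row-major n x n). The inner loop reads
   B[k*n + j] down a column: a new cache line on every step. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0;
            for (int k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];
            C[i*n + j] = acc;
        }
}

/* Transpose B first; then the inner loop reads both A and BT
   sequentially, so one 64-byte line serves eight doubles. */
void matmul_transposed(int n, const double *A, const double *B, double *C) {
    double *BT = malloc((size_t)n * n * sizeof *BT);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            BT[j*n + i] = B[i*n + j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0;
            for (int k = 0; k < n; k++)
                acc += A[i*n + k] * BT[j*n + k];
            C[i*n + j] = acc;
        }
    free(BT);
}
```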
Cache Prefetching
CPU speculatively prefetches cache lines
o While CPU is working on the loaded 64 bytes, 64 more bytes are being loaded
Hardware prefetcher is usually not very complex/smart
o Sequential prefetching (N lines forward or backwards)
o Strided prefetching
Programmer-provided prefetch hints
o __builtin_prefetch(address, r/w, temporal_locality); for GCC
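A hedged sketch of how the GCC hint might be used. The prefetch distance of 16 elements is an illustrative guess that would need tuning, and for a plain sequential scan like this the hardware prefetcher usually already does the job; the hint matters more for irregular access patterns.

```c
/* Sum an array while hinting the cache to load data 16 elements
   ahead of the current position. The second argument (0) marks a
   read, the third (1) means low temporal locality. */
long sum_with_prefetch(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}
```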
Cache Coherence Recap
We won’t go into architectural details
Simply put:
o When a core writes a cache line
o All other copies of that cache line need to be invalidated
Emphasis on cache line: even writes to different variables conflict if they happen to share a line (“false sharing”)
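One common consequence is false sharing. The sketch below (illustrative struct names, assuming 64-byte cache lines) shows the layout fix: padding each independently-updated counter out to its own line so that one thread's writes never invalidate the line holding the other thread's counter.

```c
#include <stdalign.h>
#include <stddef.h>

/* Two per-thread counters packed into one 64-byte line: every
   increment by one core invalidates the other core's copy of the
   line, even though the counters are logically independent. */
struct packed_counters {
    long a;   /* updated by thread 0 */
    long b;   /* updated by thread 1 -- same cache line as a! */
};

/* Aligning each counter to a 64-byte boundary puts them on
   separate cache lines, so the writes no longer conflict. */
struct padded_counters {
    alignas(64) long a;   /* occupies its own cache line */
    alignas(64) long b;   /* starts at offset 64 */
};
```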
Issue #1: Capacity Considerations – Matrix Multiply
Performance is best when working set fits into cache
o But as shown, even 2048 x 2048 doesn’t fit in cache
o -> 2048 * 2048 * 2048 elements read from memory for matrix B
Solution: Divide and conquer! – Blocked matrix multiply
o For block size 32 × 32 -> 2048 * 2048 * (2048/32) reads
[Diagram: blocked multiply – block row A1 of A is combined with blocks B1, B2, B3, … of BT to produce block C1 of C]
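A sketch of the blocked scheme, under the assumptions that n is a multiple of the block size and C starts zeroed (function name and the choice of double are illustrative). Each 32 × 32 tile of B is only 8 KiB, so it stays cache-resident while being reused across a whole block row of A; B is then read from memory only n/32 times instead of n times.

```c
#define BLK 32  /* 32*32 doubles = 8 KiB per tile; a few tiles fit in L1/L2 */

/* Blocked C += A * B for row-major n x n matrices, n divisible by BLK.
   The three outer loops pick a tile; the inner loops reuse the
   cache-resident tiles of A and B before moving on. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLK)
        for (int jj = 0; jj < n; jj += BLK)
            for (int kk = 0; kk < n; kk += BLK)
                for (int i = ii; i < ii + BLK; i++)
                    for (int k = kk; k < kk + BLK; k++) {
                        double a = A[i*n + k];
                        for (int j = jj; j < jj + BLK; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```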