
Lec3 - Cache and Memory System


CS 295: Modern Systems

Cache And Memory System

Sang-Woo Jun
Spring, 2019
Watch Video!
CPU Caches and Why You Care – Scott Meyers
o His books are great!
Some History

 80386 (1985) :
Last Intel desktop CPU with no on-chip cache
(Optional on-board cache chip though!)

 80486 (1989) : 4 KB on-chip cache

 Coffee Lake (2017) :


64 KiB L1 Per core
256 KiB L2 Per core
Up to 2 MiB L3 Per core (Shared)

(Figure from the source below. What is the Y-axis? Most likely a normalized latency reciprocal.)

Source: ExtremeTech, “How L1 and L2 CPU Caches Work, and Why They’re an Essential Part of Modern Chips,” 2018
Memory System Architecture
(Diagram: two packages connected by QPI/UPI. Each package contains multiple cores; every core has a private L1 I$ and L1 D$ and a private L2 $; the cores in a package share an L3 $; each package has its own DRAM.)
Memory System Bandwidth Snapshot
Cache bandwidth estimate: 64 bytes/cycle ~= 200 GB/s per core
DRAM (DDR4 2666 MHz): 128 GB/s
Ultra Path Interconnect (QPI/UPI): 20.8 GB/s unidirectional
Memory/PCIe controller used to be on a separate “North bridge” chip, now integrated on-die
All sorts of things are now on-die! Even network controllers! (Specialization!)
Cache Architecture Details
Numbers from modern Xeon processors (Broadwell – Kaby Lake)

Cache Level   Size              Latency (cycles)
L1            64 KiB            < 5
L2            256 KiB           < 20
L3            ~2 MiB per core   < 50
DRAM          100s of GB        > 100*

L1 cache is typically divided into separate instruction and data caches
o Each with 32 KiB
L1 cache accesses can be hidden in the pipeline
o All others take a performance hit

* DRAM subsystems are complicated entities themselves, and latency/bandwidth of the same module varies by situation…
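One way to see where these latency steps come from is to time a dependent pointer chase over working sets of increasing size: each load depends on the previous one, so the hardware cannot overlap or prefetch them. A minimal sketch (my own illustration, not from the lecture; the sizes and iteration count are arbitrary choices):

```c
// latency_chase.c -- rough per-access latency vs. working-set size (sketch)
// Build: gcc -O2 latency_chase.c -o latency_chase
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    // Working sets chosen to land roughly in L1, L2, L3, and DRAM
    size_t sizes[] = {16 << 10, 128 << 10, 1 << 20, 64 << 20};
    for (int s = 0; s < 4; s++) {
        size_t n = sizes[s] / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        // Sattolo's algorithm: build one big random cycle so the chase visits
        // every slot and the access pattern defeats the hardware prefetcher
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        const long iters = 50 * 1000 * 1000;
        size_t idx = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) idx = next[idx];   // dependent loads
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KiB: %5.1f ns/access (x=%zu)\n",
               sizes[s] >> 10, ns / iters, idx);  // print idx so the loop isn't optimized out
        free(next);
    }
    return 0;
}
```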
Cache Lines Recap
Caches are managed at cache-line granularity
o Typically 64 bytes for modern CPUs
o 64 bytes == 16 4-byte integers
Reading and writing happen at cache-line granularity
o Read one byte not in cache -> Read all 64 bytes from memory
o Write one byte -> Eventually write all 64 bytes to memory
o Inefficient cache access patterns really hurt performance!
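Because a read pulls in the whole 64-byte line, touching one int out of every 16 costs almost as many memory transfers as touching all of them. A minimal sketch of that effect (my own illustration, not course code):

```c
// cacheline_stride.c -- one int per cache line costs nearly as much as all 16 (sketch)
// Build: gcc -O2 cacheline_stride.c -o cacheline_stride
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   // 256 MiB of ints: far larger than any cache

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    for (long i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    double t0 = seconds();
    for (long i = 0; i < N; i++) sum += a[i];        // touch every int
    double t1 = seconds();
    for (long i = 0; i < N; i += 16) sum += a[i];    // touch 1 int per 64-byte line
    double t2 = seconds();

    // The second loop does 1/16th of the additions but loads the same number of
    // cache lines from DRAM, so it is nowhere near 16x faster.
    printf("stride 1:  %.3f s\nstride 16: %.3f s\n(sum=%ld)\n", t1 - t0, t2 - t1, sum);
    free(a);
    return 0;
}
```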
Cache Line Effects Example
Multiplying two 2048 x 2048 matrices
o 16 MiB, doesn’t fit in any cache
Machine: Intel i5-7400 @ 3.00GHz
Time to transpose B is also counted
A × B   vs.   A × B^T (B is transposed first, then multiplied row by row)

63.19 seconds   vs.   10.39 seconds (6x performance!)
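A minimal sketch of the two variants being compared (my reconstruction, not the course's code; it assumes 4-byte float matrices stored in row-major order). The only difference is whether the inner loop walks a column of B or a row of the pre-transposed BT:

```c
// matmul_transpose.c -- naive C = A*B vs. multiplying against a pre-transposed B (sketch)
// Build: gcc -O2 -c matmul_transpose.c  (call from your own timing harness on 2048x2048 inputs)
#include <stdlib.h>

#define N 2048

// Naive: B is walked down a column, so consecutive accesses are N*4 bytes apart
// and each one lands on a different cache line.
void multiply_naive(const float *A, const float *B, float *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];   // column walk of B
            C[i * N + j] = sum;
        }
}

// Transposed: BT[j][k] == B[k][j], so both A and BT are read sequentially and
// every loaded 64-byte cache line contributes 16 useful floats.
void multiply_transposed(const float *A, const float *B, float *C) {
    float *BT = malloc((size_t)N * N * sizeof(float));
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            BT[j * N + k] = B[k * N + j];             // transpose cost is counted too
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * BT[j * N + k];  // row walk of BT
            C[i * N + j] = sum;
        }
    free(BT);
}
```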
Cache Prefetching
CPU speculatively prefetches cache lines
o While CPU is working on the loaded 64 bytes, 64 more bytes are being loaded
Hardware prefetcher is usually not very complex/smart
o Sequential prefetching (N lines forward or backwards)
o Strided prefetching
Programmer-provided prefetch hints
o __builtin_prefetch(address, r/w, temporal_locality); for GCC
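As a hedged illustration of the GCC hint (the gather pattern, prefetch distance, and function name below are my own, not from the slides): the hardware prefetcher cannot predict loads made through an index array, so a software hint a few iterations ahead can help.

```c
// prefetch_gather.c -- manual prefetch hints for an irregular (gather) access pattern (sketch)
// __builtin_prefetch(addr, rw, locality) is GCC/Clang-specific:
//   rw: 0 = prefetch for read, 1 = for write; locality: 0 (no reuse) .. 3 (high reuse)
#include <stddef.h>

#define PREFETCH_DISTANCE 8   // how far ahead to prefetch; needs tuning per machine

long gather_sum(const long *data, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            // Hint: we will read data[idx[i + 8]] soon; start fetching its cache line
            // now so the load is (hopefully) already resident when the loop gets there.
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        sum += data[idx[i]];
    }
    return sum;
}
```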
Cache Coherence Recap
We won’t go into architectural details
Simply put:
o When a core writes a cache line
o All other copies of that cache line need to be invalidated
Emphasis on cache line: the whole line is invalidated, not just the bytes that were written
Issue #1:
Capacity Considerations: Matrix Multiply
Performance is best when working set fits into cache
o But as shown, even 2048 x 2048 doesn’t fit in cache
o -> 2048 * 2048 * 2048 elements read from memory for matrix B
Solution: Divide and conquer! – Blocked matrix multiply
o For block size 32 × 32 -> 2048 * 2048 * (2048/32) reads
(Diagram: A × B^T = C, divided into sub-matrices A1, B1, B2, B3, …, C1, …)

C1 sub-matrix = A1×B1 + A1×B2 + A1×B3 + …
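A minimal sketch of the blocked variant (my reconstruction, under the same row-major float layout as the earlier sketch; roughly what the "Blocked Transposed" row below measures). Each 32×32 tile is about 4 KiB, so the tiles being reused stay resident in cache while they are needed:

```c
// matmul_blocked.c -- blocked (tiled) version of the transposed multiply (sketch)
#define N 2048
#define BLOCK 32   // 32*32*4 B = 4 KiB per tile; a few tiles fit comfortably in L1/L2

// Computes C += A * B^T tile by tile. C is assumed zero-initialized, because the
// partial products of the kk tiles are accumulated into it. BT is B already
// transposed, i.e. BT[j][k] == B[k][j].
void multiply_blocked(const float *A, const float *BT, float *C) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                // One BLOCK x BLOCK tile of C, using tiles of A and BT that are
                // reused from cache instead of being re-read from memory.
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        float sum = 0.0f;
                        for (int k = kk; k < kk + BLOCK; k++)
                            sum += A[i * N + k] * BT[j * N + k];
                        C[i * N + j] += sum;
                    }
}
```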


Blocked Matrix Multiply Evaluations
Benchmark                  Elapsed (s)   Normalized Performance   Bottleneck
Naïve                        63.19          1.00
Transposed                   10.39          6.08
Blocked Transposed            7.35          8.60                  Computation
AVX Transposed                2.20         28.72                  Memory
Blocked AVX Transposed        1.50         42.13                  Computation
8 Thread AVX Transposed       1.09         57.97                  Memory (not scaling!)
AVX Transposed reads from DRAM at 14.55 GB/s
o 2048^3 elements × 4 bytes = 32 GiB; 32 GiB / 2.20 s ≈ 14.55 GB/s
o 1× DDR4 2400 MHz channel on this machine -> 18.75 GB/s peak
o Pretty close, considering DRAM is also used for other things (OS, etc.)
The 8-thread version gets 32 GB/s of effective bandwidth (more than the DRAM peak)
o Cache effects with small chunks
Issue #2:
False Sharing
Different memory locations, written to by different cores, mapped to the same cache line
o Core 1 performing “results[0]++;”
o Core 2 performing “results[1]++;”
Remember cache coherence
o Every time a cache line is written to, all other copies need to be invalidated!
o The cache line holding “results” is ping-ponged between the cores on every write
o Bad when it happens on-chip, terrible over processor interconnect (QPI/UPI)
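A minimal sketch of the effect (my own illustration; padding each counter out to its own 64-byte line is the standard fix, but the exact speedup depends on the machine):

```c
// false_sharing.c -- adjacent counters vs. counters padded to separate cache lines (sketch)
// Build: gcc -O2 -pthread false_sharing.c -o false_sharing
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

// Shared case: results[0] and results[1] live in the same 64-byte line, so every
// increment by one core invalidates the other core's copy of the line.
volatile long results[2] __attribute__((aligned(64)));

// Padded case: each counter occupies its own 64-byte line, so the cores never interfere.
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded padded_results[2] __attribute__((aligned(64)));

void *bump_shared(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) results[id]++;               // line ping-pongs
    return NULL;
}

void *bump_padded(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) padded_results[id].value++;  // no sharing
    return NULL;
}

int main(void) {
    pthread_t t[2];
    // Time the two phases separately (e.g. with clock_gettime) to see the gap.
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, bump_shared, (void *)id);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, bump_padded, (void *)id);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("%ld %ld %ld %ld\n", results[0], results[1],
           padded_results[0].value, padded_results[1].value);
    return 0;
}
```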
Remember the case in the Scott Meyers video
Results From Scott Meyers Video Again
Issue #3
Instruction Cache Effects
Instruction cache misses can affect performance
o “Linux was routing packets at ~30 Mbps [wired], and wireless at ~20. Windows CE
was crawling at barely 12 Mbps wired and 6 Mbps wireless. […] After we changed
the routing algorithm to be more cache-local, we started doing 35 Mbps [wired],
and 25 Mbps wireless – 20% better than Linux.”
– Sergey Solyanik, Microsoft
o [By organizing function calls in a cache-friendly way, we] achieved a 34% reduction
in instruction cache misses and a 5% improvement in overall performance.
– Mircea Livadariu and Amir Kleen, Freescale
Improving Instruction Cache Locality #1
Unfortunately, there is not much we can do explicitly… but we can:
Be careful with loop unrolling and inlining
o They reduce branching overhead, but also reduce the effective I$ size
o When gcc's -O3 performs slower than -O2, this is usually what's happening
o Inlining is typically good for very small* functions
Move conditionals to the front as much as possible (see the sketch after this list)
o Long branch-free paths are a good fit for the instruction cache/prefetcher
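One common reading of "move conditionals to the front" is to hoist decisions out of hot loops so the body becomes a long, branch-free, prefetcher-friendly run of instructions. A hedged sketch with hypothetical functions (not from the lecture):

```c
// Before: the flag is re-tested every iteration, and both branch bodies sit in the hot loop.
void scale_mixed(float *a, long n, int clamp, float s) {
    for (long i = 0; i < n; i++) {
        if (clamp) a[i] = a[i] * s > 1.0f ? 1.0f : a[i] * s;
        else       a[i] = a[i] * s;
    }
}

// After: decide once up front; each loop body is straight-line code that the
// instruction cache and prefetcher handle well.
void scale_split(float *a, long n, int clamp, float s) {
    if (clamp) {
        for (long i = 0; i < n; i++) a[i] = a[i] * s > 1.0f ? 1.0f : a[i] * s;
    } else {
        for (long i = 0; i < n; i++) a[i] = a[i] * s;
    }
}
```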
Improving Instruction Cache Locality #2
Organize function calls to create temporal locality (a sketch of the idea follows the references below)

(Figure: three variants compared: the sequential algorithm; the ordering changed for cache locality; a balance chosen to reduce memory footprint.)
Livadariu et al., “Optimizing for instruction caches,” EETimes
https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
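A hedged sketch of the reordering idea from the Livadariu/Kleen article (the packet type, stage functions, and batch size are hypothetical, not taken from the article): instead of running every stage on each item in turn, run one stage over a whole batch so that stage's instructions stay hot in the I$.

```c
// icache_batching.c -- reordering function calls for instruction-cache locality (sketch)
#define BATCH 256

struct packet;                         // hypothetical data type
void stage1(struct packet *p);         // hypothetical processing stages, each with a
void stage2(struct packet *p);         // large enough code footprint that all three
void stage3(struct packet *p);         // together overflow the instruction cache

// Sequential: each packet runs all three stages back to back, so the I$ keeps
// evicting one stage's code to make room for the next.
void process_sequential(struct packet *pkts[], int n) {
    for (int i = 0; i < n; i++) {
        stage1(pkts[i]);
        stage2(pkts[i]);
        stage3(pkts[i]);
    }
}

// Reordered: run one stage over a whole batch before moving on, so that stage's
// instructions stay resident in the I$ while they are reused.
void process_batched(struct packet *pkts[], int n) {
    for (int base = 0; base < n; base += BATCH) {
        int end = (base + BATCH < n) ? base + BATCH : n;
        for (int i = base; i < end; i++) stage1(pkts[i]);
        for (int i = base; i < end; i++) stage2(pkts[i]);
        for (int i = base; i < end; i++) stage3(pkts[i]);
    }
}
```

The batch size trades instruction-cache reuse against the extra data-cache traffic of revisiting each packet, which is presumably the "balance to reduce memory footprint" variant in the figure above.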
