
Lec3 - Cache and Memory System


CS 295: Modern Systems

Cache And Memory System

Sang-Woo Jun
Spring, 2019
Watch Video!
CPU Caches and Why You Care – Scott Meyers
o His books are great!
Some History

 80386 (1985) :
Last Intel desktop CPU with no on-chip cache
(Optional on-board cache chip though!)

 80486 (1989) : 4 KB on-chip cache

 Coffee Lake (2017) :


64 KiB L1 Per core
256 KiB L2 Per core
Up to 2 MiB L3 Per core (Shared)

(Figure from the source below. What is the Y-axis? Most likely a normalized latency reciprocal.)

Source: ExtremeTech, “How L1 and L2 CPU Caches Work, and Why They’re an Essential Part of Modern Chips,” 2018
Memory System Architecture
(Diagram: two packages connected by QPI/UPI. Each package contains multiple cores; every core has a private L1 I$ and L1 D$ and a private L2 $; the cores in a package share an L3 $; each package has its own DRAM.)
Memory System Bandwidth Snapshot
Cache bandwidth estimate: 64 bytes/cycle ~= 200 GB/s per core
DRAM (DDR4 2666 MHz): 128 GB/s
Ultra Path Interconnect (QPI/UPI): 20.8 GB/s unidirectional
Memory/PCIe controller used to be on a separate “North bridge” chip, now integrated on-die
All sorts of things are now on-die! Even network controllers! (Specialization!)
Cache Architecture Details
Numbers from modern Xeon processors (Broadwell – Kaby Lake)

Cache Level   Size              Latency (cycles)
L1            64 KiB            < 5
L2            256 KiB           < 20
L3            ~2 MiB per core   < 50
DRAM          100s of GB        > 100*

L1 cache is typically divided into separate instruction and data caches
o Each with 32 KiB
L1 cache accesses can be hidden in the pipeline
o All others take a performance hit

* DRAM subsystems are complicated entities themselves, and latency/bandwidth of the same module varies by situation…
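One way to see where these latency steps come from is to time a dependent pointer chase over working sets of increasing size: each load depends on the previous one, so the hardware cannot overlap or prefetch them. A minimal sketch (my own illustration, not from the lecture; the sizes and iteration count are arbitrary choices):

```c
// latency_chase.c -- rough per-access latency vs. working-set size (sketch)
// Build: gcc -O2 latency_chase.c -o latency_chase
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    // Working sets chosen to land roughly in L1, L2, L3, and DRAM
    size_t sizes[] = {16 << 10, 128 << 10, 1 << 20, 64 << 20};
    for (int s = 0; s < 4; s++) {
        size_t n = sizes[s] / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        // Sattolo's algorithm: build one big random cycle so the chase visits
        // every slot and the access pattern defeats the hardware prefetcher
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        const long iters = 50 * 1000 * 1000;
        size_t idx = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) idx = next[idx];   // dependent loads
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KiB: %5.1f ns/access (x=%zu)\n",
               sizes[s] >> 10, ns / iters, idx);  // print idx so the loop isn't optimized out
        free(next);
    }
    return 0;
}
```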
Cache Lines Recap
Caches are managed at cache-line granularity
o Typically 64 bytes for modern CPUs
o 64 bytes == 16 4-byte integers
Reading and writing happen at cache-line granularity
o Read one byte not in cache -> Read all 64 bytes from memory
o Write one byte -> Eventually write all 64 bytes to memory
o Inefficient cache access patterns really hurt performance!
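Because a read pulls in the whole 64-byte line, touching one int out of every 16 costs almost as many memory transfers as touching all of them. A minimal sketch of that effect (my own illustration, not course code):

```c
// cacheline_stride.c -- one int per cache line costs nearly as much as all 16 (sketch)
// Build: gcc -O2 cacheline_stride.c -o cacheline_stride
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   // 256 MiB of ints: far larger than any cache

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    for (long i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    double t0 = seconds();
    for (long i = 0; i < N; i++) sum += a[i];        // touch every int
    double t1 = seconds();
    for (long i = 0; i < N; i += 16) sum += a[i];    // touch 1 int per 64-byte line
    double t2 = seconds();

    // The second loop does 1/16th of the additions but loads the same number of
    // cache lines from DRAM, so it is nowhere near 16x faster.
    printf("stride 1:  %.3f s\nstride 16: %.3f s\n(sum=%ld)\n", t1 - t0, t2 - t1, sum);
    free(a);
    return 0;
}
```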
Cache Line Effects Example
Multiplying two 2048 x 2048 matrices
o 16 MiB, doesn’t fit in any cache
Machine: Intel i5-7400 @ 3.00GHz
Time to transpose B is also counted
A × B   vs.   A × B^T (B is transposed first, then multiplied row by row)

63.19 seconds   vs.   10.39 seconds (6x performance!)
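A minimal sketch of the two variants being compared (my reconstruction, not the course's code; it assumes 4-byte float matrices stored in row-major order). The only difference is whether the inner loop walks a column of B or a row of the pre-transposed BT:

```c
// matmul_transpose.c -- naive C = A*B vs. multiplying against a pre-transposed B (sketch)
// Build: gcc -O2 -c matmul_transpose.c  (call from your own timing harness on 2048x2048 inputs)
#include <stdlib.h>

#define N 2048

// Naive: B is walked down a column, so consecutive accesses are N*4 bytes apart
// and each one lands on a different cache line.
void multiply_naive(const float *A, const float *B, float *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];   // column walk of B
            C[i * N + j] = sum;
        }
}

// Transposed: BT[j][k] == B[k][j], so both A and BT are read sequentially and
// every loaded 64-byte cache line contributes 16 useful floats.
void multiply_transposed(const float *A, const float *B, float *C) {
    float *BT = malloc((size_t)N * N * sizeof(float));
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            BT[j * N + k] = B[k * N + j];             // transpose cost is counted too
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * BT[j * N + k];  // row walk of BT
            C[i * N + j] = sum;
        }
    free(BT);
}
```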
Cache Prefetching
CPU speculatively prefetches cache lines
o While CPU is working on the loaded 64 bytes, 64 more bytes are being loaded
Hardware prefetcher is usually not very complex/smart
o Sequential prefetching (N lines forward or backwards)
o Strided prefetching
Programmer-provided prefetch hints
o __builtin_prefetch(address, r/w, temporal_locality); for GCC
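As a hedged illustration of the GCC hint (the gather pattern, prefetch distance, and function name below are my own, not from the slides): the hardware prefetcher cannot predict loads made through an index array, so a software hint a few iterations ahead can help.

```c
// prefetch_gather.c -- manual prefetch hints for an irregular (gather) access pattern (sketch)
// __builtin_prefetch(addr, rw, locality) is GCC/Clang-specific:
//   rw: 0 = prefetch for read, 1 = for write; locality: 0 (no reuse) .. 3 (high reuse)
#include <stddef.h>

#define PREFETCH_DISTANCE 8   // how far ahead to prefetch; needs tuning per machine

long gather_sum(const long *data, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            // Hint: we will read data[idx[i + 8]] soon; start fetching its cache line
            // now so the load is (hopefully) already resident when the loop gets there.
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        sum += data[idx[i]];
    }
    return sum;
}
```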
Cache Coherence Recap
We won’t go into architectural details
Simply put:
o When a core writes a cache line
o All other copies of that cache line need to be invalidated
Emphasis on cache line: the whole line is invalidated, not just the bytes that were written
Issue #1:
Capacity Considerations: Matrix Multiply
Performance is best when working set fits into cache
o But as shown, even 2048 x 2048 doesn’t fit in cache
o -> 2048 * 2048 * 2048 elements read from memory for matrix B
Solution: Divide and conquer! – Blocked matrix multiply
o For block size 32 × 32 -> 2048 * 2048 * (2048/32) reads
(Diagram: A × B^T = C, divided into sub-matrices A1, B1, B2, B3, …, C1, …)

C1 sub-matrix = A1×B1 + A1×B2 + A1×B3 + …
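A minimal sketch of the blocked variant (my reconstruction, under the same row-major float layout as the earlier sketch; roughly what the "Blocked Transposed" row below measures). Each 32×32 tile is about 4 KiB, so the tiles being reused stay resident in cache while they are needed:

```c
// matmul_blocked.c -- blocked (tiled) version of the transposed multiply (sketch)
#define N 2048
#define BLOCK 32   // 32*32*4 B = 4 KiB per tile; a few tiles fit comfortably in L1/L2

// Computes C += A * B^T tile by tile. C is assumed zero-initialized, because the
// partial products of the kk tiles are accumulated into it. BT is B already
// transposed, i.e. BT[j][k] == B[k][j].
void multiply_blocked(const float *A, const float *BT, float *C) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                // One BLOCK x BLOCK tile of C, using tiles of A and BT that are
                // reused from cache instead of being re-read from memory.
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        float sum = 0.0f;
                        for (int k = kk; k < kk + BLOCK; k++)
                            sum += A[i * N + k] * BT[j * N + k];
                        C[i * N + j] += sum;
                    }
}
```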


Blocked Matrix Multiply Evaluations
Benchmark                  Elapsed (s)   Normalized Performance   Bottleneck
Naïve                        63.19          1.00
Transposed                   10.39          6.08
Blocked Transposed            7.35          8.60                  Computation
AVX Transposed                2.20         28.72                  Memory
Blocked AVX Transposed        1.50         42.13                  Computation
8 Thread AVX Transposed       1.09         57.97                  Memory (not scaling!)
AVX Transposed reads from DRAM at 14.55 GB/s
o 2048^3 elements × 4 bytes = 32 GiB; 32 GiB / 2.20 s ≈ 14.55 GB/s
o 1× DDR4 2400 MHz channel on this machine -> 18.75 GB/s peak
o Pretty close, considering DRAM is also used for other things (OS, etc.)
The 8-thread version gets 32 GB/s of effective bandwidth (more than the DRAM peak)
o Cache effects with small chunks
Issue #2:
False Sharing
Different memory locations, written to by different cores, mapped to the same cache line
o Core 1 performing “results[0]++;”
o Core 2 performing “results[1]++;”
Remember cache coherence
o Every time a cache line is written to, all other copies need to be invalidated!
o The cache line holding “results” is ping-ponged between the cores on every write
o Bad when it happens on-chip, terrible over processor interconnect (QPI/UPI)
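A minimal sketch of the effect (my own illustration; padding each counter out to its own 64-byte line is the standard fix, but the exact speedup depends on the machine):

```c
// false_sharing.c -- adjacent counters vs. counters padded to separate cache lines (sketch)
// Build: gcc -O2 -pthread false_sharing.c -o false_sharing
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

// Shared case: results[0] and results[1] live in the same 64-byte line, so every
// increment by one core invalidates the other core's copy of the line.
volatile long results[2] __attribute__((aligned(64)));

// Padded case: each counter occupies its own 64-byte line, so the cores never interfere.
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded padded_results[2] __attribute__((aligned(64)));

void *bump_shared(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) results[id]++;               // line ping-pongs
    return NULL;
}

void *bump_padded(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) padded_results[id].value++;  // no sharing
    return NULL;
}

int main(void) {
    pthread_t t[2];
    // Time the two phases separately (e.g. with clock_gettime) to see the gap.
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, bump_shared, (void *)id);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, bump_padded, (void *)id);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("%ld %ld %ld %ld\n", results[0], results[1],
           padded_results[0].value, padded_results[1].value);
    return 0;
}
```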
Remember the case in the Scott Meyers video
Results From Scott Meyers Video Again
Issue #3
Instruction Cache Effects
Instruction cache misses can affect performance
o “Linux was routing packets at ~30 Mbps [wired], and wireless at ~20. Windows CE
was crawling at barely 12 Mbps wired and 6 Mbps wireless. […] After we changed
the routing algorithm to be more cache-local, we started doing 35 Mbps [wired],
and 25 Mbps wireless – 20% better than Linux.”
– Sergey Solyanik, Microsoft
o [By organizing function calls in a cache-friendly way, we] achieved a 34% reduction
in instruction cache misses and a 5% improvement in overall performance.
– Mircea Livadariu and Amir Kleen, Freescale
Improving Instruction Cache Locality #1
Unfortunately, there is not much we can do explicitly… but we can:
Be careful with loop unrolling and inlining
o They reduce branching overhead, but also reduce the effective I$ size
o When gcc's -O3 performs slower than -O2, this is usually what's happening
o Inlining is typically good for very small* functions
Move conditionals to the front as much as possible (see the sketch after this list)
o Long branch-free paths are a good fit for the instruction cache/prefetcher
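One common reading of "move conditionals to the front" is to hoist decisions out of hot loops so the body becomes a long, branch-free, prefetcher-friendly run of instructions. A hedged sketch with hypothetical functions (not from the lecture):

```c
// Before: the flag is re-tested every iteration, and both branch bodies sit in the hot loop.
void scale_mixed(float *a, long n, int clamp, float s) {
    for (long i = 0; i < n; i++) {
        if (clamp) a[i] = a[i] * s > 1.0f ? 1.0f : a[i] * s;
        else       a[i] = a[i] * s;
    }
}

// After: decide once up front; each loop body is straight-line code that the
// instruction cache and prefetcher handle well.
void scale_split(float *a, long n, int clamp, float s) {
    if (clamp) {
        for (long i = 0; i < n; i++) a[i] = a[i] * s > 1.0f ? 1.0f : a[i] * s;
    } else {
        for (long i = 0; i < n; i++) a[i] = a[i] * s;
    }
}
```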
Improving Instruction Cache Locality #2
Organize function calls to create temporal locality (a sketch of the idea follows the references below)

(Figure: three variants compared: the sequential algorithm; the ordering changed for cache locality; a balance chosen to reduce memory footprint.)
Livadariu et al., “Optimizing for instruction caches,” EETimes
https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
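A hedged sketch of the reordering idea from the Livadariu/Kleen article (the packet type, stage functions, and batch size are hypothetical, not taken from the article): instead of running every stage on each item in turn, run one stage over a whole batch so that stage's instructions stay hot in the I$.

```c
// icache_batching.c -- reordering function calls for instruction-cache locality (sketch)
#define BATCH 256

struct packet;                         // hypothetical data type
void stage1(struct packet *p);         // hypothetical processing stages, each with a
void stage2(struct packet *p);         // large enough code footprint that all three
void stage3(struct packet *p);         // together overflow the instruction cache

// Sequential: each packet runs all three stages back to back, so the I$ keeps
// evicting one stage's code to make room for the next.
void process_sequential(struct packet *pkts[], int n) {
    for (int i = 0; i < n; i++) {
        stage1(pkts[i]);
        stage2(pkts[i]);
        stage3(pkts[i]);
    }
}

// Reordered: run one stage over a whole batch before moving on, so that stage's
// instructions stay resident in the I$ while they are reused.
void process_batched(struct packet *pkts[], int n) {
    for (int base = 0; base < n; base += BATCH) {
        int end = (base + BATCH < n) ? base + BATCH : n;
        for (int i = base; i < end; i++) stage1(pkts[i]);
        for (int i = base; i < end; i++) stage2(pkts[i]);
        for (int i = base; i < end; i++) stage3(pkts[i]);
    }
}
```

The batch size trades instruction-cache reuse against the extra data-cache traffic of revisiting each packet, which is presumably the "balance to reduce memory footprint" variant in the figure above.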
