Module 4
Introduction to
Multiprocessors
UMA vs. NUMA
Computers
● UMA (Symmetric Multiprocessors, SMPs):
– Processors P1…Pn, each with a private cache, share a single bus to main memory.
– Memory access latency = 100s of ns, uniform for every processor.
● NUMA:
– Each processor–cache node has its own main memory; the nodes are connected by a network.
– Remote access latency = several milliseconds to seconds.
[Figure: bus-based SMP organization vs. network-connected NUMA organization.]
Cache Organizations for
Multicores
● L1 caches are always private to a core
● L2 caches can be private or shared
– which is better?
[Figure: four cores P1–P4, each with a private L1 cache; left — a private L2 per core; right — a single L2 shared by all four cores.]
L2 Organizations
● Advantages of a shared L2 cache:
– Efficient dynamic use of the space by each core.
– Data shared by multiple cores is not replicated.
– Every block has a fixed “home” – hence, it is easy to find the latest copy.
● Advantages of a private L2 cache:
– Quick access to the private L2.
– A private bus to the private L2 means less contention.
An Important Problem with
Shared-Memory: Coherence
● When shared data are cached:
– The data are replicated in multiple caches.
– The copies in the caches of different processors may become inconsistent.
● How do we enforce cache coherency?
– How does a processor learn of changes in the caches of other processors?
The Cache Coherency
Problem
[Figure: three processors P1–P3 share memory location U, whose initial value is 5. P1 and P3 read U and cache it; P3 then writes U := 7 in its own cache; subsequent reads by P1 and P2 can still return the stale value U = 5.]
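The stale-read scenario in the figure above can be reproduced with a tiny sketch (names and structure are illustrative only): each “cache” is just a private dict copy, and no coherence protocol runs between them.

```python
# Illustrative sketch: three cores cache location U with NO coherence protocol.
memory = {"U": 5}

# Each core's private cache starts with a copy read from memory.
cache = {p: dict(memory) for p in ("P1", "P2", "P3")}

# P3 writes U := 7 into its own cache (write-back, no invalidation sent).
cache["P3"]["U"] = 7

# P1 and P2 still read the stale value from their own caches,
# and main memory is stale as well.
stale = [cache[p]["U"] for p in ("P1", "P2")]
print(stale, cache["P3"]["U"], memory["U"])   # -> [5, 5] 7 5
```

The inconsistency is exactly what a coherence protocol must prevent: the write by P3 is invisible to P1, P2, and memory.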
Cache Coherence Solutions
(Protocols)
● The key to maintaining cache coherence:
– Track the sharing state of every data block.
● Based on this idea, an overall solution is:
– Dynamically recognize any potential inconsistency at run-time and carry out preventive action.
Basic Idea Behind Cache
Coherency Protocols
[Figure: bus-based shared-memory system — processors P, an I/O module, and the main memory system all attached to a common bus.]
Pros and Cons of the
Solution
● Pro:
– Consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system.
● Con:
– Increased hardware complexity.
Two Important Cache
Coherency Protocols
● Snooping protocol:
– Each cache “snoops” the bus to find out which data is being used by whom.
● Directory-based protocol:
– Keep track of the sharing state of each data block using a directory.
– A directory is a centralized record of the sharing state of all memory blocks.
– Allows the coherency protocol to avoid broadcasts.
Snooping vs. Directory-based Protocols
● Snooping protocol reduces memory traffic:
– More efficient.
● Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when there is a shared bus.
– Even when there is a shared bus, scalability is a problem.
– Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.
Snooping Protocol
● As soon as a request for any data block by a processor is put out on the bus:
– Other processors “snoop” to check if they have a copy and respond accordingly.
● Works well with a bus interconnection:
– All transmissions on a bus are essentially broadcasts:
● Snooping is therefore effortless.
– Dominates almost all small-scale machines.
Categories of Snoopy
Protocols
● Essentially two types:
– Write invalidate protocol
– Write broadcast protocol
● Write invalidate protocol:
– When one processor writes to its cache, all other processors having a copy of that data block invalidate that block.
● Write broadcast (write update) protocol:
– When one processor writes to its cache, all other processors having a copy of that data block update that block with the newly written value.
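The difference between the two policies can be sketched as follows (an illustrative model only: caches are dicts, and bus snooping is modeled as a loop over the other caches).

```python
# Contrast of the two snoopy write policies on a write to shared data.

def write_invalidate(caches, writer, addr, value):
    """Writer updates its copy; every other cache invalidates (drops) its copy."""
    caches[writer][addr] = value
    for p, c in caches.items():
        if p != writer:
            c.pop(addr, None)          # snooped invalidate: copy removed

def write_update(caches, writer, addr, value):
    """Writer updates its copy; every other cache holding the block updates it."""
    caches[writer][addr] = value
    for p, c in caches.items():
        if p != writer and addr in c:
            c[addr] = value            # snooped update: copy refreshed

demo = {"P1": {"x": 5}, "P2": {"x": 5}}
write_invalidate(demo, "P1", "x", 7)
print(demo)                            # P2 no longer holds x

caches = {"P1": {"x": 5}, "P2": {"x": 5}}
write_update(caches, "P1", "x", 7)
print(caches)                          # P2's copy is updated to 7
```

Invalidate trades a later read miss in P2 for less bus traffic per write; update keeps all copies current at the cost of broadcasting every written value.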
Write Invalidate Protocol
● Handling a write to shared data:
– An invalidate command is sent on the bus --- all caches snoop and invalidate any copies they have.
● Handling a read miss:
– Write-through: memory is always up-to-date.
– Write-back: snooping finds the most recent copy.
Write Invalidate in Write
Through Caches
● Simple implementation.
● Writes:
– Write to shared data: an invalidate is broadcast on the bus; processors snoop and invalidate any copies they have.
– Read miss: memory is always up-to-date.
● Concurrent writes:
– Write serialization is automatically achieved, since the bus serializes requests.
– The bus provides the basic arbitration support.
Write Invalidate versus Broadcast cont…
[State diagram fragment (snoopy cache, CPU requests): CPU write — place write miss on bus → Exclusive (read/write); CPU read hit and CPU write hit remain in Exclusive; CPU write miss — write back cache block and place write miss on bus.]
Snoopy-Cache State
Machine-II
● State machine considering only bus requests for each cache block.
[State diagram:
– Shared (read-only) — write miss for this block: → Invalid.
– Exclusive (read/write) — write miss for this block: write back block (abort memory access), → Invalid; read miss for this block: write back block (abort memory access), → Shared.]
Combined Snoopy-Cache State Machine
● State machine considering both CPU requests and bus requests for each cache block.
[State diagram:
– Invalid — CPU read: place read miss on bus, → Shared; CPU write: place write miss on bus, → Exclusive.
– Shared (read-only) — CPU read hit: stay; CPU read miss: place read miss on bus, stay in Shared; CPU write: place write miss on bus, → Exclusive; write miss for this block (bus): → Invalid.
– Exclusive (read/write) — CPU read hit and CPU write hit: stay; CPU read miss: write back block, place read miss on bus, → Shared; CPU write miss: write back cache block, place write miss on bus, stay in Exclusive; read miss for this block (bus): write back block (abort memory access), → Shared; write miss for this block (bus): write back block (abort memory access), → Invalid.]
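The combined transitions above can be written down as a small table-driven sketch (a model of the slide’s MSI-style diagram, not of any specific machine; event names are illustrative).

```python
# Table-driven sketch of the combined snoopy-cache state machine
# for a single cache block: (state, event) -> (next state, bus actions).
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

TRANSITIONS = {
    (INVALID,   "cpu_read"):       (SHARED,    ["place read miss on bus"]),
    (INVALID,   "cpu_write"):      (EXCLUSIVE, ["place write miss on bus"]),
    (SHARED,    "cpu_read_hit"):   (SHARED,    []),
    (SHARED,    "cpu_read_miss"):  (SHARED,    ["place read miss on bus"]),
    (SHARED,    "cpu_write"):      (EXCLUSIVE, ["place write miss on bus"]),
    (SHARED,    "bus_write_miss"): (INVALID,   []),
    (EXCLUSIVE, "cpu_read_hit"):   (EXCLUSIVE, []),
    (EXCLUSIVE, "cpu_write_hit"):  (EXCLUSIVE, []),
    (EXCLUSIVE, "cpu_read_miss"):  (SHARED,    ["write back block",
                                                "place read miss on bus"]),
    (EXCLUSIVE, "cpu_write_miss"): (EXCLUSIVE, ["write back block",
                                                "place write miss on bus"]),
    (EXCLUSIVE, "bus_read_miss"):  (SHARED,    ["write back block (abort memory access)"]),
    (EXCLUSIVE, "bus_write_miss"): (INVALID,   ["write back block (abort memory access)"]),
}

def next_state(state, event):
    """Return (new_state, bus_actions) for one cache block."""
    return TRANSITIONS[(state, event)]

print(next_state(INVALID, "cpu_write"))   # -> ('Exclusive', ['place write miss on bus'])
```

Encoding the diagram as a dictionary makes each transition individually checkable, which is handy when verifying a protocol against its state diagram.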
Directory-based Solution
● In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all messages have explicit responses.
● The main memory controller keeps track of:
– Which processors hold cached copies of which memory locations.
● On a write:
– Only the sharers need to be informed, not everyone.
● On a dirty read:
– Forward the request to the owner.
Directory Protocol
● Three states, as in the snoopy protocol:
– Shared: 1 or more processors have the data; memory is up-to-date.
– Uncached: no processor has the block.
– Exclusive: 1 processor (the owner) has the block.
● In addition to the cache state:
– Must track which processors have the data when in the shared state.
– Usually implemented using a bit vector: bit p is 1 if processor p has a copy.
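The bit-vector representation of the sharing set can be sketched as follows (a minimal model; the 4-processor system and helper names are assumptions for illustration).

```python
# Illustrative bit-vector directory entry for one memory block
# in an assumed 4-processor system (processor ids 0..3).
NPROCS = 4

def add_sharer(vec, p):
    """Set bit p: processor p now holds a copy of the block."""
    return vec | (1 << p)

def sharers(vec):
    """List the processor ids whose bit is set in the vector."""
    return [p for p in range(NPROCS) if vec & (1 << p)]

vec = 0                            # Uncached: no bits set
vec = add_sharer(vec, 0)           # P0 caches the block
vec = add_sharer(vec, 2)           # P2 caches the block
print(bin(vec), sharers(vec))      # -> 0b101 [0, 2]
```

On a write, the directory need only send invalidates to the processors listed by `sharers(vec)`, which is how the protocol avoids broadcasts.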
Directory Behavior
● On a read:
– Uncached:
● give an (exclusive) copy to the requester
● record the owner
– Exclusive or shared:
● send a share message to the current exclusive owner
● record the owner
● return the value
– Exclusive dirty:
● forward the read request to the exclusive owner.
Directory Behavior
● On a write:
– Send invalidate messages to all hosts caching the value.
● On write-through/write-back:
– Update the value.
CPU-Cache State Machine
● State machine for CPU requests for each memory block.
● Invalid (uncached) state if the block is only in memory.
[State diagram:
– Invalid — CPU read: send read-miss message to home directory, → Shared (read-only); CPU write: send write-miss message to home directory, → Exclusive (read/write).
– Shared — CPU read hit: stay; invalidate or miss due to address conflict: → Invalid; CPU write: send write-miss message to home directory, → Exclusive.
– Exclusive — CPU read hit and CPU write hit: stay; Fetch/Invalidate or miss due to address conflict: send data write-back message to home directory, → Invalid; Fetch: send data write-back message to home directory, → Shared.]
State Transition Diagram
for the Directory
● Tracks all copies of the memory block.
● Same states as the transition diagram for an individual cache.
● Memory controller actions:
– Update of the directory state.
– Send messages to satisfy requests.
– Also indicates an action that updates the sharing set, Sharers, as well as sending a message.
Directory State Machine
● State machine for directory requests for each memory block.
● Uncached state if the block is only in memory.
[State diagram:
– Uncached — read miss: Sharers = {P}; send Data Value Reply, → Shared (read only); write miss: Sharers = {P}; send Data Value Reply msg, → Exclusive (read/write).
– Shared — read miss: Sharers += {P}; send Data Value Reply, stay in Shared; write miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply msg, → Exclusive.
– Exclusive — Data Write Back: Sharers = {}, → Uncached; read miss: Sharers += {P}; send Fetch to remote cache; send Data Value Reply msg (write back block), → Shared; write miss: Sharers = {P}; send Fetch/Invalidate to remote cache; send Data Value Reply msg (write back block), stay in Exclusive.]
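The directory transitions above can likewise be sketched as one function (a model of the slide’s diagram; state, event, and message names are illustrative, with Sharers as a set of processor ids).

```python
# Sketch of the directory controller for one memory block:
# (state, Sharers, event from processor p) -> (state, Sharers, messages).
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

def directory_step(state, sharers, event, p):
    if state == UNCACHED and event == "read_miss":
        return SHARED, {p}, ["Data Value Reply"]
    if state == UNCACHED and event == "write_miss":
        return EXCLUSIVE, {p}, ["Data Value Reply"]
    if state == SHARED and event == "read_miss":
        return SHARED, sharers | {p}, ["Data Value Reply"]
    if state == SHARED and event == "write_miss":
        # Invalidate all current sharers before granting ownership to p.
        return EXCLUSIVE, {p}, ["Invalidate to sharers", "Data Value Reply"]
    if state == EXCLUSIVE and event == "data_write_back":
        return UNCACHED, set(), []
    if state == EXCLUSIVE and event == "read_miss":
        # Fetch the dirty copy from the owner, then share it with p.
        return SHARED, sharers | {p}, ["Fetch to owner", "Data Value Reply"]
    if state == EXCLUSIVE and event == "write_miss":
        # Fetch/invalidate the old owner's copy; p becomes the new owner.
        return EXCLUSIVE, {p}, ["Fetch/Invalidate to owner", "Data Value Reply"]
    raise ValueError((state, event))

print(directory_step(UNCACHED, set(), "read_miss", 1))
```

Note that every message goes point-to-point to the recorded sharers or owner, never as a broadcast, which is the scalability advantage over snooping.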