Distributed memory architecture
Kinds of distributed memory architectures
• PC/Workstation Cluster
• Multicluster
• Massively Parallel Processor (MPP)
• …
• Grid
Dedicated processor architectures: static allocation of processes to processing nodes; possibly, dynamic reconfiguration.
Interprocess communication support
Executable version of a parallel application: a collection of communicating processes.
[Figure: structure of a processing node – Processor, Local memory, PCI-DMA chip and Switching unit, connected to/from the communication network]
Interprocess communication run-time support
[Figure: Source process and Destination process on distinct Processing Nodes, connected by a channel of type T; on each node the UC performs the message copy plus scheduling actions, through the Communication Network]
Distributed run-time support
Principles:
• Channel descriptor allocated in destination node Nj
• receive is executed locally by Destination process in Nj
• send call by Source process in Ni: delegated to destination node Nj
• Delegation consists of a firmware message from Ni to Nj via the communication network, containing:
FW_MSG = (header, channel identifier, message value, Source identifier)
• In Nj, this message is received by the network interface unit (UCj)
and transformed into an interrupt (for CPUj or KPj)
• The interrupt handler executes the send primitive locally, according
to a shared memory implementation
– and possibly returns an outcome to Ni (the Source) via the communication network; this
action can be avoided depending on the detailed implementation scheme (a sketch of the delegation follows below)
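As an illustration of the delegation step, the following C sketch shows one possible layout of FW_MSG and of the source-side routine that builds it and hands it to UCi. All names (fw_msg_t, uc_send_to_node, make_header, MSG_WORDS) and the fixed-size message layout are assumptions for illustration, not the actual run-time interface.

#include <stdint.h>
#include <string.h>

#define MSG_WORDS 8                       /* assumed fixed size of the channel type T */

typedef struct {
    uint32_t header;                      /* routing / flow-control header for the network */
    uint32_t channel_id;                  /* identifies the channel descriptor CHdest on Nj */
    uint32_t value[MSG_WORDS];            /* message value */
    uint32_t source_id;                   /* identity of the Source process */
} fw_msg_t;

/* assumed hooks of the run-time / firmware layer */
extern uint32_t make_header(int dest_node);
extern void uc_send_to_node(int dest_node, const fw_msg_t *m);  /* hand FW_MSG to UCi */

/* Build FW_MSG on Ni and delegate the send to the destination node Nj. */
void delegate_send(int dest_node, uint32_t channel_id,
                   const uint32_t value[MSG_WORDS], uint32_t source_id)
{
    fw_msg_t fw = { .header = make_header(dest_node),
                    .channel_id = channel_id,
                    .source_id = source_id };
    memcpy(fw.value, value, sizeof fw.value);
    uc_send_to_node(dest_node, &fw);      /* Ni -> UCi -> communication network -> UCj */
}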
Implementation
• Channel descriptor
– data structure CHsource allocated in Ni: contains information about the current
number of buffered messages and the sender_wait boolean
– data structure CHdest allocated in Nj: the “real” channel descriptor, with the usual
structure for a shared memory implementation
• Send
– verifies the asynchrony degree saturation and, if buffer_full, suspends the Source
process
– in any case, the interprocessor message FW_MSG is sent to UCi, then to Nj via
communication network
– local execution of send on Nj, without checking buffer_full; no outcome is returned
to Ni in this scheme
• Receive
– causes the updating of the number of buffered messages in CHsource (via an interprocessor
message to Ni)
– in Ni, sender_wait is checked: if true, the Source process is woken up (a sketch of these structures and actions follows below)
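The following C sketch illustrates one possible shape of CHsource and of the source-side send / receive-side notification just described. CHdest keeps the usual shared-memory structure and is not shown; field and function names (ch_source_t, rt_send, wake_up, etc.) are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

typedef struct {                   /* CHsource, allocated on node Ni */
    int      buffered_msgs;        /* current number of buffered messages */
    int      asynchrony_degree;    /* asynchrony degree of the channel */
    bool     sender_wait;          /* true if the Source process is suspended */
    int      dest_node;            /* node Nj where CHdest is allocated */
    uint32_t channel_id;
    uint32_t source_id;            /* identity of the Source process */
} ch_source_t;

/* assumed hooks: local scheduling and the delegation routine sketched above */
extern void suspend_current_process(void);
extern void wake_up(uint32_t process_id);
extern void delegate_send(int dest_node, uint32_t channel_id,
                          const uint32_t *value, uint32_t source_id);

/* send, executed by the Source process on Ni */
void rt_send(ch_source_t *chs, const uint32_t *value)
{
    if (chs->buffered_msgs == chs->asynchrony_degree) {  /* asynchrony degree saturated */
        chs->sender_wait = true;
        suspend_current_process();                       /* resumed by a later receive */
    }
    chs->buffered_msgs++;
    /* in any case, FW_MSG is delegated to Nj; the local send on Nj does not
       re-check buffer_full and no outcome is returned to Ni in this scheme */
    delegate_send(chs->dest_node, chs->channel_id, value, chs->source_id);
}

/* handler, executed on Ni, of the interprocessor message produced by a receive on Nj */
void on_receive_notification(ch_source_t *chs)
{
    chs->buffered_msgs--;          /* one buffered message has been consumed */
    if (chs->sender_wait) {        /* Source was suspended: wake it up */
        chs->sender_wait = false;
        wake_up(chs->source_id);
    }
}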
send implementation – source node
[Figure: source node and destination node, each with memory M (msg and CHs on the source node; CHd and vtg on the destination node), CPU, DMA bus and I/O bus, connected by the Communication Network]
On the source node Ni, the send verifies the asynchrony degree saturation and, if buffer_full, suspends the Source process.
send implementation – destination node
[Figure: same two-node scheme as in the previous slide]
The pipelined transmission continues in Nj: UCj copies FW_MSG, via DMA, directly into Nj's memory
(without any intermediate copy).
The running process (or KP) is interrupted by UCj; the interrupt message is the reference (capability) to
FW_MSG.
The running process (or KP) acquires FW_MSG, then CHdest and VTG, into its own addressing space. So,
the local send can be executed (without checking buffer_full).
Optimization: if UC is a KP, the additional copy of FW_MSG is saved (on-the-fly execution).
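A minimal C sketch of this destination-side handling, assuming the interrupt argument is a capability to FW_MSG and that acquire_* routines map objects into the handler's addressing space (all names are illustrative assumptions):

#include <stdint.h>

#define MSG_WORDS 8

typedef struct {                          /* FW_MSG, as sketched earlier */
    uint32_t header, channel_id, source_id;
    uint32_t value[MSG_WORDS];
} fw_msg_t;

typedef struct ch_dest ch_dest_t;         /* CHdest: usual shared-memory descriptor */

/* assumed hooks: dynamic acquisition via capabilities, and the shared-memory send */
extern fw_msg_t  *acquire_fw_msg(uintptr_t fw_msg_capability);
extern ch_dest_t *acquire_chdest(uint32_t channel_id);
extern void      *acquire_vtg(ch_dest_t *chd);   /* target variable of the channel */
extern void       local_send(ch_dest_t *chd, void *vtg, const fw_msg_t *m);

/* Interrupt handler executed by the running process (or by KP) on Nj. */
void fw_msg_interrupt_handler(uintptr_t fw_msg_capability)
{
    fw_msg_t  *fw  = acquire_fw_msg(fw_msg_capability);  /* FW_MSG already in Nj memory (DMA) */
    ch_dest_t *chd = acquire_chdest(fw->channel_id);
    void      *vtg = acquire_vtg(chd);

    /* execute the send locally, as in the shared-memory run-time;
       buffer_full was already checked on Ni, so it is not re-checked here */
    local_send(chd, vtg, fw);
}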
send implementation – memory-to-memory copy
[Figure: same two-node scheme as in the previous slides]
The memory-to-memory copy is possible provided that the communication network protocol is the primitive, firmware one.
If the IP protocol is adopted, several additional copies and administrative operations are
performed; the IP overhead becomes dominant, even compared to the network latency.
Implementation
• A key point for the local send execution on Nj is the addressing
space of the process executing the interrupt handler.
• In principle, any process should contain, in its addressing space, all possible channel
descriptors, all possible target variables and the process control blocks of all
processes allocated on Nj.
• In practice, static allocation of such objects is impossible.
• Solution: dynamic allocation by means of the Capability
mechanism (see the sketch below).
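A possible shape of the capability mechanism, sketched in C: the interrupt handler maps the needed object (CHdest, a VTG, or a PCB) into its own addressing space on demand, using the capability received with the interrupt. The capability layout and the map/unmap routines are assumptions for illustration.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t base;              /* physical location of the object (CHdest, VTG, PCB, ...) */
    size_t    size;              /* object size */
    uint32_t  rights;            /* access rights granted to the holder */
} capability_t;

/* assumed hooks: install / remove a temporary mapping (e.g. MMU / page-table entries)
   in the current addressing space */
extern void *map_into_address_space(const capability_t *cap);
extern void  unmap_from_address_space(void *vaddr, size_t size);

/* Dynamic acquisition: called by the interrupt handler instead of relying on a static
   allocation of all descriptors in every process. */
void *acquire_object(const capability_t *cap)
{
    return map_into_address_space(cap);
}

void release_object(void *vaddr, const capability_t *cap)
{
    unmap_from_address_space(vaddr, cap->size);
}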
Interprocess communication cost model
• Base latency: takes into account
– latency on Ni:
• operations on CHsource,
• formatting of FW_MSG and delegation to KPj or UCj
• operations in KPj or UCj
– network latency (depending on network kind and dimension, routing and flow
control strategies, link latency, link size, number of crossed units: SEE Shared
Memory Arch.)
– latency on Nj: latency of local send execution (SEE Shared Memory Arch.)
• Under-load latency:
– resolution of a client-server model (SEE Shared Memory Arch.), where the
destination node (thus, any node) is the server and the possible source nodes (thus,
any node) are the clients
– M/M/1 is a typical (worst-case) assumption
– parameter p: average number of nodes acting as clients, according to the structure of
the parallel program and to the process mapping strategies
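As a hedged summary of the above (the exact decomposition adopted in the course may differ):
Base latency:  Lcom ≈ L(Ni) + L(net) + L(Nj)
Under-load latency (M/M/1, server = destination node):  RQ ≈ Ts / (1 − ρ),  with ρ = λ · Ts
where Ts is the service time of the destination-node run-time support and λ the aggregate request rate generated by the p client nodes.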
Typical latencies
• The communication network is used with the primitive
firmware routing and flow-control protocol:
– similar result to the shared memory run-time, for systems realized in a rack:
Tsetup ≈ 10³ τ, Ttransm ≈ 10² τ
– otherwise, for long-distance networks, the transmission latency dominates, e.g.
Tsetup ≈ 10³ τ, Ttransm ≈ 10⁴ τ up to 10⁶ τ
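As a rough worked example (assuming the usual linear model Lcom ≈ Tsetup + L · Ttransm for a message of L words): with Tsetup ≈ 10³ τ and Ttransm ≈ 10² τ, a 64-word message costs about 10³ τ + 64 · 10² τ ≈ 7.4 · 10³ τ; on a long-distance network with Ttransm ≈ 10⁴ τ the same message costs about 6.4 · 10⁵ τ, i.e. the transmission term dominates.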
Exercises
1. Describe the interprocess communication run-time support
in detail, in particular the actions inside the source and
destination nodes.
2. Evaluate the interprocess communication latency in detail,
according to the implementation scheme of Exercise 1.
3. Study the interprocess communication run-time support for
clusters whose nodes are SMP or NUMA machines.