Distributed Memory Architectures


• Processing nodes are not able to share a physical memory space
– a node cannot address the memory of another node

• I/O is the only primitive mechanism for node cooperation


– cooperation by explicit value exchange
– possibly, shared memory can be emulated

[Figure: two processing nodes, each with memory modules M, CPUs, Bus DMA and Bus I/O, connected through a Communication Network. The I/O unit(s) dedicated to node interfacing (UC) act as Communication Unit, Network Interface Unit, or Network Card. Any architecture, e.g. a multiprocessor, can be adopted for the Processing Nodes.]

2
Kinds of distributed memory architectures
• PC/Workstation Cluster
• Multicluster
• Massively Parallel Processor (MPP)
• …
• Grid
• Data Center
• Server Farm
• Cloud
• …

Dedicated processor architectures: static allocation of processes to processing nodes; possibly, dynamic reconfiguration for load balancing or fault-tolerance reasons.

3
Interprocess communication support
Executable version of parallel application:
collection of communicating processes

[Figure: layered structure of the support. In each processing node, the run-time support of the communication primitives runs on top of the network communication protocols; interprocess communication channels connect the run-time supports of different nodes, internode communication channels connect the protocol layers, and all processing nodes are attached to the physical communication network.]

Run-time support exploits


• network communication protocols
• architectural features internal to processing nodes (notably, I/O mechanisms
via shared memory: DMA and/or Memory Mapped I/O)
4
Communication networks
• Simple cases of networks of computers: usual network architectures (LAN / MAN / WAN) with serial links and the standard IP protocol
• High-performance architectures: very local interconnection network (“switch”) according to the structures studied for Shared Memory Architectures:
– multistage Fat Tree, Generalized Fat Tree
– low-dimension cubes
– in the most powerful machines: wormhole flow control
• Typical link technologies:
– Fast Ethernet (100 Mb/s)
– Gigabit Ethernet (1 Gb/s)
– Myrinet (1.28 Gb/s)
– Infiniband (up to 10 Gb/s)
– optical technology; photonic networks are emerging (10 – 100 Gb/s)
5
Communication networks and communication processors
• Example: Myrinet
– KP included in the network, connected as an I/O unit to the processing node
– used for the interprocess communication run-time support and/or as Network Interface Unit (Network Card)

[Figure: Myrinet network card. A PCI-DMA chip (DMA controller and PCI Interface bridge), a local memory, the node processor acting as Communication Processor, and the network interface are connected on the card; the card is attached to the host through the PCI bus (32-64 bit) and, through the network interface, to/from a switching unit of the communication network.]
6
Interprocess communication run-time support
A Source process, allocated to a processing node Ni, and a Destination process, allocated to a processing node Nj, communicate through a channel of type T:

    send (channel_identifier, message_value)

    receive (channel_identifier, target_variable)

[Figure: nodes Ni and Nj (each with memory M, CPUs, Bus DMA, Bus I/O and interface unit UC), connected by the Communication Network; the send implementation involves a message copy plus scheduling actions on the destination node.]
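As a minimal sketch of the two primitives' interfaces (the C declarations and type names below are illustrative assumptions, not part of the slides):

    #include <stddef.h>

    /* Illustrative channel identifier: denotes a typed channel whose
       "real" descriptor is allocated on the destination node. */
    typedef unsigned int ch_id_t;

    /* Asynchronous send: may suspend the caller when the channel's
       asynchrony degree (buffering capacity) is saturated. */
    void send(ch_id_t channel_identifier, const void *message_value, size_t size);

    /* receive: copies the next message into the target variable,
       suspending the caller if no message is available. */
    void receive(ch_id_t channel_identifier, void *target_variable, size_t size);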
7
Distributed run-time support
Principles:
• Channel descriptor allocated in destination node Nj
• receive is executed locally by Destination process in Nj
• A send call by the Source process in Ni is delegated to the destination node Nj
• Delegation consists of a firmware message from Ni to Nj via the communication network, containing (a field-level sketch follows this list):
FW_MSG = (header, channel identifier, message value, Source identifier)
• In Nj, this message is received by the network interface unit (UCj) and transformed into an interrupt (for CPUj or KPj)
• The interrupt handler executes the send primitive locally, according to a shared memory implementation
– and possibly returns an outcome to Ni (Source) via the communication network; this action can be avoided depending on the detailed implementation scheme
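A minimal C sketch of the delegation message (the field widths and the 64-byte value area are illustrative assumptions; the slides fix only the logical content):

    #include <stdint.h>

    /* Illustrative layout of FW_MSG: (header, channel identifier,
       message value, Source identifier). Sizes are assumptions. */
    typedef struct {
        uint32_t header;             /* routing / flow-control information */
        uint32_t channel_id;         /* selects the channel descriptor in Nj */
        uint32_t source_id;          /* Source process identifier (for wake-up) */
        uint8_t  message_value[64];  /* copied message value; 64 is an assumed size */
    } fw_msg_t;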

8
Implementation
• Channel descriptor
– data structure CHsource allocated in Ni: contains information about the current number of buffered messages and the sender_wait boolean
– data structure CHdest allocated in Nj: the “real” channel descriptor, with the usual structure of a shared memory implementation
• Send
– verifies the asynchrony degree saturation and, if buffer_full, suspends the Source process
– in any case, the interprocessor message FW_MSG is sent to UCi, then to Nj via the communication network
– local execution of send on Nj, without checking buffer_full; no outcome is returned to Ni in this scheme
• Receive
– causes the updating of the number of buffered messages in CHsource (interprocessor message to Ni)
– in Ni, sender_wait is checked: if true, the Source process is woken up (see the sketch after this list).
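A hedged C sketch of the two descriptors (the names follow the slides; the layouts are illustrative):

    #include <stdbool.h>

    /* CHsource, allocated in Ni: the Source-side view of the channel. */
    typedef struct {
        int  buffered_msgs;      /* current number of buffered messages */
        int  asynchrony_degree;  /* buffer capacity: saturation threshold */
        bool sender_wait;        /* true if the Source is suspended on buffer_full */
    } ch_source_t;

    /* CHdest, allocated in Nj: the "real" channel descriptor, with the usual
       structure of a shared memory implementation (details omitted). */
    typedef struct {
        int   buffered_msgs;     /* messages currently queued in Nj */
        void *msg_fifo;          /* FIFO of buffered message values */
        void *dest_pcb;          /* PCB of a waiting Destination process, if any */
    } ch_dest_t;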

9
send implementation – source node
[Figure: node Ni, with the message msg and the descriptor CHsource in memory, and node Nj, with msg, the descriptor CHdest and the target variable vtg in memory, connected by the Communication Network.]

The source-node part of the send:
– verifies the asynchrony degree saturation and, if buffer_full, suspends the Source process;
– in any case, produces the interprocessor message FW_MSG = (header, channel identifier, message value, Source process identifier) and passes it to UCi by reference;
– UCi exploits DMA and transmits FW_MSG to UCj, via the network, directly in pipeline (flit by flit, without any intermediate copy in UCi).
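Putting the source-node steps together, building on the structures sketched above (suspend_current_process, current_process_id, make_fw_msg and uc_dma_transmit are hypothetical helpers, named here only for illustration):

    /* Source-node part of send, executed in Ni: a sketch, not the
       definitive implementation. */
    void send_source_side(ch_source_t *chs, ch_id_t ch,
                          const void *msg, size_t size)
    {
        /* Verify asynchrony degree saturation on the local view CHsource. */
        if (chs->buffered_msgs == chs->asynchrony_degree) {
            chs->sender_wait = true;
            suspend_current_process();   /* woken up by a later receive */
        }
        chs->buffered_msgs++;

        /* Produce FW_MSG and pass it to UCi by reference; UCi reads it via
           DMA and streams it to UCj flit by flit, with no intermediate copy. */
        fw_msg_t fw = make_fw_msg(ch, msg, size, current_process_id());
        uc_dma_transmit(&fw);
    }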

10
send implementation – destination node
[Figure: same scheme as the previous slide: Ni with msg and CHsource in memory, Nj with msg, CHdest and the target variable vtg in memory, connected by the Communication Network.]

The pipeline transmission is continued in Nj: UCj copies FW_MSG, via DMA, directly into Nj's memory (without any intermediate copy).
The running process (or KP) is interrupted by UCj; the interrupt message is the reference (capability) to FW_MSG.
The running process (or KP) acquires FW_MSG, then CHdest and VTG, into its own addressing space; then the local send can be executed (without checking buffer_full).
Optimization: if UC is KP, the additional copy of FW_MSG is saved (on-the-fly execution).
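A matching sketch of the destination-node handler, again building on the structures above (acquire_via_capability, copy_to_target_variable, wake_up and enqueue_message are hypothetical helpers; the capability mechanism itself is discussed two slides ahead):

    /* Interrupt handler in Nj: executes the send locally, as in the shared
       memory implementation, without checking buffer_full. */
    void fw_msg_handler(fw_msg_t *fw)   /* reference received as a capability */
    {
        /* Bring the needed objects into the handler's addressing space. */
        ch_dest_t *chd = acquire_via_capability(fw->channel_id);

        if (chd->dest_pcb != NULL) {
            /* Destination already waiting: copy the value into the target
               variable VTG and wake the Destination process up. */
            copy_to_target_variable(chd, fw->message_value);
            wake_up(chd->dest_pcb);
            chd->dest_pcb = NULL;
        } else {
            enqueue_message(chd, fw->message_value);  /* buffer in CHdest */
            chd->buffered_msgs++;
        }
        /* No outcome is returned to Ni in this scheme. */
    }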
11
send implementation – memory-to-memory copy

[Figure: the overall effect is a direct copy from the message area msg in Ni's memory (descriptor CHsource) to the target variable vtg in Nj's memory (descriptor CHdest), through the Communication Network.]

In practice, even in a distributed memory architecture, a memory-to-memory copy can be implemented (plus additional operations for low level scheduling), provided that the communication network protocol is the primitive, firmware one.

If the IP protocol is adopted, several additional copies and administrative operations are done: the IP overhead dominates, even compared to the network latency.
12
Implementation
• A key point for the local send execution on Nj is the addressing space of the process executing the interrupt handler.
• In principle, any process should contain all possible channel descriptors, all possible target variables, and the process control blocks of all processes allocated on Nj.
• In practice, such a static allocation of objects is impossible.
• Solution: dynamic allocation by means of the Capability mechanism, sketched below.
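A heavily hedged illustration of the idea (this capability API is invented for the sketch; the slides only state that the objects are brought dynamically into the addressing space):

    /* A capability: an unforgeable reference that lets its holder map the
       referenced object (channel descriptor, target variable, PCB) into
       its own addressing space on demand. */
    typedef struct {
        unsigned long object_id;  /* which object is referenced */
        unsigned long rights;     /* operations permitted on it */
    } capability_t;

    /* Map the referenced object into the current addressing space and
       return a usable pointer; unmap it when the handler is done. */
    void *capability_map(capability_t cap);
    void  capability_unmap(capability_t cap);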

13
Interprocess communication cost model
• Base latency: takes into account
– latency on Ni:
• operations on CHsource,
• formatting of FW_MSG and delegation to KPj or UCj
• operations in KPj or UCj
– network latency (depending on network kind and dimension, routing and flow
control strategies, link latency, link size, number of crossed units: SEE Shared
Memory Arch.)
– latency on Nj: latency of local send execution (SEE Shared Memory Arch.)
• Under-load latency:
– resolution of a client-server model (SEE Shared Memory Arch.), where the destination node (thus, any node) is the server and the possible source nodes (thus, any node) are the clients
– M/M/1 is a typical (worst-case) assumption; see the sketch after this list
– parameter p: average number of nodes acting as clients, according to the structure of the parallel program and to the process mapping strategies
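As an illustrative summary in the slides' notation (the symbols λ and Ts are assumptions consistent with the client-server model of the Shared Memory part, not symbols defined on this slide):

    Lbase ≈ (latency on Ni) + (network latency) + (latency on Nj)

    ρ = λ · Ts                 utilization of the destination node, with λ the
                               aggregate request rate of the p client nodes and
                               Ts the service time of the local send execution

    WQ = (ρ / (1 − ρ)) · Ts    M/M/1 waiting time

    Lunder-load ≈ Lbase + WQ

The waiting term diverges as ρ → 1, i.e. as the destination node saturates; moreover, the exponential-service assumption of M/M/1 over-estimates the waiting time with respect to, e.g., M/D/1, which is why it is taken as a worst case.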

14
Typical latencies
• The communication network is used with the primitive firmware routing and flow-control protocol:
– result similar to the shared memory run-time, for systems realized in a rack:
Tsetup ≈ 10^3 τ, Ttransm ≈ 10^2 τ
– otherwise, for long distance networks, the transmission latency dominates, e.g.
Tsetup ≈ 10^3 τ, Ttransm ≈ 10^4 τ up to 10^6 τ

• The communication network is used with the IP protocol, i.e., the application is IP-dependent:
– the network is exploited in the primitive way; however, an additional overhead is paid due to the protocol actions (e.g., formatting, de-formatting) inside the nodes (plus transmission overhead on long distance networks):
Rack: Tsetup ≈ 10^5 τ, Ttransm ≈ 10^4 τ
Long distance: Tsetup ≈ 10^7 τ, Ttransm ≈ 10^8 τ
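A worked comparison, assuming the linear cost model Lcom = Tsetup + L · Ttransm and an illustrative message length of L = 1024 words (both are assumptions, for illustration only):

    Firmware, rack:  Lcom ≈ 10^3 τ + 1024 · 10^2 τ ≈ 1.0 · 10^5 τ
    IP, rack:        Lcom ≈ 10^5 τ + 1024 · 10^4 τ ≈ 1.0 · 10^7 τ

Under these assumptions the IP protocol costs about two orders of magnitude more for the same message, consistent with the earlier observation that the IP overhead dominates even the network latency.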

15
Exercises
1. Describe the interprocess communication run-time support in detail, in particular the actions inside the source and destination nodes.
2. Evaluate the interprocess communication latency in detail, according to the implementation scheme of Exercise 1.
3. Study the interprocess communication run-time support for clusters whose nodes are SMP or NUMA machines.

16
