NIFDY: A Low Overhead, High Throughput Network Interface

Timothy Callahan and Seth Copen Goldstein
Computer Science Division
University of California–Berkeley
{timothyc, sethg}@cs.berkeley.edu

Abstract

In this paper we present NIFDY, a network interface that uses admission control to reduce congestion and ensures that packets are received by a processor in the order in which they were sent, even if the underlying network delivers the packets out of order. We present results from simulations of a variety of networks (meshes, tori, butterflies, and fat trees) and traffic patterns to verify NIFDY's efficacy. Our simulations show that NIFDY increases throughput and decreases overhead. The utility of NIFDY increases as a network's bisection bandwidth decreases. When combined with the increased payload allowed by in-order delivery, NIFDY increases total bandwidth delivered for all networks. The resources needed to implement NIFDY are small and constant with respect to network size.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ISCA '95, Santa Margherita Ligure, Italy
© 1995 ACM 0-89791-698-0/95/0006...$3.50

1 Introduction

An efficient interconnection network is essential for high-speed parallel computing. Although processor speeds and raw network performance have increased dramatically, network interfaces have not efficiently integrated these resources. Thus system performance has not kept pace with the performance of the individual components. In this paper we present a network interface that more closely matches modern network characteristics while remaining independent of the network design.

Interconnection networks deliver maximum performance when the offered load is limited to a fraction of the maximum bandwidth. We call this the operating range of the network. Many people have observed that when the offered load exceeds the operating range, throughput falls off dramatically [Jac88, Jai90, SS89, RJ90, KS91, Aga91, BK94]. Researchers have investigated this problem for both WAN and MPP networks. The WAN solutions are based on deep networks with long messages and generally use software protocols at the end points [Jac88, RJ90, KMCL93, SBB+91]. Most MPP networks, which are shallower and have shorter messages, either ignore the issue or control congestion in the network fabric itself [CBLK94, Dal91, LAD+92, Dal90]. In this paper we propose a network interface called NIFDY—Network Interface with Flow-control and in-order DeliverY—which adapts the WAN-style solutions to MPP networks. In short, NIFDY performs admission control at the edges of the network; a packet is injected into the network only if the destination is expected to be able to accept the packet. The basic idea behind NIFDY is that each processor is allowed to have at most one outstanding packet to any other processor unless the destination processor has granted the sender the right to send multiple unacknowledged packets. Further, there is a low upper limit on the number of outstanding packets to all processors.

When the network is running within its operating range, software overhead represents the largest cost in message transmission. Some of this overhead arises in matching the functionality of the network fabric to the application requirements. NIFDY removes the overhead required for reordering packets by delivering packets to the processor in the order in which they were sent. This allows network designers to exploit various techniques, e.g. adaptive routing, to increase network performance without imposing additional overhead on the applications.

In the rest of this section we explain our assumptions about the underlying network and present the basic design of NIFDY. In Section 2 we present the complete design of NIFDY, its implementation cost, and how it interacts with the processor and network. Section 3 describes the simulator which was used to verify the performance of NIFDY and Section 4 presents the results we obtained from it. In Section 5 we compare our approach to previous work on network design. In Section 6 we propose some extensions to NIFDY to handle unreliable networks and networks of workstations.

1.1 The Underlying Network

NIFDY is designed primarily for MPPs where the processors are tightly coupled by a fast, shallow interconnect. Unlike previous work to increase performance of such systems, our approach does not presuppose a particular kind of network or router. We assume only that once the network has accepted a packet it will eventually be delivered to its destination, if processors continue to accept packets. (In Section 6 we show how NIFDY can be extended to handle unreliable networks.)

On most networks proposed for MPPs, the main source of performance degradation is congestion. Congestion can be caused in two places: at the end-points and internally in the network fabric. End-point congestion arises when packets arrive at a node faster than the node can process them. Internal congestion can arise for several reasons. First, a number of senders may combine to generate traffic that exceeds the network's bisection bandwidth. Second, hot spots in the network may cause unnecessary blocking and reduce utilization. Third, faults in the network may restrict the available bandwidth. Finally, end-point congestion can cause congestion internal to the network; we call this secondary blocking.

To handle hardware faults and transient congestion, many network topologies—e.g. fat trees and multibutterflies—provide multiple paths that spread out traffic between nodes. While networks with alternative paths can provide some congestion tolerance, they don't solve the problem entirely and can even aggravate it. If there is no direct feedback to the sending node (as is the case in most, if not all, MPP networks), then backpressure is the only mechanism to stop the sender from sending packets. In this case, adaptive routing may fill up the network buffers along all possible paths between the sender and the bottleneck, causing extreme secondary blocking.
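The operating-range behavior described above (delivered throughput saturating while queues grow without bound once the offered load exceeds what the service end can drain) can be illustrated with a toy single-queue model. This sketch is ours, not the paper's simulator; the Bernoulli arrival and service processes and the `service_rate` parameter are assumptions for illustration only.

```python
import random

def run(offered_load, service_rate=1.0, cycles=20000, seed=0):
    """Toy single-server queue. Each cycle a packet arrives with
    probability `offered_load` and, if the queue is non-empty, one
    departs with probability `service_rate`. Returns the delivered
    throughput (packets per cycle) and the mean queue length."""
    rng = random.Random(seed)
    queue = 0
    delivered = 0
    queue_sum = 0
    for _ in range(cycles):
        if rng.random() < offered_load:
            queue += 1                    # packet injected
        if queue and rng.random() < service_rate:
            queue -= 1                    # packet consumed by the receiver
            delivered += 1
        queue_sum += queue
    return delivered / cycles, queue_sum / cycles
```

Running with a service rate of 0.5 shows the two regimes: below the operating range throughput tracks the offered load and the queue stays short; above it, throughput is pinned at the service rate while the backlog (the analogue of packets piling up in the network) grows.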
To avoid secondary blocking, when a node is sending to a blocked or overloaded receiver it must stop or slow its injection of packets into the network. Two schemes have been proposed to accomplish this: rate-based flow control (RBFC) and credit-based flow control (CBFC). RBFC limits each sender to a rate that is known not to induce secondary blocking, assuming the receiver is pulling packets out of the network. CBFC gives each sender a credit of packets that it can inject before secondary blocking will happen. The problem with both these schemes is that MPP traffic is bimodal—processors are usually sending either at full speed or hardly at all. With RBFC a fixed rate will not properly utilize the network; for instance, the optimal rate when only one sender is active is different from the optimal rate when all senders are active. While CBFC solves this problem, it does not eliminate secondary blocking if many senders have accumulated credits and simultaneously send a burst of traffic. Furthermore, RBFC requires negotiation between the sender and the receiver or requires global information to be maintained, while CBFC requires overhead in the network for maintaining credits, possibly on a per-receiver basis. These costs, combined with the bimodality of MPP traffic, have prevented designers from using RBFC or CBFC in MPP networks.

The price of randomized routing techniques is that packets may be delivered out of order. Even meshes and tori using dimension-order routing may deliver packets out of order if they utilize multiple virtual channels to alleviate congestion [Dal90]. For medium-sized transfers on the CM-5, [KC94] showed that reconstructing the original transmission order accounted for as much as 30% of the total transfer time. The Synoptics ATM Switch routes packets adaptively within the switch and then reorders them before they leave the switch. This link-by-link reordering increases the latency of the switch by a factor of five [BT89]. From these observations we conclude that reordering should be performed only once, at the destination, and that if possible the reordering should be performed in hardware.

1.2 NIFDY in a Nutshell

NIFDY is a network interface that uses admission control to perform both in-order delivery and end-to-end flow control. To handle the bimodality of MPP traffic and ensure in-order delivery, NIFDY has two communication modes. The default case is scalar mode in which only a single packet can be outstanding to a given destination processor. To allow for higher bandwidth communication, a processor can request bulk mode which, if granted, gives the sender extra credits that can be used for communicating only with the granting processor.

Each processor can send only one scalar packet at a time to any other processor. For every scalar packet sent, the destination processor number is recorded in an outstanding packet table (OPT). Until an acknowledgment (ack) is received from the destination processor and the entry in the OPT is cleared, NIFDY will not inject any more packets bound for that processor. However, if the OPT is not full, it can send a scalar packet to a different destination. This keeps packets from one processor to another in order.

Clearly, there is no way that packets can become disordered in the network if there is at most one outstanding packet between each sender/receiver pair at any instant. The basic flow control is also evident: if the receiving node is ignoring the network, or for some other reason is not pulling packets out of the network rapidly, the sender will not get its ack and will refrain from sending any more to that node. This also provides a mechanism for congestion control within the network; if the packets or acknowledgements between a sender/receiver pair must cross a hot spot, the round-trip delay (and thus the delay between consecutive packets sent to the same destination) will increase, throttling the bandwidth of conversations and reducing congestion. In this way NIFDY reduces end-point congestion and adjusts to hot-spots, the bisection bandwidth, and possible faults.

For shallow networks, the round-trip latency is smaller than the time it takes to inject a packet into the network, and no extra latency is noticed even for consecutive sends to the same destination. By keeping the OPT small enough, we can adjust for network volume, ensuring that secondary blocking is reduced. If a processor is not responding to the network, each processor will send at most one packet to it; no further packets will be sent until the destination processor wakes up and accepts a packet. At this point NIFDY will send an ack, allowing another packet to be sent. NIFDY also incorporates an outgoing buffer pool which reduces head-of-line blocking in the network interface. Thus, if several messages are ready to go to different processors, they can be interleaved up to the limit of the OPT.

To accommodate deeper networks and large round-trip times, the NIFDY protocol has a transfer mode in which multiple unacknowledged packets can be in transit between two processors. In the case where the sender has multiple packets to be sent to a single destination it can request a bulk dialog. If the receiver grants such a dialog in the ack, then the sender can send more than one packet per ack. By limiting the number and size of bulk dialogs a receiver will grant, we can again limit secondary blocking even for bimodal traffic.

In short, NIFDY implements a simple extension to network interfaces that allows increased flexibility in network design while limiting congestion and decreasing software overhead. NIFDY's resource requirements increase with desired performance, not with the number of nodes in the machine.

2 The NIFDY Unit

NIFDY is a network interface that increases system performance by decoupling the processor and the underlying network fabric. The processor sends packets by inserting them into NIFDY; NIFDY then takes over and injects them into the network at the earliest opportunity, according to the protocol described below. NIFDY handles flow control, ordering of packets, and, if extended for unreliable networks, packet retransmission. In this section we describe the basic design, the parameters to tune NIFDY to match the processor and the network, and the implementation costs of NIFDY. The ideas in NIFDY can be added to any network interface.

NIFDY distinguishes two types of network data packets, scalar and bulk. Scalar packets are best used for short messages while bulk packets are best used for large block transfers. In addition, NIFDY generates acknowledgment (ack) packets, which are used to keep packets in order and to provide access control. Every scalar packet is acked individually and bulk packets are acked using a sliding window protocol. The ack packets share the same network as the data packets, but are consumed by the receiving NIFDY. For every scalar packet sent, the destination processor number is recorded in an outstanding packet table (OPT). Until an acknowledgment (ack) is received from the destination and the entry in the OPT is cleared, NIFDY will not inject any further packets destined for that processor. However, if the OPT is not full, it can send a packet to another processor.

The restriction of having only one outstanding packet may seem excessive at first, but for the types of low-latency tightly-coupled multiprocessor networks we are considering, it has little effect on throughput.¹
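The scalar-mode admission rule just described (at most one unacknowledged packet per destination, at most O outstanding packets overall, with acks clearing OPT entries) can be sketched in a few lines. This is a hypothetical software model for illustration, not the paper's hardware; the names `ScalarNifdy` and `try_send` are ours.

```python
class ScalarNifdy:
    """Sketch of NIFDY's scalar-mode admission control."""

    def __init__(self, opt_size):
        self.opt_size = opt_size   # O: entries in the outstanding packet table
        self.opt = set()           # destinations with an unacked packet

    def try_send(self, dest):
        # Refuse if this destination already has an outstanding packet
        # (this preserves per-pair ordering) or if the OPT is full
        # (this bounds total packets in flight from this node).
        if dest in self.opt or len(self.opt) >= self.opt_size:
            return False
        self.opt.add(dest)         # record the destination in the OPT
        return True

    def ack(self, dest):
        # An ack from `dest` clears its OPT entry, making the next
        # packet for that destination eligible.
        self.opt.discard(dest)
```

For example, with an OPT of two entries a second packet to the same destination, or a packet to a third destination, is refused until an ack frees an entry.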
[Figure 1: Block diagram of the NIFDY unit with support for bulk dialogs.]

¹We are relying on the fact that wormhole or cut-through routing will be used, so that in the absence of contention, the head of the packet can often reach the destination before the tail has even left the source. Since the ack can be sent as soon as the header of the incoming packet is processed, in many cases the sender will receive an ack for the packet it is currently sending. We will look at this issue more in Section 2.4. In fact, the number of outstanding messages per processor for these networks under lightly loaded conditions is often less than one [Cul94].

When a network has a high round-trip latency, sending multi-packet messages as scalar packets may not fully utilize the network. We overcome this by sending multi-packet messages using the bulk protocol. In this protocol, the sender requests a bulk dialog, which, if granted, allows the sender to have more than one packet outstanding to the destination. Although the network may deliver the multiple outstanding packets out of order, the receiving NIFDY puts them back in order before presenting them to the processor.

Instead of piling up in the network, packets are blocked in the sender's NIFDY. This reduces secondary blocking and increases throughput. Since both the packet and the ack have to traverse the network, any hot spot or network congestion will slow down both, delaying injection of more packets into the network. NIFDY usually reacts to a slow receiver or network congestion long before packets back up all the way to the node's network port; this has the key benefit that NIFDY can start sending to other ready destinations. By contrast, if backpressure is the only way of telling when to slow down, a sender will continue injecting packets to a slow receiver until its entrance to the network is blocked, at which point it is usually blocked from sending to any other destination. In fact, we expect that by reducing secondary blocking NIFDY will enhance the value of adaptive routing, since alternative paths will be available more often.

2.1 Protocol Implementation

Networks have different characteristics which affect the amount of traffic that they can handle before congestion reduces throughput. Thus, for best performance, NIFDY will have to be tuned for each network. This is done by adjusting four parameters:

O: Size of the outstanding packet table (OPT).
B: Size of the outgoing buffer pool.
D: Maximum number of bulk dialogs each receiver can maintain simultaneously.
W: Receiver window size for the bulk dialog protocol.

For most shallow networks, the most important parameters are O and B. If the OPT is large, then the processor can have more outstanding packets in the network. To reduce head-of-line blocking at the sending NIFDY unit, there can be a pool of buffers to hold outgoing packets. As long as the OPT is not full, any eligible packet in the pool (we define eligibility below) can be sent. This allows the processor to interleave small packet streams for multiple processors.

The parameters D and W determine the number and size of bulk dialogs. Each sender can maintain only one outgoing bulk dialog, although it can send packets in non-bulk mode to other destinations concurrently with a bulk dialog. Each receiver can maintain D incoming bulk dialogs, each with a different sender. For each bulk dialog, W packet buffers are available in hardware at the receiver to provide storage for the sliding window protocol.
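The four tuning parameters can be gathered into a small record for experimentation. This is a modeling convenience of ours, not a structure from the paper; the example values are the initial guesses that Section 2.4.3 derives for the 8-by-8 wormhole mesh.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NifdyParams:
    """The four NIFDY tuning parameters (illustrative model only)."""
    O: int  # entries in the outstanding packet table
    B: int  # size of the outgoing buffer pool
    D: int  # bulk dialogs each receiver maintains simultaneously
    W: int  # receiver window size per bulk dialog

    def receiver_buffers(self) -> int:
        # W hardware packet buffers are reserved per incoming bulk dialog,
        # so a receiver dedicates D * W buffers to bulk reordering.
        return self.D * self.W

# Initial guesses for the 8-by-8 wormhole mesh (Section 2.4.3).
mesh = NifdyParams(O=4, B=4, D=1, W=2)
```

A restrictive configuration like this one trades peak pairwise bandwidth for predictability on a low-volume, low-bisection network.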
2.1.1 Scalar Packets

Figure 1 is a block diagram of the NIFDY unit. (This figure also shows extensions for the bulk protocol, which will be explained later.) Packets enter NIFDY from the processor if there is an empty buffer in the outgoing pool. To maintain the correct transmission order of packets to the same destination, the rank/eligibility unit ranks each packet in the pool relative to the other packets for the same destination. The rank value indicates how many other packets there are in front of it. When a packet arrives at the pool, its rank is assigned based on the contents of the pool and the OPT: the rank is one plus the number of waiting and outstanding packets for the same destination. Whenever an ack from a processor is received, all packets in the pool to the same processor have their rank decremented by one, bringing to zero the rank of the next packet to be transmitted (making it "eligible").

When the network can accept another packet, and there is a free entry in the OPT, and there is at least one eligible packet in the buffer pool, then one of the eligible packets is chosen for sending. The chosen packet is injected into the network, and the destination processor number is recorded in the OPT. Until an ack is received from that processor, no further packets are eligible for transmission to it. Note that every packet must contain the source processor ID in its header so that the destination processor can return an ack.

When a data packet is received from the network, it is inserted into the arrivals FIFO buffer. When it is accepted by the processor an ack is returned.²

²An alternative, but surprisingly less effective, strategy is to send the ack earlier, when the packet is inserted into the arrivals FIFO.

2.1.2 The Bulk Protocol

Figure 1 also shows how NIFDY handles bulk dialogs. The header of each packet includes a bulk-request bit. A sender requests bulk mode by setting the bulk-request bit in the header of a non-bulk packet. The receiver grants bulk mode to the sender by including a bulk dialog number in the ack it returns. A receiver may maintain multiple bulk dialogs, so it must give each active sender a different dialog number. If the receiver can't grant bulk mode because it is already participating in the maximum number of bulk dialogs, the ack will indicate the rejection to the requesting sender. In this case, the sender will continue sending its data using scalar packets, and can continue requesting bulk mode, which may eventually be granted if a bulk dialog slot becomes available.

When a node sends to a receiver that has granted it a bulk dialog, it does not insert the receiver's ID into the OPT; instead the rank/eligibility unit tracks the outstanding packets for the bulk dialog. The multiple outstanding packets may arrive at the receiving NIFDY out of order; hardware buffers provide a place to store such packets until the intervening ones arrive. Packets that arrive in order are not held up and can be streamed to the processor immediately via cut-through buffering. Sequence numbers, which need only be as large as W, are included in the header of each packet to provide ordering information. A {sequence number, dialog number} pair replaces the bits that would have been used as the source identifier. The NIFDY unit at the receiving end replaces the dialog number in the header with the source identifier before giving the packet to the processor.

A sender exits bulk mode by setting a bulk-exit bit in the header of the last packet. A receiver can also terminate a bulk dialog, in which case the transmission continues in scalar mode.

2.2 Software Issues

To get full performance out of NIFDY, the software communication layer must take into account three features of NIFDY. First, the processor must initiate bulk mode requests; NIFDY won't attempt bulk mode on its own.³ Second, every packet includes its source node address. Finally, packets are delivered in the order in which they are sent.

³Of course, NIFDY could be extended to set the bulk-mode request bit automatically based on the locally observed traffic pattern; we have not investigated this possibility in depth.

In order to utilize the bulk mode of NIFDY, the communication layer will have to turn on the bulk-mode request bit in the header of outgoing packets. The designer will have to decide what size transfers will request bulk mode. If the size is too small, the resources might go to the wrong sender. If too large, unnecessary delays will result.

Since the NIFDY protocol requires an ack to be returned to the sender, the sender's address is encoded in the header of every packet. If this is exposed to the receive handlers, then the source node never needs to be included in the data portion of the packets. For instance, 51% of the request messages in the Split-C library include the source processor ID in the message. The generic active message specification requires that all request messages include the source ID [Mar]. In all these cases, the source ID required in the packet header by NIFDY could be put to good use. Thus, NIFDY's requirement of including the source ID in every packet does not actually increase overhead.

Because the messages are delivered in order, large transfers can be accomplished without requiring a round trip to initialize the destination processor's data structures or buffers. The first message can initialize the destination processor while subsequent messages contain the data. The payload per packet is increased because later packets need not include any bookkeeping information.

If the extension in Section 6.1 is implemented, then messages that expect replies could be marked as not requiring an ack. Instead, the reply itself would serve as an ack. The reply could also be marked as not needing an ack, reducing the overhead of acks to those cases where the sender is unsure whether the receiver can respond.

2.3 Implementation Cost

Aside from the control logic, which is relatively small in terms of chip area, there are three sets of buffers and two content-addressable memories needed to implement NIFDY. The buffers can be implemented using single-ported RAM, taking up less area per bit than typical three-ported register files. Thus the DW + B buffers needed can be implemented in a small space.

In order to implement the outstanding packet table, a small content-addressable memory is required. The memory contains only the tags, which must be long enough to contain the node identifiers. The number of tags is equal to O, the maximum number of outstanding scalar packets. As shown in Section 4.2, eight is usually more than sufficient. If we assume that 16 bits are enough for node identification (allowing 65536 different nodes), then we have a 16-bit by 8-entry content-addressable memory. The rank determination logic also requires a small CAM of size log of the number of nodes (e.g. 16) bits by B (e.g. 8) entries.

2.4 Parameter Selection and Performance Analysis

Initial estimates of the parameters for the NIFDY unit can be obtained by considering some parameters of the connected network and the expected traffic distributions. We will consider many distributions in Section 4. Here we give a flavor of how NIFDY would be tuned to a network by looking at network parameters and traffic between a single source/destination pair separated by d hops.
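Before turning to the numeric estimates, the receiver-side mechanism that the window parameter W controls (Section 2.1.2: in-order bulk packets stream through, while out-of-order ones are held in up to W buffers until the gap is filled) can be sketched as follows. This is an illustrative software model, not the hardware; the class and method names are ours.

```python
class BulkReceiver:
    """Sketch of the reorder buffer for one incoming bulk dialog."""

    def __init__(self, window):
        self.window = window     # W: hardware buffers for this dialog
        self.expected = 0        # next sequence number to deliver
        self.held = {}           # out-of-order packets, seq -> payload

    def receive(self, seq, payload):
        """Accept one packet; return the packets (possibly several,
        possibly none) that can now be presented to the processor."""
        delivered = []
        if seq == self.expected:
            # In-order packet: stream it through immediately.
            delivered.append(payload)
            self.expected += 1
            # Drain any held packets that are now in order.
            while self.expected in self.held:
                delivered.append(self.held.pop(self.expected))
                self.expected += 1
        else:
            # Out-of-order packet: buffer it until the gap is filled.
            assert len(self.held) < self.window, "window overflow"
            self.held[seq] = payload
        return delivered
```

For instance, if packet 1 arrives before packet 0, it waits in a buffer; the arrival of packet 0 then releases both in order.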
Table 1 defines the parameters we are using. Without the NIFDY unit, the maximum bandwidth between two nodes in the network is

    Bandwidth = P / max(T_send, T_receive, T_link)    (1)

which expresses that the bandwidth can be limited by the send overhead, the receive overhead, or the physical bandwidth.

Parameter    Meaning
P            Packet payload in bytes
N            Number of nodes
d            Distance to destination in hops
T_send       Total time for processor to send packet (software overhead)
T_receive    Total time for processor to receive packet (software overhead)
T_link       Total time for one packet to cross a link along the path from source to destination in the absence of contention (i.e. hardware bandwidth limitation on interpacket arrival times)
T_ackproc    Total latency involved in generating and processing ack
T_roundtrip  Total latency from the time the header of the packet leaves the NIFDY unit to the time the ack has been processed

Table 1: Network characteristics influencing selection of NIFDY parameters.

2.4.1 Scalar Mode Parameters

When the NIFDY unit is included, the critical network parameter is the packet latency. In most networks this latency is a function of d, the number of hops between the nodes, so we will write latency as T_lat(d). The time from when a packet starts leaving until the ack is received and processed, defined as T_roundtrip(d), can be calculated as

    T_roundtrip(d) = 2 T_lat(d) + T_ackproc    (2)

where T_ackproc is the time it takes the NIFDY unit to generate and process the ack at both ends. Because the sending node must wait until it gets the ack before sending the next packet to the same node, packets can be sent no faster than once every T_roundtrip(d) cycles. To attain full bandwidth between two nodes separated by d hops using the basic NIFDY protocol (with no bulk dialogs), we need T_roundtrip(d) ≤ max(T_send, T_receive, T_link).

2.4.2 Parameters for Bulk Dialogs

When bulk dialogs are included because pairwise bandwidth would be unnecessarily limited using the basic protocol, we can use similar calculations to decide the size of the window. For simplicity, we will assume that T_receive is the limiting factor. If this is not the case, then T_link or T_send would be substituted.

We use a sliding window protocol in which acks are combined so that only one ack is sent for every W/2 packets. (Recall that W is the receiver window size.) In this case an ack will be sent only when all of the packets in that half of the window have arrived; to avoid bandwidth restriction, this ack must get back to the sender before all of the packets in the other half of the window have been injected. The round-trip time must be less than or equal to the injection time for W/2 + 1 packets. (The +1 is there because the round-trip time overlaps with the injection of the last packet from the other half of the window.) To reach maximum throughput we need

    (W/2 + 1) T_receive ≥ T_roundtrip(d),  i.e.  W ≥ 2 (T_roundtrip(d)/T_receive − 1)    (3)

We could instead have used a sliding window protocol in which every packet is acknowledged as it is received. In this case, to reach maximum throughput we have

    W ≥ T_roundtrip(d)/T_receive    (4)

D, the parameter controlling the number of bulk dialogs per receiver, is normally set to one. However, in the unlikely event that the send rate is much slower than the receive rate, it would be desirable to increase D to the maximum point at which one receiver can handle D senders without falling behind.

When choosing parameters for the bulk dialogs, performance under light traffic loads must be balanced with performance under heavy traffic loads. Less restrictive parameters (more bulk dialogs, larger windows) will give better performance with light traffic but may lead to excessive congestion when all processors try to send simultaneously. More restrictive parameters will give better, more predictable performance with heavy traffic, but may unduly restrict light traffic.

Network characteristics determine at what point generous NIFDY parameters lead to congestion. A small network volume means that a few extra packets will cause congestion more quickly. Also, a small bisection bandwidth means that excess packets are more likely to get blocked within the network, compounding congestion. Note that if a slow receiver, rather than bisection bandwidth, is the bottleneck, bulk packets will wait in the reorder buffers and not add to network congestion.

2.4.3 Example Network Parameters

In this section we try to estimate good NIFDY parameters for two specific networks. We will assume that the NIFDY processing takes 2 cycles at each end, for a total of T_ackproc = 4. We will also assume that T_send is 40 cycles and T_receive is 60 cycles.

First we look at an 8-by-8 mesh using wormhole routing. Multiple virtual channels are not needed because it is a mesh, not a torus. The flit size used is one word (32 bits), and each flit buffer holds at most two flits. Our simulated mesh had a one-way latency of T_lat(d) = 4d + 14. With uniform traffic, the maximum and average internode distances are 14 and 6 hops respectively; hence Equation 2 gives maximum and average roundtrip latencies of 144 and 80 cycles respectively.

Since the limiting factor without the NIFDY unit would be the 60-cycle receive overhead, it is clear that the roundtrip latency of the basic NIFDY protocol will often be the limiting factor in pairwise bandwidth with an uncongested network. Thus it appears that using a bulk dialog may help. Equation 3 indicates that in order to hide the maximum NIFDY roundtrip latency of 144 cycles, we will need a bulk window size of W ≥ 2(T_roundtrip(d)/T_receive − 1). So we would want at least 2 packets, possibly 3 or 4 if we can afford to be generous.

This wormhole mesh has an exceptionally low volume: eight words per node (two words for each incoming link). Thus even if each node has only one eight-word packet in the network, the network will be full. This, combined with the mesh's low bisection bandwidth of √n, leads to a conservative decision regarding how many packets to allow on the network. An initial guess would have O = 4, B = 4, D = 1, and W = 2.
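The arithmetic behind these mesh estimates can be checked with a few lines. The helper names are ours; the constants (T_ackproc = 4, T_receive = 60, T_lat(d) = 4d + 14) are the values assumed above for the 8-by-8 mesh.

```python
def t_roundtrip(t_lat, t_ackproc=4):
    # Equation 2: two one-way latencies plus ack generation/processing.
    return 2 * t_lat + t_ackproc

def min_window(t_rt, t_receive=60):
    # Equation 3: W >= 2 * (T_roundtrip / T_receive - 1).
    return 2 * (t_rt / t_receive - 1)

# 8-by-8 wormhole mesh from Section 2.4.3: T_lat(d) = 4d + 14.
t_lat = lambda d: 4 * d + 14
rt_max = t_roundtrip(t_lat(14))   # maximum internode distance: 14 hops
rt_avg = t_roundtrip(t_lat(6))    # average internode distance: 6 hops
w_bound = min_window(rt_max)      # lower bound on the bulk window W
```

Evaluating these reproduces the 144- and 80-cycle round trips quoted above, and a window bound of 2.8 packets, consistent with wanting a window of a few packets.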
many packets to allow on the network.
0=4,
B=4,
D=l,
The other network
An initial
guess would have
Operations
and W=2.
we will consider is a full 4-ary fat tree of 64
nodes. With three levels of routers, the maximum
internode distance
I Processor cv~
Active message send
46
Actwe message poll (no message)
22
is 6 hops, and the average distance is not much less than that. In this
Active
case Tla~ = 5d + 2, giving
(dispatch, handle, return)
68 cycles.
sufficient,
a round-trip
latency of 32 + 32+4
Thus it appears that the basic NIFDY protocol
=
may be
and bulk dialogs will help only marginally.
Our simulated fat tree’s volume is 10 buffers per node, much
greater than that of the mesh. This large volume, along with the fattree’s large bisection bandwidth, means we can be less restrictive in
+
One-way latency (incl.
software)
from send to be~innin~
of handler
Table 2: Measured
allowing packets into the network. Thus, although bulk dialogs are
only marginally useful, they probably won ‘t hurt much either, The
main effort should be to reduce the restrictions on scalar packets
as much as possible. This can be done by making the OPT large
(O = 8 entries) and by making the buffer pool for waiting
large (B = 8 buffers) to reduce head-of-line blocking.
message receive
CM-5 parameters
I
56
d
I
I
used in our simulato~
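The latency figures quoted above can be checked with a few lines. This is our own sketch, not the paper's code; we assume Equation 2 has the form T_rt(d) = 2·T_lat(d) + 4, which reproduces every number in the text.

```python
# Back-of-the-envelope check of the round-trip latency estimates above.

def t_lat_mesh(d):
    return 4 * d + 14           # one-way latency on the 8x8 mesh

def t_lat_fat_tree(d):
    return 5 * d + 2            # one-way latency on the full 4-ary fat tree

def t_rt(t_lat, d):
    # Assumed form of Equation 2: out and back, plus a small fixed overhead.
    return 2 * t_lat(d) + 4

max_mesh_rt = t_rt(t_lat_mesh, 14)     # maximum distance: 144 cycles
avg_mesh_rt = t_rt(t_lat_mesh, 6)      # average distance: 80 cycles
fat_tree_rt = t_rt(t_lat_fat_tree, 6)  # 32 + 32 + 4 = 68 cycles
```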
3 Simulation

Empirical results were gathered using a parallel simulator written in C++ and executed on a Thinking Machines CM-5. In these experiments, the simulated objects are distributed across the CM-5 nodes and connected using links provided by the simulator framework. Most simulation parameters are supplied at run time, allowing easy exploration of the design space.

Each cycle is simulated explicitly and synchronously by all objects; at any time in the simulation, all objects have executed up to the same point. The only exception is when real Split-C programs are driving the simulator. In this case, the network simulations on the CM-5 nodes are still synchronous, but the computation on each node is allowed to run ahead (in simulated time) up to the next point where it interacts with the network. Synchronization between the network simulation and the computation is simplified because only polling message reception is allowed; thus the computation always initiates interaction with the network.

The simulator supports the following networks (as well as others not used in this paper⁴):

- Two- and three-dimensional meshes and tori utilizing wormhole routing with virtual channels. The size in each dimension, the number of virtual channels, and the buffer sizes are all run-time parameters. Links were one byte wide for all simulations reported here.

- A full 4-ary fat tree with 1-byte links, using either cut-through or store-and-forward routing.

- A fat tree more similar to the CM-5 [LAD+92]. Routers in the first two levels are connected to two parents rather than four, reducing bisection bandwidth as compared to a full 4-ary fat tree. Also, the link bandwidth was reduced to 4 bits per cycle, as in the CM-5 network.

- Multibutterflies, with adjustable dilation and radix. In this report we use a butterfly (dilation 1, radix 4) and a multibutterfly (dilation 2, radix 4).

All topologies support two logically independent networks, the request network and the reply network, in order to deal with fetch deadlock. With all topologies other than the CM-5 fat tree, the two networks are demand-multiplexed over the same physical links in order to make use of all available bandwidth even when the traffic is unevenly divided between the two logical networks. With the CM-5 fat tree, the two networks are strictly time-multiplexed every other cycle, so that each network is limited to eight bits every two cycles regardless of the traffic on the other network.

The simulator supports the following traffic loads:

- Pseudo-random, bursty traffic. Burst length distributions are adjustable, global barriers can be included between send bursts, and nodes can be programmed to enter "non-responsive" periods during which they neither send packets nor pull them from the network interface. Dedicated state for each pseudo-random number generator ensures that the same sequence of bursts is generated regardless of the network and NIFDY configuration used. Packet size is eight words including the header.

- The cyclic-shift all-to-all communication pattern described in [BK94]. This and the following two traffic patterns use the CMAM and Split-C libraries from the CM-5, and thus use six-word packets (for all networks, not just the CM-5 imitation).

- EM3D, an irregular electromagnetic application [CDG+93].

- Radix sort, which uses single-packet messages both for counting the keys and for transferring each key to its appropriate destination [Dus94].

When the NIFDY units are included in the simulation, all NIFDY parameters are adjustable. An option allows the NIFDY units to be included but disabled. In this configuration, the extra buffering in the outgoing message pool and the arrivals queue of the NIFDY units can still be utilized. This allows us to separate the effects of the NIFDY protocol itself from the benefit of simply having extra buffering. When comparing NIFDY to buffering only, the same total amount of buffering is always used, although in order to make the fairest comparison it is redistributed to be most effective for each case. For example, with the NIFDY protocol, the capacity of the arrivals queue is at most two packets; without the protocol, best performance results from allocating at least half of the total buffering resources to the arrivals queue. Of course, the acks used in the NIFDY protocol are included directly in the simulations, competing with data packets for network bandwidth.

For realistic timings in our simulations, we ran several tests on a real CM-5 to estimate packet sending and receiving overheads as well as CM-5 network latency and bandwidth. These parameters, summarized in Table 2, agree closely with those reported in [vE93].

⁴The simulator is available at ftp://ftp.cs.berkeley.edu/pub/packages/nifdy/nifdy.html
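The cycle-synchronous execution model described above can be sketched in a few lines. This is our own minimal illustration, not the authors' C++ simulator: every object is stepped once per cycle, and a value sent on a link becomes visible to the receiver only on the following cycle.

```python
# Minimal sketch of a cycle-synchronous simulation: all objects see the
# same cycle, and link outputs become inputs only on the next cycle.

class Link:
    def __init__(self):
        self.now = None        # value visible this cycle
        self.nxt = None        # value written this cycle
    def tick(self):
        self.now, self.nxt = self.nxt, None

class Producer:
    def __init__(self, link):
        self.link = link
    def step(self, cycle):
        self.link.nxt = cycle  # send the current cycle number

class Consumer:
    def __init__(self, link):
        self.link = link
        self.received = []
    def step(self, cycle):
        if self.link.now is not None:
            self.received.append((cycle, self.link.now))

link = Link()
objects = [Producer(link), Consumer(link)]
for cycle in range(3):
    for obj in objects:        # every object executes up to the same point
        obj.step(cycle)
    link.tick()                # outputs become visible next cycle
```

The one-cycle delay through `Link` is what makes the lockstep order of `step` calls within a cycle irrelevant, mirroring the synchronous model in the text.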
4 Results

4.1 Synthetic Workload

To learn which NIFDY parameters were best for which networks and to measure the overall effectiveness of NIFDY's flow control, we ran many simulations for each network. Because performance at both heavy and light network loads is important, we used two different traffic patterns for these runs: one which rewards graceful handling of heavy traffic loads, and one which rewards rapid packet delivery under light traffic.

Both traffic patterns consist of phases separated by barriers. A node that is sending during a phase will attempt to send its packets (typically 100 to 300 of them) as quickly as possible. Processors send single- or multi-packet messages; all the packets in a single message are sent consecutively and to the same destination. At the end of a message, a sender randomly chooses a new destination and message length and immediately starts sending to the new destination.

To ensure that every node makes progress sending, no node can start the next phase until all sending nodes complete the current phase. As with real MPP bulk-synchronous applications, if some nodes are favored by the topology and are able to send their outgoing data quickly, sooner or later they will have to wait until the other nodes catch up. In a bulk-synchronous application the bottom line is how quickly each communication phase is completed; thus our metric is the number of packets delivered within a fixed number of cycles. Note that this metric measures only the benefit of reduced network congestion; the in-order delivery provided by NIFDY will give an additional bandwidth benefit, dependent upon the application (see Section 2.2).

In the heavy traffic pattern, all nodes send each phase, and message lengths — the number of consecutive packets a processor sends to its destination before changing to a new destination — are uniformly chosen from one to five packets. In the light traffic pattern, each node has only a 33% chance of sending each phase, reducing contention in the network. Since nodes are less likely to poll during light traffic, our simulated nodes periodically ignore the network; these periods of ignoring the network are triggered pseudo-randomly and independently for each node. With light traffic the message length distribution includes lengths of 10 and 20 packets; most messages are short, but long messages account for more packets overall. Thus the light traffic benchmark mainly measures pairwise bandwidth, with only some contention in the network and some possibility of target collisions (multiple nodes sending to the same receiver) and unresponsive receivers.

Figures 2 and 3 show the performance benefit of NIFDY for various networks under both traffic loads, comparing no NIFDY; buffering only (without the NIFDY protocol); and NIFDY using the best set of parameters for that network. The graphs compare packet throughput for each case, showing the benefit just from the reduced network congestion allowing more packets to get to their destinations. For the networks that deliver packets out of order, the actual benefit will likely be greater: NIFDY's in-order delivery can result in more payload per packet in multi-packet messages, and can also reduce the receive processing time.

Figure 2: Performance benefit from flow control of NIFDY for different networks: packets delivered in 1,000,000 cycles. "Heavy" synthetic traffic. Does not reflect additional benefit of in-order delivery from NIFDY.

Figure 3: Performance benefit from flow control of NIFDY for different networks: packets delivered in 1,000,000 cycles. "Light" synthetic traffic. Does not reflect additional benefit of in-order delivery from NIFDY.

The best NIFDY parameters, chosen to give the best average performance with both test traffic patterns, are shown in Table 3. The ideal NIFDY parameters for the fat-tree variations are less restrictive than those for the meshes; fat trees have greater bisection bandwidth, greater volume, and more alternative paths between nodes, so that having a few extra packets in the network does not hurt as much as it does with the mesh. The CM-5 network has smaller bulk windows than the full fat tree even though its round-trip latency is twice as great; this is because of the CM-5 network's smaller volume and bisection bandwidth, which makes congestion a more important factor. Finally, observe that the butterfly is the only network where it is best to have no bulk dialogs: every packet travels only three hops, resulting in very low round-trip latency, and there are no alternative paths between nodes, making congestion avoidance more critical.

4.2 Scalability

Is it necessary to increase the size of the OPT or the outgoing buffer pool (O or B) as the number of nodes in the network gets larger in order to maintain the same relative benefit from NIFDY? This would be an undesirable finding, since we would like NIFDY to be scalable: we don't want to have to make all the NIFDY units bigger when we increase the number of nodes in our MPP.
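A generator for the two synthetic patterns described in Section 4.1 can be sketched as follows. This is our own illustration (the function and parameter names are ours, not the simulator's): in the heavy pattern message lengths are uniform on 1..5, while the light pattern's length distribution also includes 10- and 20-packet messages.

```python
import random

# Hypothetical sketch of the synthetic workload's message stream for one
# sending node during one phase.

def messages(n_nodes, me, heavy, n_packets, rng):
    """Yield (destination, message_length) pairs until n_packets are sent."""
    sent = 0
    while sent < n_packets:
        dest = rng.randrange(n_nodes - 1)
        if dest >= me:                      # random destination other than self
            dest += 1
        if heavy:
            length = rng.randint(1, 5)      # uniform on one to five packets
        else:
            length = rng.choice([1, 2, 3, 4, 5, 10, 20])
        length = min(length, n_packets - sent)
        yield dest, length                  # packets sent consecutively to dest
        sent += length
```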
Table 3: Characteristics of simulated 64-node networks along with best NIFDY parameters for each network, used in Figures 2 and 3. d is the number of hops; L is packet length in bytes.

    Network                                   T_lat(d)    Max volume           D   W   O   B
                                                          (packets per proc.)
    Full Fat Tree                             5d+2        10                   1   8   16  8
    CM-5 Network                              9d+2        6.5                  1   4   8   8
    Store-&-Forward Fat Tree                  (L+1)d+3    10                   1   4   8   8
    8x8 Mesh, 1 virt. ch., 2-flit buffers     4d+14       1.4                  1   2   8   4
    8x8 Torus, 2 virt. ch., 4-flit buffers    4d+14       6                    1   2   8   4
    4x4x4 Mesh, 2 virt. ch., 4-flit buffers   4d+18       6                    1   2   8   4
    4-ary Butterfly                           5d+2        10                   0   0   8   8
    4-ary Multibutterfly                      5d+2        10                   1   4   8   8
To answer this question we ran some simulations of the full fat tree, using only short messages and no bulk dialogs in order to concentrate on the effects of O and B. The first part of Figure 4 shows throughput (normalized to a network without NIFDY) vs. machine size for different values of B. In general, increasing B gives better performance for any size network. However, for a fixed B, the relative benefit of NIFDY does not decrease, and in most cases increases, as the size of the MPP grows. This result means that a system designer can choose B once, depending on the desired performance and cost, and then can expect to maintain that performance benefit even as the MPP scales to large sizes.

The second part of Figure 4 shows normalized throughput vs. machine size for different values of O. The most important thing to see from this graph is that O = 8 is the best parameter across all machine sizes except for the largest we looked at (where the best value of O is 4). While these figures may differ depending on network volume and other factors, we do expect NIFDY performance to stay constant or increase as the network size grows, while keeping the same small fixed parameters in NIFDY. In fact, the results should be even more favorable on networks in which the bisection bandwidth does not scale linearly with the number of nodes, such as a two-dimensional mesh. In these cases the per-node bandwidth would have to decrease as the machine size grows in order to avoid congestion at the bisection, making smaller values of O and B more desirable.

Figure 4: Throughput for various O and B on various sized fat trees (left: varying pool size B and network size; right: varying OPT size O and network size).

4.3 Cyclic Shift

In this subsection we consider a specific traffic pattern, the cyclic shift (C-shift) studied in [BK94], which performs all-to-all communication. We implemented this traffic pattern using the "real traffic" interface to our simulator and the CM-5-style network in order to make comparisons with [BK94].

The C-shift communication pattern consists of P − 1 phases. In the first phase, processor i sends to processor (i + 1) mod P; in phase p, processor i sends to processor (i + p) mod P; and so on until p = P − 1. As long as the phases remain separate, each receiver is matched with exactly one sender. However, as observed in [BK94], some nodes may finish the current phase early and move to the next phase, resulting in one node receiving from two senders. This slows the progress of both senders, allowing other senders to catch up and aggravating the condition. Figure 5 shows the number of packets in the network for each receiver as the pattern progresses, clearly indicating the accumulation of packets outside certain receivers. One solution, used in Strata [BK94], is to insert global barriers between phases.

Results are summarized in Figure 6. Using NIFDY's congestion control alone results in better performance than optimized barriers. When NIFDY's in-order delivery is exploited, the benefit is even greater. These results can be explained by looking at Figure 5. Some piling up does occur with NIFDY (due to the different path lengths between different pairs of nodes), but these perturbations dissipate and the network returns to even utilization of all receivers. This dissipation occurs because the "rightful" sender to a receiver has the advantage that it owns the bulk dialog to that receiver. Thus it will be allowed to finish rapidly and move on to the next receiver; at that point the sender behind it can attain the bulk dialog. Although this effect is dependent on NIFDY parameters and network characteristics, in all cases performance was much better with NIFDY than with nothing at all.

4.4 EM3D

EM3D, a program for solving electromagnetic problems in three dimensions and a common parallel benchmark [CDG+93], was also used to drive our simulations. The results of our simulations for a number of different networks are summarized in Figure 7 (for the light network load) and Figure 8 (for the heavy network load). For networks that deliver packets out of order, two NIFDY results are presented: one which gives the benefit just from the flow control ("NIFDY−"), and another in which the Split-C library that interfaced to our network simulator was altered to take advantage of the in-order delivery provided by NIFDY. For networks that deliver packets in order (the 2D mesh and the butterfly), the library intended for in-order delivery was used for all runs.
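As a concrete check of the cyclic-shift pattern from Section 4.3, the destination function can be written directly. This is our own sketch (the name `cshift_dest` is ours): as long as the phases stay separate, each phase's destinations form a permutation, so every receiver is matched with exactly one sender.

```python
# C-shift: in phase p, processor i sends to (i + p) mod P.

def cshift_dest(i, p, P):
    return (i + p) % P

P = 8
for p in range(1, P):
    dests = [cshift_dest(i, p, P) for i in range(P)]
    # A permutation of 0..P-1: one sender per receiver in each phase.
    assert sorted(dests) == list(range(P))
```

The pile-up described in the text appears precisely when this phase separation is lost: a node that advances to phase p+1 early sends to a receiver that is still being fed by its phase-p sender.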
Without in-order delivery, the difference between NIFDY and the buffers-only configurations is negligible. Once the library takes advantage of the in-order delivery provided by NIFDY, it outperforms the buffers-only configuration in all cases.

Figure 5: Network congestion with C-shift: pending packets per receiver, without NIFDY (no barriers) and with NIFDY (one dialog, no barriers), as time progresses (in units of 100,000 cycles). Shading is interpolated between white for no pending packets and black for 20 or more pending packets. In both cases, the same number of packets are transferred, but NIFDY finishes earlier.

Figure 6: Throughput vs. block size for C-shift on a 32-node CM-5 network.

Figure 7: EM3D cycles per iteration with less communication. (In the computation graph generated by the parameters, most arcs are local to processors.) n_nodes = 200, d_nodes = 10, local_p = 80, dist_span = 5. NIFDY− reflects benefit from flow control only; NIFDY exploits in-order delivery as well.

Figure 8: EM3D cycles per iteration with more communication. (In the computation graph generated by the parameters, most arcs are between processors.) n_nodes = 100, d_nodes = 20, local_p = 3, dist_span = 20. NIFDY− reflects benefit from flow control only; NIFDY exploits in-order delivery as well.

4.5 Radix Sort

Finally, we ran simulations of a radix sort based on [Dus94]. Each iteration of radix sort consists of two communication phases: scan and coalesce. In the scan phase, a scan addition is performed across all processors for each bucket; this involves nearest-neighbor communication.
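The scan addition described above computes, for each bucket, each processor's offset from the counts of all lower-numbered processors. A sequential sketch of the dataflow (our own illustration; in the real sort the value flows neighbor to neighbor in a pipeline):

```python
# Sketch of the scan (prefix addition) step of radix sort.

def scan_phase(counts):
    """counts[i][b] = number of keys in bucket b on processor i.
    Returns offsets[i][b] = sum of counts[0..i-1][b]."""
    P = len(counts)
    buckets = len(counts[0])
    offsets = [[0] * buckets for _ in range(P)]
    for i in range(1, P):            # value passed from neighbor i-1 to i
        for b in range(buckets):
            offsets[i][b] = offsets[i - 1][b] + counts[i - 1][b]
    return offsets
```

The nearest-neighbor pipeline is what makes the phase sensitive to the send/receive timing discussed next.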
The most notable feature of this is that the overall communication phase runs faster if delays are inserted between successive sends. Without delays, the sends from one processor cause the next processor in the pipeline to continually receive with no chance to send, serializing the entire scan. We studied versions both with and without the delay. In the coalesce operation, the keys are sent to the appropriate destination using one message for each key; assuming a random initial key distribution, the one-packet messages are sent to a random sequence of destination processors.

Figure 9 shows results for the scan phase using an 8-bit radix on 64 processors. While adding delays between successive sends helped in all cases, it was more critical when NIFDY was not included. When NIFDY is included, its protocol causes the sender to slow down; this allows all the processors to continue to send as well as receive. Networks with higher latencies, e.g., the store-and-forward fat tree, get a bigger gain from NIFDY than those with lower latencies, like the full fat tree. This exemplifies what we found in many cases: the locally restrictive NIFDY protocol actually results in more global throughput.

Figure 9: Cycles for one scan phase of radix sort. "With Delay" indicates that artificial delays are inserted between consecutive sends to the same destination.

Results for the coalesce phase (not shown) were virtually identical with and without NIFDY. There was not enough congestion for NIFDY's flow control to help, and with this algorithm in-order delivery is not beneficial. On the other hand, NIFDY's restrictiveness did not hurt performance.

4.6 Discussion

NIFDY performs well for three real communication patterns which form the basis for many parallel programs. The NIFDY protocol may seem restrictive, but NIFDY's admission control reduces congestion in the network. Our results show that it delivers more packets than the same network without NIFDY, and roughly the same as when NIFDY's buffering is used without the protocol. When NIFDY's in-order delivery is taken into account, NIFDY is seen to give a clear benefit for all shallow networks.

NIFDY helps different networks in different ways. For networks with a single path between each sender and receiver, packets are already delivered in order, so NIFDY gives no additional benefit in that respect; however, such networks degrade rapidly in the presence of congestion, which NIFDY helps avoid. For networks with multiple routes between each sender and receiver, performance does not degrade as rapidly, because packets can travel alternate routes around hot spots. In such cases, NIFDY's main benefit is reordering the packets that otherwise would arrive out of order.

Finally, we saw that on most networks many of the communication patterns can be sped up by carefully crafted software techniques. Without either the techniques or NIFDY, many of these patterns ran poorly. With NIFDY, intelligent software techniques were useful, but they were not as important. In general, NIFDY provides a safety net when software network management techniques are not or cannot be applied.

5 Related Work

Flow control (FC) and congestion control (CC) have received much attention in LAN and WAN research. The larger packets and longer latencies in these types of network make software implementation of FC and CC protocols practical. Our method, being simple enough to implement in hardware, provides FC and CC for the type of low-latency, high-bandwidth networks where software protocols are not practical. Specifically, NIFDY does not require any intelligence within the network switches, and it does not require nodes to keep per-receiver connection or credit state.

In multiprocessor networks, the need to reduce software communication overhead while making good use of network bandwidth has inspired many attempts to "raise the functionality of the network" — usually by reducing congestion or providing in-order packet delivery. Most of these projects have taken a different approach from NIFDY's: they have added functionality to the network routers rather than just to the network interfaces.

The METRO router [DCB+94] provides in-order delivery while taking advantage of random wiring in expansion networks. The router is a dilated crossbar and is used as a building block for indirect expander networks such as multibutterflies and metabutterflies [CBLK94]. A sender attempts to make a connection randomly through successive dilated crossbars; if a connection attempt is blocked, the path is torn down, and the connection is retried later. Once a connection is established, it remains fixed, and thus transfers are in order. The cost of blocked connection attempts means that METRO must make sure that most connection attempts succeed; thus it is important to have large bandwidth throughout the network, probably much more than is needed to carry the average load. NIFDY allows network utilization closer to its theoretical maximum, while preventing the user from pushing the network out of its operating range. While METRO requires nontrivial intelligence at the transfer endpoints, its key characteristics arise from its router design. NIFDY, in contrast, can be used with a variety of networks.

Compressionless Routing (CR) [KLC94] also provides in-order delivery. CR, which relies on wormhole routing, pads packets with enough space to ensure that pushing the entire packet onto the network implies that the head of the packet has already entered the destination, at which point the packet is guaranteed to be completely consumed. If the packet cannot be pushed out within a preset amount of time, the transmission is aborted and the flits already in the network are killed. Abstractly, there are some similarities between CR and the basic NIFDY protocol. With both, there can be at most one unreceived packet in the network between any source/destination pair. In addition, CR also uses an ack, albeit an implicit one — lack of backpressure — which travels from the destination to the source on the switch-level ack control wires.

However, our implementation differs markedly. The explicit acks in NIFDY consume some bandwidth, but there is no need to add wasteful padding to short packets. The tearing down of packets due to blockage causes instability at high network load: the average amount of bandwidth consumed per successful transfer increases,
making the congestion worse. In contrast, our method performs best under high load and prevents the network from being pushed into a regime of declining throughput. NIFDY is fairly insensitive to our preset parameters; with CR, a poorly chosen timeout period may drastically affect performance (although our extension to NIFDY for handling dropped packets will have the same sensitivity in this respect). NIFDY is very general since it is logically separate from the network; it can be used with wormhole, cut-through, or store-and-forward routing, and can be added to an existing network with no change to the network itself. In contrast, CR can be used only with wormhole routing, and it requires the network routers and interfaces to support the killing of packets. However, CR, unlike NIFDY, can be used with networks that are not deadlock-free.

Finally, there are many software techniques that can be used to reduce network congestion [BK94]. These techniques, such as structuring communication as series of permutations allowing one-on-one transfers, are beneficial even with NIFDY. However, NIFDY will add robustness to the system and be especially effective with traffic patterns that are difficult for software to manage, in particular those with no global structure. In some cases the behavior of NIFDY with irregular traffic will automatically mimic the software techniques used with regular communication. For example, the software interleaves packets to different destinations; NIFDY effectively implements bandwidth matching — injecting packets at the rate at which the receiving processor is pulling packets out of the network, since the injection rate is the rate at which the sender receives acks. NIFDY also handles the more general case with multiple nodes sending to one receiver, returning acks only at the rate at which the receiver accepts packets. This throttles the combined rate of all the senders to a level that the receiver can handle. It would be difficult and expensive to implement such dynamic bandwidth matching in software.

6 Future Work

6.1 Changes to ack strategy

There are two changes to the current protocol that we would like to study: allowing acks to be combined with reply messages, which should reduce network traffic; and allowing packets that don't require acks.

In the protocols described in this paper, the ack packets are always generated by NIFDY. In many situations the user code will also send a reply message to the source processor. Instead of sending both a NIFDY-generated ack and a user reply we could piggyback the ack in the reply. This seems to be a good idea, since if the sender is waiting for a reply it probably won't have any other packets for the destination processor until the reply is received. Adding this protocol requires only an additional bit in the header and a comparator in the "to processor" block in Figure 1.

NIFDY could also be configured so that the processor indicates when it wants to bypass the NIFDY protocol. This could be done when the processor does not care about in-order delivery and knows that its packets will not contribute to congestion (or is willing to take the risk). Such packets would be eligible to be sent immediately, and would be handled at the receiver just like scalar packets except that no acknowledgements would be sent. This type of traffic could co-exist with traffic obeying the NIFDY protocol.

6.2 Networks of Workstations

We have shown how NIFDY increases the performance of reliable networks for MPPs. For shallower networks, scalar packets with a medium-sized OPT were shown to be sufficient due to the small round-trip latency. For deeper networks, we added bulk dialogs to overcome this latency. Here we extend NIFDY for networks of workstations, which may drop packets. Our goal is to make the network transparent to the application and for it to be scalable.

To handle networks that drop packets, the sender must be able to retransmit packets. In addition, the receiver must be able to distinguish and eliminate duplicate packets. To accomplish retransmission we add one timer and one message buffer per entry in the OPT and per outgoing bulk packet. The outgoing packet is copied into the buffer and the timer is set when the packet is sent. If an ack is received before the timer expires, the timer is reset and the buffer is freed for future use. If no ack is received before the timer goes off, the packet is retransmitted. To distinguish duplicate packets, one additional bit in the header is enough for both scalar and bulk packets.

This simple extension — an additional bit in each header and some additional state and buffering on each NI — allows NIFDY to hide the implementation details of the network from system software and user applications alike. We have used simple hardware to mask an exceptional condition (viz., the dropping of a packet), which should reduce software overhead at both the sender and the receiver. According to [KC94], this should reduce the cost of sending and receiving messages by 30% to 50%.

6.3 Further Experiments

In addition to the extensions proposed above, we believe that we have just begun to understand how the network parameters affect the throughput and latency of messages on the network. While we have a good understanding of the O and B parameters and how they interact with traffic patterns, we have yet to study the interaction among transfer lengths, W, and the optimal point for requesting a bulk dialog.

We also plan to extend the simulator to study how NIFDY interacts with adaptive routing on a mesh, which in the past has not performed well enough to justify its expense. Adding the admission control and in-order delivery of NIFDY may help adaptive routing reach its potential.

7 Summary and Conclusion

In this paper we have proposed a network interface, NIFDY, which increases network performance and decreases software overhead without restricting routing choices in the network. We have shown that it is possible to achieve these goals simultaneously by adding modest resources only at the network interface, and without having to push any functionality to the routers throughout the network.

In essence, the basic NIFDY protocol is an optimized credit-based scheme where every sender implicitly has one credit for each receiver. Because senders record only receivers with zero credits rather than maintaining state for all receivers, the resources consumed at each sender scale with the number of outstanding packets rather than the total number of nodes. Because credits are good only for a particular processor, the protocol can easily adapt to bimodal MPP traffic.

We built a general-purpose simulator to test these ideas.
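The implicit credit scheme summarized above can be sketched from the sender's side. This is our own illustration, not the hardware design: each receiver implicitly starts with one credit, the sender records only receivers whose credit is currently spent, and the OPT size bounds the total number of outstanding packets.

```python
# Sender-side sketch of NIFDY's implicit one-credit-per-receiver scheme.

class Sender:
    def __init__(self, opt_size):
        self.zero_credit = set()    # receivers with an outstanding packet
        self.opt_size = opt_size    # capacity of the outstanding packet table

    def can_send(self, dest):
        # One outstanding packet per destination, and a low global limit.
        return dest not in self.zero_credit and len(self.zero_credit) < self.opt_size

    def send(self, dest):
        assert self.can_send(dest)
        self.zero_credit.add(dest)       # credit consumed until the ack returns

    def ack(self, dest):
        self.zero_credit.discard(dest)   # ack restores the implicit credit
```

Note that the state grows with the number of outstanding packets (at most `opt_size` entries), not with the number of nodes, which is the scalability property claimed in the text.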
We verified the simulator against a real machine, and then used it to evaluate the performance of a NIFDY network interface attached to a variety of network fabrics — meshes, tori, butterflies, and fat trees, including the CM-5. We showed that on every network, and for all synthetic and real traffic patterns, NIFDY increased packet throughput to a level comparable to that of having added buffers. In addition, since NIFDY delivers packets in order, it increased the total payload delivered on all networks. For real traffic patterns we saw increases from 10% (under light loads in EM3D) to as much as 100% (under all-to-all transfers) over just having added more buffers.

Using the simulator, we also showed that the resources needed by NIFDY are constant (or decreasing) with respect to the number of nodes in the network. In particular, for all the networks studied, an outstanding packet table of size 8 combined with a packet pool of 16 and a single bulk dialog with a window of 8 were more than enough resources for even large machines. In fact, on most networks fewer resources than these gave better results. Thus, given the performance advantages of NIFDY, the small additional chip area needed over plain network interfaces is a worthwhile investment.

Acknowledgments

We are grateful to the anonymous referees for their valuable comments. We would also like to thank Krste Asanović, Eric Brewer, David Culler, Andrea Dusseau, Steve Lumetta, Klaus Erik Schauser, Nathan Tawil, and John Wawrzynek for their comments on earlier versions of this paper, and Su-Lin Wu for her contributions to early stages of this work. Computational support at Berkeley was provided by NSF Infrastructure Grant number CDA-8722788. Seth Copen Goldstein is supported by an AT&T Graduate Fellowship. Timothy Callahan received support from an NSF Graduate Fellowship and ONR Grant N00014-92-J-1617.

References

[Dal91] W. J. Dally. Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9):1016–23, Sept. 1991.

[DCB+94] A. DeHon, F. Chong, M. Becker, E. Egozy, H. Minsky, S. Peretz, and T. F. Knight, Jr. METRO: a router architecture for high-performance, short-haul routing networks. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 266–77. IEEE Computer Society Press, 1994.

[Dus94] Andrea Carol Dusseau. Modeling parallel sorts with LogP on the CM-5. Technical Report UCB//CSD-94-829, University of California at Berkeley, May 1994.

[Jac88] V. Jacobson. Congestion avoidance and control. In Computer Communication Review, Aug. 1988.

[Jai90] R. Jain. Congestion control in computer networks: issues and trends. IEEE Network, 4(3):24–30, May 1990.

[KC94] Vijay Karamcheti and Andrew A. Chien. Software overhead in messaging layers: Where does the time go? In Proc. of 6th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1994.

[KLC94] J. H. Kim, Ziqiang Liu, and A. A. Chien. Compressionless routing: a framework for adaptive and fault-tolerant routing. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 289–300. IEEE Computer Society Press, 1994.

[KMCL93] H. T. Kung, Robert Morris, Thomas Charuhas, and Dong Lin. Use of link-by-link flow control in maximizing ATM networks performance: Simulation results. In Proceedings IEEE Hot Interconnects Symposium '93, August 1993.

[KS91] S. Konstantinidou and L. Snyder. Chaos router: architecture and performance. In Computer Architecture News, pages 212–21, May 1991.
This work also received support through NSF Presidential Faculty
Fellowship CCR-92-53705 and LLNL Grant LLL-B283537-Culler.
References
[Aga91 ]
A. Agarwal. Limits on interconnection network performance. IEEE Transactions on Parallel and Distributed
Systems, vol.2(no.4):398-412,
Oct. 1991.
[BK94]
E.A. Brewer and B.C. Kuszmaul. How to get good performance from the CM-5 data network. In Proceedings
Eighth International
Parallel Processing Symposium,
pages 858–67. IEEE Comput. Sot. Press, 1994.
[BT89]
R.G. Bubenik and J.S. Turner. Performance of a broadcast packet switch. IEEE Transactions on Communications, vol.37(no.l):60–9,
Jan, 1989,
[CBLK94]
F.T. Chong, E.A, Brewer, F.T. Leighton,
and T,F.
Knight, Jr. Building
a better butterfly:
The Multiplexed Multibutterfly.
In Proc. International
Symposium on Parallel Architectures, Algorithms,
and Networks, Kanazawa, Japan, December 1994.
[LAD+
92]
avoidance and control.
Review, pages 314-29,
Software
the time
Support
Systems,
C.E. Leiserson, Z.S. Abuhamdeh, D.(2. Douglas, C.R.
Feynmann, M.N. Ganmukhi,
J.V. Hill, W.D. Hillis,
B.C. Kuszmaul, M.A. St. Pierre, D.S. Wells, M.C.
Wong, Shaw-Wen Yang, and R. Zak. The network architecture of the connection machine CM-5. In SPAA
’92. 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 272-85. ACM, 1992.
[Mar]
Richard Martin.
[RJ90]
K.K. Ramakrishnan
and R. Jain.
A binary feedback scheme for congestion avoidance in computer
networks. ACM Transactions on Computer Systems,
vol.8(no.2): 158-81, May 1990.
[SBB+ 91]
M.D. Schroeder, A.D. Birrell, M. Burrows, H. Murray, R.M. Needham, T.L. Rodeheffer, E.H. Satterthwaite, and C.P. Thacker. Autonet: a high-speed, selfconfiguring
local area network using point-to-point
links.
IEEE Journal on Selected Areas in Communications, vol.9(no.8): 13 18–35, Oct. 1991.
[CDG+ 931 David E. Culler,
Andrea
Dusseau,
Seth Copen
Goldstein,
Arvind Krishnamurthy,
Steven Lumetta,
Thorsten von Eicken, and Katherine Yelick.
Parallel programming
in Split-C. In Proc. Supercomputing
’93, Portland, Oregon, November 1993.
Personal Communication.
[CU194]
David E. Culler. Multithreading:
Fundamental limits,
potential gains, and alternatives. In R.A. Iannuci, G.R.
Gao, Jr. Halstead, R. H., and B. Smith, editors, Multithreaded Computer Architecture, chapter 6, pages 97–
138. Kluwer Academic Publishers, 1994.
[SS89]
S.L. Scott and G.S. Sohi. Using feedback to control
tree saturation in multistage interconnection
networks,
Symposium on Computer
In 16th Annual International
Architecture, pages 167–76. IEEE Comput. Sot. Press,
1989.
[Da190]
W,J. Dally, Virtual-channel
flow control, In Proceedings. The 17th Annual International
Symposium on
Computer Architecture,
pages 60-8. IEEE Comput.
Sot. Press, 1990.
[vE93]
Thorsten von Eicken.
Active Messages:
an E&cient Communication
Architecture ~for Multiprocessors. PhD thesis, University of Califclmia at Berkeley,
December 1993.
241