Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Nifdy

1995, Computer architecture news

A Low Overhead, High Throughput NIFDY: Timothy Callahan {timothyc, and Seth sethg}@cs Computer University Science of Copen .berkeley. network received by a processor in the order in which they were sent, even the packets out of order. We present results from ify NIFDY’S efficacy. simulations of a variety and fat trees) and traffic Our simulations throughput and decreases overhead. as a network’s bisection bandwidth DeliverY—which adapts the WAN style only if the destination is expected to be able to accept the packet. The When the network basic idea behind NIFDY is that each processors allowed to have at most one outstanding packet to any other processor unless the destination processor has granted the sender the right to send multiple unacknowledged packets. Further, there is a low upper limit on the number of outstanding packets to all processors. (meshes, tori, butterflies, and in-order solutions to MPP networks. In short, NIFDY performs admission control at the edges of the network; a packet is injected into the In this paper we present NIFDY, a network interface that uses admission control to reduce congestion and ensures that packets are delivers edu Division Flow-control network Goldstein California–Berkeley Abstract if the underlying Interface Network is running within its operating range, soft- ware overhead represents the largest cost in message transmission. Some of this overhead arises in matching the functionality of the network fabric to the application requirements, NIFDY removes the overhead required for reordering packets by delivering packets to the processor in the order in which they were sent. This allows network designers to exploit various techniques, e.g. adaptive routing, to increase network performance without imposing additional of networks patterns to ver- show that NIFDY increases overhead on the applications. In the rest of this section we explain The utility of NIFDY increases decreases. When combined underlying network our assumptions about the and present the basic design of NIFDY, In Sec- with the increased payload allowed by in-order delivery NIFDY increases total bandwidth delivered for all networks. The resources tion 2 we present the complete design of NIFDY, its implementation cost, and how it interacts with the processor and network. Section 3 needed to implement network size. describes the simulator which was used to verify the performance of NIFDY and Section 4 presents the results we obtained from it. In Section 5 we compare our approach to previous work on network design. In Section 6 we propose some extensions to NIFDY to handle unreliable networks and networks of workstations. 1 NIFDY are small and constant with respect to Introduction An efficient parallel interconnection computing. performance network Although is essential have increased dramatically, not efficiently for high-speed 1.1 processor speeds and raw network network interfaces have NIFDY is designed integrated these resources. Thus system performance tightly has not kept pace with the performance of the individual components. In this paper we present a network interface that more closely matches modern network characteristics while remaining independent of the network design. Aga91, BK94]. Researchers and MPP networks. deep networks have investigated The WAN this problem solutions with long messages and generally primarily Network for MPPs where the processors coupled by a fast. shallow interconnect. Unlike are previous work to increase performance of such systems, our approach does not presuppose a particular kind of network or router. We assume only that once the network has accepted a packet it will eventually be delivered to its destination, if processors continue to accept packets. (In Section 6 we show how NIFDY can be extended to handle unreliable networks.) On most networks proposed for MPPs, the main source of performance degradation is congestion. Congestion can be caused in two places: at the end-points and internally in the network fabric. End-point congestion arises when packets arrive at a node Interconnection networks deliver maximum performance when the offered load is limited to a fraction of the maximum bandwidth. We call this the operating range of the network. Many people have observed that when the offered load exceeds the operating range, throughput falls off dramatically [Jac88, Jai90, SS89, RJ90, KS9 1, both WAN The Underlying for are based on faster than the node can process them. use software pro- Internal congestion can arise for several reasons. First, a number of senders may combine to generate traffic that exceeds the network’s bisection bandwidth. tocols at the end points [Jac88, RJ90, KMCL93, SBB+91]. Most MPP networks, which are shallower and have shorter messages, either ignore the issue or control congestion in the network fabric Itself [CBLK94, Da191, LAD+ 92, Da190]. In this paper we propose a network interface called NIFDY—Network Interface with Second, hot spots in the network may cause unnecessary blocking Third, faults in the network may restrict and reduce utilization. the available bandwidth. Finally, end-point congestion can cause congestion internal to the network; we call this secondary Mocking. To handle hardware faults and transient congestion, many network topologies-e.g. fat trees and multibutterflies-provide multiple paths that spread out traffic between nodes. While networks with alternative paths can provide some congestion tolerance, they don’t solve the problem entirely and can even aggravate it. If there is no direct feedback to the sending node (as is the case in most, if not all, MPP networks), then backpressure is the only mechanism to Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copyin is by permission of the Association of Computing Machinery. ? o copy otherwise, or to republish, requires a fee and/or specific permission. ISCA ’95, Santa Margherita Ligure Italy @ 1995 ACM 0-89791 -698-0/95/0006...$3 .50 stop the sender from sending packets. In this case, adaptive routing 230 may fill up the network buffers along all possible paths between the sender and the bottleneck, one packet to it; no further packets will be sent until the destination causing extreme secondary blocking. processor wakes up and accepts a packet. At this point NIFDY will send an ack, allowing another packet to be sent. NIFDY also To avoid secondary blocking, when a node is sending to a blocked or overloaded receiver it must stop or slow its injection of packets into the network. Two schemes have been proposed to accomplish this: rate-based flow control (RBFC) and credit-based flow control incorporates an outgoing buffer pool which reduces head-of-line blocking in the network interface. Thus, if several messages are ready to go to different processors, they can be interleaved up to (CBFC). RBFC limits each sender to a rate that N known not to induce secondary blocking, assuming the receiver is pulling packets the limit of the OPT. To accommodate deeper networks out of the network. CBFC gives each sender a credit of packets that it can inject before secondary blocking will happen. The problem the NIFDY protocol has a transfer mode in which multiple unacknowledged packets can be in transit between two processors. In and large round-trip times, with both these schemes is that MPP traffic is bimodal—processors the case where the sender has multiple are usually sending either at full speed or hardly at all. With RBFC a fixed rate will not properly utilize the network; for instance, destination it can request a bulk dialog. If the receiver grants such a dialog in the ack, then the sender can send more than one packet the optimal the optimal per ack. By limiting the number and size of bulk dialogs a receiver will grant, we can again limit secondary blocking even for bimodal rate when only one sender is active is different from rate when all senders are active. While CBFC solves packets to be sent to a single this problem, it does not eliminate secondary blocking if many senders have accumulated credits and simultaneously send a burst of traffic. In short, NIFDY implements packets. In addition, between the sender terfaces that allows increased flexibility and the receiver or requires global information to be maintained the network, for maintaining limiting congestion and decreasing software overhead, NIFDY’S resource requirements increase with desired performance, not with while RBFC requires negotiation CBFC requires overhead in the credits, possibly on a per-receiver basis. These costs, combined with the bimodality of MPP traffic, have prevented designers from a simple extension to network in network in- design while the number of nodes in the machine, using RBFC or CBFC in MPP networks. The price of randomized routing techniques is that packets may be delivered out of order. Even meshes and tori using dimension- 2 order routing may deliver packets out of order if they utilize multiple virtual channels to alleviate congestion [Da190]. For mediumsized transfers on the CM-5, [KC94] showed that reconstructing the original transmission order accounted for as much as 3070 of the total transfer time. The Synoptics ATM Switch routes packets NIFDY is a network interface that increases system performance by decoupling the processor and the underlying network fabric. The processor sends packets by inserting theml into NIFDY, then NIFDY takes over and injects them into the network at the earliest opportunity, according to the protocol described below. NIFDY adaptively within the switch and then reorders them before they leave the switch. This link-by-link reordering increases the latency handles flow control, ordering of packets, unreliable networks, packet retransmission. of the switch by a factor of five [BT89]. From these observations we conclude that reordering should be performed only once, at the describe the basic design, the parameters to tune NIFDY to match the processor and the network, and the implementation costs of NIFDY. destination, The ideas in NIFDY can be added to any network and that if possible the reordering should be performed in hardware. 1.2 NIFDY bimodality and end-to-end of MPP traffic and, if extended In this section for we interface. keep packets in order and to provide access control. Every scalar packet is acked individually and bulk packets are acked using a interface that uses admission control to perform delivery Unit NIFDY distinguishes two types of network data packets, scalar and bulk. Scalar packets are best used for short messages while bulk packets are best used for large block transfers. In addition, NIFDY generates acknowledgment (ack) packets, which are used to in a Nutshell NIFDY is a network both in-order The NIFDY flow control. and ensure in-order To handle the delivery, sliding NIFDY has window protocol. The ack packets share the same network two communication modes. The default case is scalar mode in which only a single packet can be outstanding to a given destina- as the data packets, but are consumed by the receiving tion processor. any other processor. For every scalar packet sent, the destination processor number is recorded in an outstandingpacket table (OPT). To allow for higher bandwidth communication, a processor can request bulk mode which, if granted, gives the sender extra credits that can be used for communicating only with the granting processor. For every scalar packet sent, the destination processor number is recorded in an outstanding packet table (OPT). Until an acknowledgment (ack) is received from the destination and the entry in the the OPT is cleared, NIFDY will not inject any further packets Until an acknowledgment Furthermore, is received from the destination processor and the entry in the the OPT is cleared, NIFDY will not inject any more packets bound for that processor. However, if the OPT is not full, it can send a scalar packet to a different destination. Clearly, there is no way that packets can become disordered in the network if there is at most one outstanding packet between each sender/receiver pair at any instant. The basic flow control is also destined for that processor. However, if the OPT is not full, it can send a packet to another processor. This keeps packets from one processor to another in order. NIFDY. Each processor can send only one scalar packet at a time to evident: If the receiving node is ignoring the network, or for some other reason is not pulling packets out of the network rapidly, the sender will not get its ack and will refrain from sending any more to that node. This also provides a mechanism for congestion control within the network; if the packets or acknowledgements between a sender/receiver pair must cross a hot spot, the round-trip delay (and thus the delay between consecutive packets sent to the same destination) will increase, throttling the bandwidth of conversations and reducing congestion. The restriction of having only one outstanding packet may seem it reduces end-point congestion and adjusts to hot-spots, the bisection bandwidth, and possible faults. For shallow networks, the round-trip latency is smaller than the time it takes to inject a packet into the network and no extra latency is noticed even for consecutive sends to the same destination. By keeping the OPT small enough, we can adjust for network volume, ensuring that secondary blocking is reduced. If a processor is not responding to the network, each processor will send at most 231 From Processor To Processor I* A . 1 1 ( ,. 1 ‘nc”rning~ 1 v From throughput. through networks we are considering, 1 We are relying routing will Network 1: Block diagram of the NIFDY unit with support for bulk dialogs. excessive at first, but for the types of low-latency multiprocessor 1 I To Network Figure I r-’ver L_KL--- I 2.1 tightly-coupled it has little effect on on the fact that worrnhole Protocol Networks or cut- Implementation have different characteristics which affect the amount of traffic that they can handle before congestion reduces throughput. Thus, for best performance, NIFDY will have to be tuned for each be used, so that in the absence of contention, the head of the packet can often reach the destination before the tail has even left the source. Since the ack can be sent as soon as network. This is done by adjusting four parameters. the header of the incoming packet is processed, in many cases the sender will receive an ack for the packet it is currently sending. We 0: Size of outstanding will look at this issue more in Section 2.4. When a network has a high round-trip latency, sending multipacket messages as scalar packets may not fully utilize the network. We overcome this by sending multi-packet messages using the bulk protocol. In this protocol, the sender requests a bulk dialog, which, if granted, allows the sender to have more than one packet B: Size of the outgoing D: Maximum number of bulk dialogs each receiver can maintain simultaneously. outstanding to the destination. the multiple outstanding Although the network W: may deliver packets out of order, the receiving For most shallow NIFDY sender’s NIFDY. This reduces secondary blocking and increases throughput. Since both the packet and the ack have to traverse the any hot spot or network congestion back up all the way to the node’s network port; this has the key benefit that NIFDY can start sending to other ready destinations. By is the only way of telling size for the bulk dialog protocol. networks, the most important parameters are hold outgoing packets. As long as the OPT is not full, any eligible packet in the pool (we define eligibility below) can be sent. This allows the processor to interleave small packet streams for multiple processors. The parameters D and W determine the number and size of bulk will slow down both, delaying injection of more packets into the network. NIFDY usually reacts to a slow receiver or network congestion long before packets contrast, if backpressure buffer pool. O and B. If the OPT is large, then the processor can have more outstanding packets in the network. To reduce head-of-line blocking at the sending NIFDY unit, there can be a pool of buffers to puts them back in order before presenting them to the processor. Instead of piling up in the network, packets are blocked in the network, Receiver window packet table (OPT). dialogs. Each sender can maintain only one outgoing bulk dialog, although it can send packets in non-bulk mode to other destinations when to slow concurrently down, a sender will continue injecting packets to a slow receiver until its entrance to the network is blocked, at which point it is usually blocked from sending to any other destination. In fact, we expect that by reducing secondary blocking NIFDY will enhance the value of adaptive routing, since alternative paths will be available more often. with a bulk dialog. Each receiver can maintain D incoming bulk dialogs, each with a different sender. For each bulk dialog, W packet buffers are available in hardware at the receiver to provide storage for the sliding window protocol. 2.1.1 Scalar Packets Figure 1 is a block diagram of the NIFDY unit. (This figure also shows extensions for the bulk protocol, which will be explained later.) Packets enter NIFDY from the processor if there is an empty 1In fact, the number of outstanding messages per processor for these networks under lightly loaded conditions is often less than one [CU194]. 232 buffer in the outgoing pool. To maintain the correct transmission order of packets to the same destination, the rankleligibility writ ranks each packet in the pool relatwe to the other packets for the bulk mode on its own.3 Second, every packet includes its source node address. Finally, packets are delivered in the order in which they are sent. In order to utilize the bulk mode of NIFDY, the communication layer will have to turn on the bulk-mode request bit in the header same destination. The rank value indicates how many other packets there are in front of it. When a packet arrives at the pool, its rank is assigned based on the contents rank is one plus the number for the same destination. of the pool and the OPT: the of waltmg Whenever and outstanding of outgoing packets. The designer will have to decide what size transfers will request bulk mode. If the size is too small, the packets an ack from a processor N resources might go to the wrong sender. If too large, unnecessary received, all packets in the pool to the same processor have their rank decremented by one, bringing to zero the rank of the next delays will result. Since the NIFDY protocol packet to be transmitted sender, the sender’s address is encoded in the header of every packet. (making it “eligible”). requires an ack to be returned to the When the network can accept another packet, and there is a free entry in the OPT, and there is at least one eligible packet in the If this is exposed to the receive handlers, then the source node never needs to be included in the data portion of the packets. For instance, buffer pool, then one of the eligible packets is chosen for sending. The chosen packet is injected into the network, and the destination 51 % of the request messages in the Split-C library include the source processor ID in the message. The generic active message specifica- processor number is recorded in the OPT. Until an ack is received from that processor, no further packets are eligible for transmission tion requires that all request messages include the source ID [Mar]. In all these cases, the source ID required in the packet header by to it. Note that every packet must contain the source processor ID NIFDY could be put to good use. in its header so that the destination processor can return an ack. When a data packet is received from the network, it is inserted including into the arrivals FIFO buffer. an ack 1sretumed.2 Thus, NIFDY’S requirement the source ID in every packet does not actually overhead. Because the messages are delivered When it is accepted by the processor be accomplished without requiring of increase in order, large transfers can a round trip to initialize the destination processor’s data structures or buffers. The first message can initialize the destination processor while subsequent messages contain the data. The payload per packet is increased because later 2.1.2 The Bulk Protocol Figure 1 also shows how NIFDY handles bulk dialogs. The header packets need not include any bookkeeping information. If the extension in Section 6.1 is implemented, then messages that of each packet includes a bulk-request bit. A sender requests bulk mode by setting the bulk-request bit in the header of a non-bulk expect replies could be marked as not requiring packet. The receiver grants bulk mode to the sender by including reply itself would serve as an ack, The reply could also be marked a bulk dialog number in the ack it returns. A receiver may maintain multiple bulk dialogs, so it must give each active sender a different dialog number. If the receiver an ack. Instead, the as not needing an ack, reducing the overhead of acks to those cases where the sender is unsure whether the receiver can respond. can’t grant bulk mode because it is already participating in the maximum number of bulk dialogs, the ack will indicate the rejection to the requesting sender. In this 2.3 case, the sender will continue sending its data using scalar packets, and can continue requesting bulk mode, which may eventually be granted if a bulk dialog slot becomes available. When a node sends to a receiver that has granted it a bulk di- chip area, there are three sets of buffers and two content-addressable memories needed to implement NIFDY. The buffers can be implemented using single-ported RAM, taking up less area per bit than alog, it does not insert the receiver’s ID into the On, instead the rank/eligibility unit tracks the outstanding packets for the bulk dialog. The multiple outstanding packets may arrive at the receiving typical three-ported register files. Thus, the D W + B buffers needed can be implemented in a small space. In order to implement the outstanding packet table, a small NIFDY out of orde~ hardware buffers provide a place to store such packets until the intervening ones arrive. Packets that arrive in order only the tags, which must be long enough to contain the node iden- Aside from the control content-addressable are not held up and can be streamed to the processor immediately logic, which is relatively memory M required. small in terms of The memory contains usually more than sufficient. If we assume that 16 bits are enough for node identification (allowing 65536 different nodes), then we have a 16-bit by 8-entry content-addressable memory. The rank determination logic also requires a small CAM of size log the number orderrng information. A {sequence numbe~ dialog number} pair replaces the bits that would have been used as the source identifier. The NIFDY unit at the receiving end replaces the dialog number in the header with the source identifier before giving the packet to the of nodes (e.g. 16) bits by 1? (e.g. 8) entries. processor. A sender exits bulk mode by setting a bulk exit bit in the header of the last packet. A receiver can also terminate a bulk dialog in which case the transmission continues in scalar mode. Software Cost tifiers. The number of tags is equal to O, the mi~ximum number of outstanding scalar packets. As shown in Section 4.2, eight is via cut-through buffering. Sequence numbers, which need only be as large as W, are included in the header of each packet to provide 2.2 Implementation 2.4 Parameter Analysis Selection and Performance Initial estimates of the parameters for the NIFDY unit can be obtained by considering some parameters of the connected network and the expected traffic distributions. We will consider many distributions in Section 4. Here we give a flavor of how NIFDY would be tuned to a network by looking at network parameters and traffic between a Issues To get full performance out of NIFDY, the software communication layer must take into account three features of NIFDY. First, the processor must initiate bulk mode requests; NIFDY won’t attempt 3Of course, NIFDY could be extended to set the bulk-mode request blt automatically based on the locally observed traffic pattero; we have not 2An alternative, but surprisingly less effective, strategy is to send the ack earlier, when the packet is inserted into the arrivals FIFO. investigated 233 this posslblity in depth, injection time for W/2 + 1 packets. Parameter Meaning round-trip P the other half of the window.) d Number of nodes Distance to destination ~~e~d Packet payload in bytes Total time for processor to send packet (software T Tece8ve overhead) Total time for processor to receive packet (soft- (W/2 (i.e. hardware bandwidth + l)Trec.t.e W (3) > T,owzdtr,p(d) > 2(Tround,,,p(d)/T,,c,,o. - 1) We could instead have used a sliding window protocol in which every packet is acknowledged as it is received. In this case, to Total time for one packet to cross a link along the path from source to destination in the absence of contention of the last packet from in hops ware overhead) Tl,n~ (The + 1 is there because the time overlaps with the injection reach maximum limitation throughput we have W > Tround,jyp (d)/T,ece,ve (4) Tackproc on interpacket arrival times) Total latency involved in generating and process- D, the parameter controlling T roundtrtp ing ack Total latency ceiver, is normally set to one. However, in the unlikely event that the send rate is much slower than the receive rate, it would be de- from the time the header of the packet leaves NIFDY unit to the time the ack has been processed Table 1: Network characteristics influencing selection of NIFDY per re- larger windows) will give better performance with light traffic but may lead to excessive congestion when all processors try to send simultaneously. single source/destination pair separated by d hops. Table 1 defines the parameters we are using. Without the NIFDY unit, the maximum predictable bandwidth light traffic. between two nodes in the network Bandwidth = is w max(Tse~~, Treceive, TI,~k) expresses that the bandwidth can be limited Scalar Mode bottleneck, to network the critical network 2.4.3 as where 7&kP,0C + Tackproc Example Network Parameters In this section we try to estimate good NIFDY parameters for two specific networks. We will assume that the NIFDY processing takes 2 cycles at each end, for a total of T.CkP,oc = 4. We will also Tla, (d). The time from when a packet starts leaving until the ack is (d), can be calculated received and processed, defined as TTOU~d~~zP = 2&,(d) bulk packets will wait in the reorder buffers and not add congestion. parameter is packet latency. In most networks this latency is a function of d, the number of hops between the nodes, so we will write latency as T ,.wn&,,P(d) parameters will give better, more with heavy traffic, but may unduly restrict that a few extra packets will cause congestion more quickly. Also, a small bisection bandwidth means that excess packets are more likely to get blocked within the network, compounding congestion. Note that if a slow receiver, rather than bisection bandwidth, is the by the send Parameters When the NIFDY unit is included, More restrictive performance Network characteristics determine at what point generous NIFDY parameters lead to congestion. A small network volume means (1) overhead, the receive overhead, or the physical bandwidth. 2.4.1 of bulk dialogs sirable to increase D to the maximum point at which one receiver can handle D senders without falling behind. When choosing parameters for the bulk dialogs, performance under light traffic loads must be balanced with performance under heavy traffic loads. Less restrictive parameters (more bulk dialogs, parameters. which the number assume that the T,e~d is 40 cycles and Tr,c,,a. is 60 cycles. First we look at an 8-by-8 mesh using wormhole routing. Mul- (2) is the time it takes the NIFDY unit to generate and process the ack at both ends. Because the sending node must wait tiple virtual channels are not needed because it is a mesh, not a torus. The flit size used is one word (32 bits), and each flit buffer until holds at most two flits. Our simulated it gets the ack before sending the next packet to the same (d) node, packets can be sent no faster than once every T~~~~dt~tp cycles. To attain full bandwidth between two nodes separated by d hops using the basic NIFDY protocol (with no bulk dialogs), need T ,~ti~~t,,p(d) < maX(Tsen& Tk~~tv~, Tktk). 2.4.2 Parameters for Bulk we Since the limiting factor without the NIFDY unit would be the 60-cycle receive overhead, it is clear that the roundtrip latency of the basic NIFDY protocol will often be the limiting factor in pairwise bandwidth with an uncongestednetwork. Thus it appears that using 3 indicates that in order to hide a bulk dialog may help. Equation the maximum NIFDY roundtrip latency of 144 cycles, we will need a bulk window size of W > 2(T,oun&,tP (d)/Tr,ce,o, – 1). So we would want at least 2 packets, possibly 3 or 4 if we can afford to be generous. This wormhole mesh has an exceptionally low volume-eight Dialogs When bulk dialogs are included because pairwise bandwidth would be unnecessarily limited using the basic protocol, we can use similar calculations to decide the size of the window. For simplicity, we will assume that Trcce~~ e is the limiting factor. If this is not the case, then Tl,~k or Ts~~d would be substituted. We use a sliding window protocol in which acks are combined so that only one ack is sent for every W/2 packets. (Recall that W is the receiver window size.) In this case an ack will be sent only when all of the packets in that half of the window have arrived; to words per node (two words for each incoming link). Thus even if each node has only one eight-word packet in the network, the network will be full. This, combined with the mesh’s low bisection bandwidth of n, leads to a conservative decision regarding how avoid bandwidth restriction, this ack must get back to the sender before all of the packets in the other half of the window have been injected. The round-trip mesh had a one-way latency of Ttat (d) = 4d + 14. With uniform traffic, the maximum and average internode distances are 14 and 6 hops respectively; hence Equation 2 gives maximum and average roundtrip latencies of 144 and 80 cycles respectively. time must be less than or equal to the 234 many packets to allow on the network. 0=4, B=4, D=l, The other network An initial guess would have Operations and W=2. we will consider is a full 4-ary fat tree of 64 nodes. With three levels of routers, the maximum internode distance I Processor cv~ Active message send 46 Actwe message poll (no message) 22 is 6 hops, and the average distance is not much less than that. In this Active case Tla~ = 5d + 2, giving (dispatch, handle, return) 68 cycles. sufficient, a round-trip latency of 32 + 32+4 Thus it appears that the basic NIFDY protocol = may be and bulk dialogs will help only marginally. Our simulated fat tree’s volume is 10 buffers per node, much greater than that of the mesh. This large volume, along with the fattree’s large bisection bandwidth, means we can be less restrictive in + One-way latency (incl. software) from send to be~innin~ of handler Table 2: Measured allowing packets into the network. Thus, although bulk dialogs are only marginally useful, they probably won ‘t hurt much either, The main effort should be to reduce the restrictions on scalar packets as much as possible. This can be done by making the OPT large (O = 8 entries) and by making the buffer pool for waiting large (B = 8 buffers) to reduce head-of-line blocking. message receive CM-5 parameters I 56 d I I used in our simulato~ networks are demand-multiplexed over the same physical links in order to make use of all available bandwidth even when the traffic is unevenly divided between the two logical networks. With the packets CM-5 fat tree, the two networks are strictly time-lmultiplexed every other cycle, so that each network is limited to eight bits every two cycles regardless of the traffic on the other network. 3 The simulator Simulation ● Empirical results were gathered using a parallel simulator C++ and executed on a Thinking Machines ments, the simulated objects are distributed written in CM-5. In these experiacross the CM-5 nodes and connected using links provided by the simulator framework. Most simulation parameters are supplied at run time, allowing easy exploration of the design space. Each cycle is simulated explicitly jects; at any time in the simulation, and synchronously by all ob- all objects have executed up to ● The simulator supports the following not used in this paper4). Two- and three-dimensional b channels. were one byte wide for all simu- electromagnetic application [CDG+ 93]. ● Radix sort, which uses single-packetmessages for both counting the keys and transferring each key to its appropriate des[Dus94]. using either cut-through included but disabled. In this configuration, the extra buffering in the outgoing message pool and the arrivals queue of the NIFDY units can still be utilized. This allows us to separate the effects of or the NIFDY protocol itself from the benefit of simply having extra buffering. When comparing NIFDY to buffering only, the same total amount of buffering is always used, although in order to make the fairest comparison it is redistributed to be most effective for each routing. Fat tree more similar to the CM-5 [LADY an irregular When the NIFDY units are included in the simulation, all NIFDY parameters are adjustable. An option allows the NIFDY units to be The size in each dimen- 92]. Routers in the first two levels are connected to two parents rather than four, case, reducing bisection bandwidth tree. Also, the link bandwidth arrivals queue is at most two packets; without the protocol, best performance results from allocating at least half of the total buffering resources to the arrivals queue. Of course, the acks used in the NIFDY protocol are included directly in the simulations, competing as compared to a full 4-ary fatwas reduced to 4 bits per cycle as in the CM-5 network. ● EM3D, worm- run-time parameters. Links lations reported here. 1-byte links, ● tination meshes and tori utilizing channels, and buffer sizes are all store-and-forward ● networks (as well as others sion, the number of virtual 4-ary fat tree with The cyclic-shift all-to-all communication lpattem described in [B K94]. This and the following two traffic patterns use the CMAM and Split-C libraries from the CM-5, and thus use six-word packets (for all networks, not just the CM-5 imitation). network simulation and the computation is simplified because only polling message reception is allowed; thus the computation always initiates interaction with the network. with virtual Pseudo-random, bursty traffic. Burst length distributions are adjustable, global barriers can be inclutied between send bursts, and nodes can programmed to enter ‘non-responsive’ periods during which they neither send paclcets nor pull them from the network interface. Dedicated state for each pseudorandom number generator ensures that the same sequence of node is allowed to run ahead (in simulated time) up to the next point where it interacts with the network. Synchronization between the hole routing traffic loads. bursts is generated regardless of network and NIFDY configuration used. Packet size is eight words including header. the same point. The only exception is when real Split-C programs are driving the simulator. In this case, the network simulations on the CM-5 nodes are still synchronous, but the computation on each ● supports the following Multibutterflies, with adjustable dilation and radix. In this report we use a butterfly (d~lation 1, radix 4) and a mtdtibutterfly (dilation 2, radix 4). With all topologies with the NIFDY protocol, with data packets for network For realistic timings a real CM-5 to estimate well as CM-5 network summarized in Table 2, All topologies support two logically independent networks, the request network and the reply network, m order to deal with fetch deadlock. For example, other than the CM-5 fat tree, the two 4The simulator n available at ftp://ftp.cs.berkeley.edu/pub/packages/nlfdy/nifdy.html 235 the capacity of the bandwidth. on our simulations, we ran several tests on packet sending and receiving overheads as latency and bandwidth. These parameters, agree closely with those reported in [vE93]. 4 Results 4.1 Synthetic Workload To learn which NIFDY parameters were best for which networks and to measure the overall effectiveness of NIFDY’S flow control, we ran many simulations for each network. Because performance at both heavy and light network loads is important, we used two different traffic patterns forthese runs: onewhich rewards graceful handling of heavy traffic loads, and one which rewards rapid packet delivery under light traffic. Both traffic patterns consist ofphases separated by bafiers. A node that is sending during a phase will attempt to send its packets (typically 100t03000f them) asquickly as possible. Processors send single- or multi-packet messages; all the packets in a single message aresent consecutively andtothe samedestination. Atthe end of a message, a sender randomly chooses a new destination and message length and immediately starts sending to the new destination. To ensure that every node makes progress sending, no node can start the next phase until all sending nodes complete F;ll Fat Tren Str*wrd Fat Tree 26 Mesh Zi Torus 3-D Mesh Bult;rfly Mu-ilbuttefrly Figure 2: Performance benefit from fiow controJ of NIFDY for different networks: packets delivered in 1,000,000 cycles. “Heavy” Does not reflect additional benefit of in-order synthetic traffic. the current delivery phase. As with real MPPbulk-synchronous applications, if some nodes are favored by the topology and are able to send their outgoing data quickly, chi-5 Fat Tree from NIFDY. sooner or later they will have to wait until the other nodes catchup. In abulk-synchronous tom line is how quickly each communication application the betphase is completed; thus our metric is thenumber ofpackets delivered within a fixed number of cycles. Note that this metric measures only the benefit of reduced network congestion; the in-order delivery provided by NIFDY Will give an additional bandwidth benefit, dependent upon the application (see Section 2.2). In the heavy traffic pattern, all nodes send each phase, and message lengths — the number of consecutive packets a processor sends to its destination before changing to a new destination — are uniformly chosen from one to five packets. In the light traffic pattern, each node has only a 33% chance of sending each phase, reducing contention in the network. Since nodes are less likely to poll during light traffic, our simulated nodes periodically ignore F;ll Fat Tree the network; these periods of ignoring the network are triggered pseudo-randomly and independently for each node. With light traffic the message length distribution includes lengths of 10 and 20 packets; most messages are short, but long messages account for set of parameters for that network, The graph compares packet throughput for each case, showing the benefit just from the reduced network congestion allowing more packets to get to their destinations. For the networks that deliver packets out of order, the actual NIFDY will likely in more payload be greateq NIFDY’S in-order per packet in multi-packet :D Mesh Butt-tily Miinlbuttetfly additional benefit of in-order els only three hops, resulting in very low round-trip latency, and there are no alternative paths between nodes, making congestion avoidance more critical. delivery messages, and can also reduce the receive processing time. The best NIFDY parameters, chosen to give the best average performance with both test traffic patterns, are shown in Table 3. The ideal NIFDY parameters for the fat-tree variations are less restrictive than those for the meshes; fat trees have greater width, greater volume, and more alternative paths so that having a few extra packets in the network much as it does with the mesh. The CM-5 network 2-D Torus windows than the full fat tree even though the round-trip latency is twice as grea~ this is because of the CM-5 network’s smaller volume and bisection bandwidth, which makes congestion a more important factor. Finally, observe that the buttertly is the only network where it is best to have no bulk dialogs: every packet trav- ous networks under both traffic loads, comparing no NIFDY; buffering only (without the NIFDY protocol); and NIFDY using the best of Zti Mesh synthetic traffic. Does not reflect delivery from NIFDY. the same receiver) and unresponsive receivers. Figures 2 and 3 show the performance benefit of NIFDY for vari- can result Str/&vrd Fat Tree Figure 3: Performance benefit from flow control of NIFDY for different networks: packets delivered in 1,000,000 cycles. “Light” more packets overall. Thus the light traffic benchmark mainly measures pairwise bandwidth with only some contention in the network and some possibility of target collisions (multiple nodes sending to benefit Ck-s Fat Tree 4.2 Scalability Is it necessary to increase the size of the OPT or the outgoing buffer pool (O or B) as the number of nodes in the network gets larger in order to maintain the same relative benefit from NIFDY? This bisection bandbetween nodes, does not hurt as has smaller bulk would be an undesirable finding, since we would like NIFDY to be scalable—we don’t want to have to make all the NIFDY units bigger when we increase the number of nodes in our MPP. 236 Maximum NIFDY parameters n.,(d) (packets per processor) w B Full Fat Tree 5d+2 10 1 4 8 8 CM-5 Network 9d+2 6.5 1 4 8 8 1 8 16 8 ‘D o (L+l)d+3 10 8x8 Mesh, 1 virt. ch., 2-flit buffers 8x8 Torus, 2 virt. ch., 4-flit buffers 4d+ 14 1 4 1 2 8 4 4x4x4 Mesh, 2 virt. ch., 4-tiit buffers 4d+ 18 6 1 2 8 4 6 10 0 0 8 8 I 4 8 8 Store-&-Forward Table 3: Volume Network Fat Tree 4-ary Butterfly 5d+2 4-ary Multibutterfly 5d+2 Characteristics of simulated 64-node networks aJorrg with best NIFDY parameters for each network, used in Figure 2 and 3. d is the number of hops, L is packet length in bytes. Vatymg Pool Size and Network Size Varying OPT S!ze and Network 4.3 Size Pza- In this subsection 1 4! / “ ‘\ f= 125 j 12Q.‘~. ‘\ The C-shift ,/’ z E112 ; pattern, the cyclic communication pattern consists of P – 1 phases. In the first phase, processor i sends to processor (i + 1) mod P; in phase p, processor i sends to processor (i + p) mod P; until 1 ?0 /’ h4 p = P – 1. As long as the phases remain separate,, each receiver is matched with exactly one sender. However, as observed in [BK94], 1 105 ~ $2 we consider a specific traffic shift (C-shift) studied in [BK94], which provides a~l-to-all communication. We implemented this traffic pattern using the “real traffic” interface to our simulator and the CM-5–style network in order to make comparisons with [BK94]. 135 ~130 m Cyclic Shift 128 Number of Prmessms Figure 4: Throughput trees. 256 ‘“oh--k-+---k Number.1 some nodes may finish the current phase early and move to the next phase, resulting in one node receiving from two senders. This slows the progress of both senders, allowing other senders to catch up and Processors aggravating the condition. Figure 5 shows the number of packets in the network for each receiver as the pattern progresses, clearly for various O and B on various sized fat- To answer this question we ran some simulations indicating the accumulation of packets outside certain receivers. One solution used in Strata [BK94] is to insert global barriers between phases. of the full fat tree, using only short messages and no bulk dialogs Results are summarized in order to When NIFDY’S in-order delivery than optimized is exploited, barriers. the benefit is even greater. These results can be explained by looking at Figure 5. Some piling up does occur with NIFDY(due to the different path lengths between different pairs of nodes), but these perturbations dissipate and the network returns to even utilization of all receivers. This dissipation occurs because the “rightfuf” sender to a receiver has the advantage that it owns the bulk dialog to that receiver. Thus it will be allowed to finish rapidly and move on to the next receiveq at that point the sender behind it can attain the bulk dialog. machine size for different values of B. In general, increasing B gives better performance for any size network. However, for a fixed l?, the relative benefit of NIFDY does not decrease and in most cases increases as the size of the MPP grows. This result means that a system designer can choose B once, depending on the desired performance and cost, and then can expect to maintain performance benefit even as the MPP scales to large sizes. in Figure 6. Using NIFDY’S congestion control alone results in better performance concentrate on the effects of O and B. The first part of Figure 4 shows throughput (normalized to a network without NIFDY) vs. that The second part of Figure 4 shows normalized throughput vs. machine size for different values of O. The most important thing Although this effect is dependent on NIFDY parameters and network characteristics, in all cases performance than with nothing at all. to see from this graph is that O = 8 is the best parameter across all machine sizes except for the largest we looked at (where the best value of O is 4). was much better with NIFDY While these figures may differ depending on network volume and other factors, we do expect NIFDY performance to stay constant or 4.4 increase as the network size grows—while keeping the same small fixed parameters in NIFDY. In fact, the results should be even more favorable on networks in which the bisection bandwidth does not scale linearly with the number of nodes, such as a two-dimensional mesh. In these cases the per-node bandwidth would have to de- EM3D, a program for solving electromagnetic problems in three dimensions and a common parallel benchmark [CDG+ 93], was also used to drive our simulations. The results of our simulations for a number of different networks are summarized in Figure 7 (for the light network load) and Figure 8 (for the heavy network load). crease as the machine size grows in order to avoid congestion at the bisection, making smaller values of O and B more desirable. For networks that deliver packets out of order, two NIFDY results are presented: one which gives the benefit just from the flow control (“NIFDY-”), and another m which the Split-C library that interfaced to our network simulator was altered to take advantage of the inorder delivery provided by NIFDY. For networks that deliver packets 237 EM3D Wtthout NIFDY, no barriers Time x 100,000 With NIFDY, cycles one dialog, Figure7: EM3Dcycles periteration with less communication. (In the computation graph generated by the parameters, most arcs no barriers are local to processors.) n-nodes = 200, dnodes = 10, locaLp 80, dist-span =5. NIFDY– reflects benefit from tlow control NIFDY expJoits in-order delivery as well. -o Figure 2 1 4 5 Time x 100,000 3 5: Network congestion 7 6 8 = only; 9 cycles with C-shift: pending packets per receiver without and with NIFDY (no bm”ers in either case). Shading is inte~olated between white for no pending packets and black for 20 or more pending packets. packets are transfened, In both cases, the same number of but NIFDY finishes earlier Figure 8: C-shift I ,Ca_ ;0 4- Cslm 3 Bandwidth I I 1 ----- . \ ‘-”x.,, .-. ....,,, Without in-order delivery, the buffers-only configurations ___ ~ -----------,,,,,.,, ,.. ---......,,,,,.,,, ..,, ,.,----- .-%.,.,,,, ..,,,,,,.,, +- g the library only; intended for the difference between NIFDY and is negligible. Once the library takes advantage of the in-order delivery provided by NIFDY, h outperforms the buffers-only configuration in all cases. . .. =4im with more communication. in order (the 2D mesh and the butterfly), in-order delivery was used for all runs. B~ . 5 cycles per iteration 3, dist-span = 20. NIFDY– reflects benefit from flow control NIFDY exploits in-order delivery as well. 1 , : <,,0 EM3D (In the computation graph generated by the parameters, most arcs are between processors.) nnodes = 100, d-nodes= 20, local-p = .\ .. .. .. . ,. , m ~. .. ,-. -.-.-, ,. ICa .2”’ “v”’ tL I ,“1 1 ,m I Blo.k2~ze Figure6: 4.5 .’ . .. Throughput I ,W 1 I ,Mm (byt~$ for C-shift on32-node Radix Sort Finally, we ran simulations of a radix sort based on [Dus94]. Each iteration of radix sort consists of two communication phases: scan and coalesce. In the scan phase, a scan addition is performed across all processors for each bucket; this involves nearest-neighbor communication. The most notable feature of this is that the overall CM-5network, communication 238 phase runs faster if delays are inserted between reordering .8 8 the packets that otherwise would arrive out of order. Finally, we saw that on most networks many of the communication patterns can be sped up by carefully crafted software techniques. Without either the techniques patterns ran poorly. With or NIFDY, many of these NIFDY, intelligent software techniques were useful, but they were not as important. In general, NIFDY provides a safety net when software network management techniques are not or cannot be applied. 5 Related Work Flow control (FC) and congestion control (CC) have received m,uch attention in LAN and WAN research. The larger packets and longer latencies in these types of network make software implementaF;ll Fat-Tree CM-5 Fat-Tree F;ll Fat-Tree StrlFwrd Fat-Tree CM-5 Fe4.Tree StrlFwrd Fat-Tree tion of FC and CC protocols practical. Our method, being simple enough to implement in hardware, provides FC and CC for the type of low-latency, high-bandwidth networks where software protocols are not practical. Specifically, NIFDY does not require any intelligence within the network switches and it does not require nodes to keep per-receiver connection or credit state. In multiprocessor networks, the need to reduce software com- Figure 9: Cycles for one scan phase of radix sort “With Delay ’’indicates that artificial delays are inserted between consecutive sends to the same destination. successive sends. Without delays, the sends from one processor cause the next processor in the pipeline to continually receive with no chance to send, serializing munication overhead while making good use of network width has inspired many attempts to “raise the functionality the entire scan. We studied versions network’’-usually by reducing congestion or providing bandof the in-order both with and without the delay. In the coalesce operation, the keys are sent to the appropriate destination using one message for each packet delivery. Most of these projects have taken a different approach from NIFDY’S: they have added functionality to the network key; routers rather than just to the network interfaces. The METRO router [DCB+ 94] provides in-order assuming a random initial key distribution, the one-packet messages are sent to a random sequence of destination processors. Figure 9 shows results for the scan phase using an 8-bit radix delivery while taking advantage of random wiring in expansion networks. The router is a dilated crossbar and is used as a building block for indirect expander networks such as multibutterflies and metabutter- on 64 processors. While adding delays between successive sends helped in all cases, it was more critical when NIFDY was not included. When NIFDY is included, its protocol causes the sender to slow down; this allows all the processors to continue to send as flies [CBLK94]. A sender attempts to make a connection rando:mly through successive dilated crossbars; if a connection attempt is blocked, the path is tom down, and the connection is retried later. well as receive. Networks with higher latencies, e.g., the store and forward fat tree, get a bigger gain from NIFDY, than those with lower Once a connection is established, it remains fixed, and thus transfers are in order. The cost of blocked connection attempts means latencies, like the full fat tree. This exemplifies what we found in many cases: the locally restrictive NIFDY protocol actually results in more global throughput. in-order that METRO must make sure that most connection attempts succeed; thus it is important to have large bandwidth throughout the network, probably much more than is needed to carry the average load. NIFDY allows network utilization closer to its theoretical maximum, while preventing the user from pushing the network out of delivery is not beneficial. On the other hand, NIFDY’S restrictiveness did not hurt performance. its operating range. While METRO requires nontrivial intelligence at the transfer endpoints, its key characteristics arise from its router Results for the coalesce phase (not shown) were virtually idenThere was not enough congestion tical with and without NIFDY. for NIFDY’S flow control to help, and with this algorithm design. NIFDY, in contrast, can be used with a variety of networks. 4.6 Compressionless Discussion delivery. Routing CR, which (CR) [KLC94] relies on wormhole also provides in-order routing, pads packets NIFDY performs well for three real communication patterns which form the basis for many parallel programs. The NIFDY protocol may with enough space to ensure that pushing the entire packet onto the network implies that the head of the packet has already en- seem restrictive, but NIFDY’S admission control reduces congestion in the network. Our results show that it delivers more packets than tered the destination, at which point the packet is guaranteed to be completely consumed. If the packet cannot be pushed out within the same network without NIFDY, and roughly the same as when NIFDY’S buffering is used without the protocol. When NIFDY’S inorder delivery is taken into account, NIFDY is seen to give a clear benefit for all shallow networks. NIFDY helps different networks in different ways. For networks with a single path between each sender and receiver, packets are a preset amount of time, the transmission is aborted and the flits already in the network are killed. Abstractly, there are some simWith both ilarities between CR and the basic NIFDY protocol. there can beat most one unreceived packet in the network between any source/destination pair. In addition, CR also uses an ack, albeit an implicit one—lack of backpressure—which travels from the destination to the source on the switch-level ack control wires. already delivered in-order, so NIFDY gives no additional benefit in that respect; however, such networks degrade rapidly in the presence of congestion, which NIFDY helps avoid. For networks with multiple does routes not routes between degrade around hot as rapidly spots. each sender and receiver, because packets In such cases, our implementation some differs markedly. bandwidth, but there The explicit is no need acks to add padding to short packets. The tearing down of packets due to blockage causes instability at high network load: the average amount of bandwidth consumed per successful transfer increases, wastefal performance can travel NIFDY’S However, in NIFDY consume alternate main benefit is 239 making the congestion worse. In contrast, with a medium-sized OPT were shown to be sufficient due to the small round-trip latency. For deeper networks, we added bulk dialogs to overcome this latency. Here we extend NIFDY for networks our method performs best under high load and prevents the network from being pushed into a regime of declining throughput. NIFDY is fairly insensitive of workstations which may drop packets. Our goal is to make the network transparent to the application and for it to be scalable. To handle networks that drop packets the sender must be able to retransmit packets. In addition, the receiver must be able to distinguish and eliminate duplicate packets. To accomplish retransmission we add one timer and one message buffer per entry to our preset parameters; with CR, a poorly chosen timeout period may drastically affect performance (although our extension to NIFDY for handling dropped in this respect). is very general since it is logically separate it can be used with wormhole, cut-through, or from packets will have the same sensitivity NIFDY the network; store-and-forward routing, and can be added to an existing network with no change to the network itself. In contrast, CR can be used only with wormhole interfaces routing, in the OPT and per outgoing to support the killing of packets. NIFDY, can be used with networks However, bulk packet. The outgoing packet is copied into the buffer and the timer is set when the packet is sent. and it requires the network routers and If an ack is received before the timer expires, the timer is reset and CR, unlike the buffer is freed for future use. If no ack is received before the that are not deadlock-free. Finally, there are many software techniques that can be used to reduce network congestion [B K94]. These techniques, such as timer goes off, the packet is retransmitted. To distinguish duplicate packets, one additional bit in the header is enough for both scalar structuring communication as series of permutations allowing oneHowever, NIFDY on-one transfers, are beneficial even with NIFDY. wdl add robustness to the system and be especially effective with traffic patterns that are difficult for software to manage, in particular those with no global structure. In some cases the behavior of and bulk packets. This simple extension—a NIFDY with with irregular regular interleaves packets implements which traffic mimic For to different bandwidth it receives processor will communication. destinations. which packets techniques NIFDY ceptional condition (viz., the dropping of a packet), which should reduce software overhead at both the sender and the receiver. Ac- used automatically cording And NIFDY effectively matching—injecting acks, is pulling the software example, packets is the rate at which out of the network. injection to [KC94], receiving at the rate at this should messages by to 3070 reduce the cost of sending and st)~o. the receiving NIFDY also handles 6.3 the more general case with multiple nodes sending to one receiver, returning acks only at the rate at which the receiver accepts packets. This throttles the combined bit in each header and some additional NIFDY to hide the implestate and buffering on each NI—allows mentation details of the network from system software and user applications alike. We have used simple hardware to mask an ex- Further In addition rate of all the senders to a level Experiments to the extensions proposed above, we believe that we that the receiver can handle. It would be difficult and expensive to implement such dynamic bandwidth matching in software. have just begun to understand how the network parameters affect the throughput and latency of messages on the network. While we have a good understanding of the O and P parameters and how they interact with traffic patterns, we have yet to study the interaction 6 bulk dialog. among transfer lengths, W, and the optimal Future Work We also plan to extend the simulator 6.1 point for requesting a to study how NIFDY interacts with adaptive routing on a mesh, which in the past has not performed Changes to ack strategy well enough to justify There are two changes to the current protocol that we would like to study: allowing acks to be combined with reply messages, which and in-order delivery its expense. Adding the admission control of NIFDY may help adaptive routing reach its potential. should reduce network traffic; and allowing packets that don’t require acks. In the protocols described in this paper, the ack packets are the user code will always generated by NIFDY. In many situations also send a reply message to the source processor. Instead of In this paper we have proposed a network sending both a NIFDY-generated ack and a user reply we could piggyback the ack in the reply. This seems to be a good idea, since if the sender M waiting for a reply it probably won’t have any other increases network performance and decreases software overhead without restricting routing choices in the network, We have shown that it is possible to achieve these goals simultaneously by adding 7 packets for the destination processor until the reply is received. Adding this protocol requires only an additional bit in the header that packets does not care about will not contribute the risk). Such and would be handled packets would just that no acknowledgements co-exist with traffic like delivery be eligible be sent. and knows (or is willing modest resources only at the network the to take at the receiver except This type of traffic could the NIFDY protocol. verified 6.2 Networks of Workstations We have shown how NIFDY increases works for MPPs. For shallower performance networks of reliable scalar packets interface, interface NIFDY, and without which having credit- based scheme where every sender implicitly has one credit for each receiver. Because senders record only receivers with zero credits rather than maintaining state for all receivers, the resources consumed at each sender scale with the number of outstanding packets rather than the total number of nodes. Because credits are good only for a particular processor, the protocol can easily adapt to bimodal MPP traffic. We built a general-purpose simulator to test these ideas, We to be sent immediately, scalar packets would obeying in-order to congestion and Conclusion to push any functionality throughout the network. In essence, the basic NIFDY protocol is an optimized and a comparator in the “to processor” block in Figure 1. NIFDY could be configured so that the processor indicates when it wants to bypass the NIFDY protocol. This could be done when the processor Summary net- combined 240 the simulator against a real machine, used the simulator to evaluate network interface attached meshes, tori, network, and all synthetic butterflies, the performance to a variety the CM-5, advantage of network and fat trees. and real traffic fabrics, We showed patterns, that and then of a NIFDY including on every NIFDY increased packet throughput more increased total patterns payload since delivered we saw increases in EM3D) having to a level comparable In addition, buffers. packets on all networks. 10% (under more in order, [Da191] W.J. Dally. Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, vol.40(no.9’): 101 6–23, Sept. 1991. [DCB+94] A. DeHon, F. Chong, M. Becker, E. Egozy, H. Minsky, S. Peretz, and Jr. Knight, T.F. Metro: a router architecture for high-performance, short-lhaul routing networks. In Proceedings the 2 Ist Annual International Symposium on Computer Architecture, pages 266–77. IEEE Comput. Sot. Press, 1994. [Dus94] Andrea Carol Dusseau. Modeling parallel sorts with LogP on the CM-5. Technical Report UCB//CSD-94829, University of California at BerkAey, May 1994. [Jac88] V. Jacobson. Congestion In Computer Communication Aug. 1988. [Jai90] R. Jain. Congestion control in computer networks: issues and trends. IEEE Network, vol.4(no.3):24-30, May 1990. [KC94] Vijay Karamcheti and Andrew A. Chien. overhead in messaging layers: Where does go? In Proc. of 6th Int. Confi on Architectural for Programming Languages and Operating San Jose, CA, october 1994. [KLC94] J.H. Kim, Ziqiang Liu, and A.A. Chien. Compressionless routing: a framework for adaptive and faulttolerant routing. In Proceedings the 21st Annual International Symposium on Computer A rchitecture, pages 289-300. IEEE Comput. Sot. Press, 1994. [KMCL93] H.T. Kung, Robert Morns, Thomas Chaaruhas, and Dong Lin. Use of link-by-link flow control in maximizing atm networks performance: Simulation results. In Proceedings IEEE Hot Interconnects Symposium ’93, August 1993. [KS91] S. Konstantinidou and L. Snyder. Chaos router architecture and performance. In Computer Architecture News, pages 212-21, May 1991. it For real traffic hght to as much as 1007o (under all-to-all added Using from to that of having added NIFDY delivers loads presented transfers) over just buffers. we also showed the simulator by NIFDY are constant (or decreasing) that the resources needed with respect to the number in the network. In particular, for all the networks studied, an outstanding packet table of size 8 combined with a packet pool of nodes of 16 and a single bulk dialog with a window than enough resources for even large machines. of 8 were more In fact, on most networks fewer resources than these gave better results. Thus, given the performance needed over plain advantages of NIFDY, the small additional network interfaces is a worthwhile chip area investment. Acknowledgments We are grateful comments. to the anonymous We would referees also like to thank for Krste their valuable Asanovi6, Eric Brewer, David Culler, Andrea Dusseau, Steve Lumetta, Klaus Erik Schauser, Nathan Tawil, and John Wawrzynek for their comments on earlier versions of this paper, and Su-Lin Wu for her contributions to early stages of this work. Computational support at Berkeley was provided CDA-8722788. by the NSF Infrastructure Seth Copen Goldstein is supported Grant number by an AT&T Graduate Fellowship. Timothy Callahan received support from an NSF Graduate Fellowship and ONR Grant NOOO14-92-J-1617. This work also received support through NSF Presidential Faculty Fellowship CCR-92-53705 and LLNL Grant LLL-B283537-Culler. References [Aga91 ] A. Agarwal. Limits on interconnection network performance. IEEE Transactions on Parallel and Distributed Systems, vol.2(no.4):398-412, Oct. 1991. [BK94] E.A. Brewer and B.C. Kuszmaul. How to get good performance from the CM-5 data network. In Proceedings Eighth International Parallel Processing Symposium, pages 858–67. IEEE Comput. Sot. Press, 1994. [BT89] R.G. Bubenik and J.S. Turner. Performance of a broadcast packet switch. IEEE Transactions on Communications, vol.37(no.l):60–9, Jan, 1989, [CBLK94] F.T. Chong, E.A, Brewer, F.T. Leighton, and T,F. Knight, Jr. Building a better butterfly: The Multiplexed Multibutterfly. In Proc. International Symposium on Parallel Architectures, Algorithms, and Networks, Kanazawa, Japan, December 1994. [LAD+ 92] avoidance and control. Review, pages 314-29, Software the time Support Systems, C.E. Leiserson, Z.S. Abuhamdeh, D.(2. Douglas, C.R. Feynmann, M.N. Ganmukhi, J.V. Hill, W.D. Hillis, B.C. Kuszmaul, M.A. St. Pierre, D.S. Wells, M.C. Wong, Shaw-Wen Yang, and R. Zak. The network architecture of the connection machine CM-5. In SPAA ’92. 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 272-85. ACM, 1992. [Mar] Richard Martin. [RJ90] K.K. Ramakrishnan and R. Jain. A binary feedback scheme for congestion avoidance in computer networks. ACM Transactions on Computer Systems, vol.8(no.2): 158-81, May 1990. [SBB+ 91] M.D. Schroeder, A.D. Birrell, M. Burrows, H. Murray, R.M. Needham, T.L. Rodeheffer, E.H. Satterthwaite, and C.P. Thacker. Autonet: a high-speed, selfconfiguring local area network using point-to-point links. IEEE Journal on Selected Areas in Communications, vol.9(no.8): 13 18–35, Oct. 1991. [CDG+ 931 David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel programming in Split-C. In Proc. Supercomputing ’93, Portland, Oregon, November 1993. Personal Communication. [CU194] David E. Culler. Multithreading: Fundamental limits, potential gains, and alternatives. In R.A. Iannuci, G.R. Gao, Jr. Halstead, R. H., and B. Smith, editors, Multithreaded Computer Architecture, chapter 6, pages 97– 138. Kluwer Academic Publishers, 1994. [SS89] S.L. Scott and G.S. Sohi. Using feedback to control tree saturation in multistage interconnection networks, Symposium on Computer In 16th Annual International Architecture, pages 167–76. IEEE Comput. Sot. Press, 1989. [Da190] W,J. Dally, Virtual-channel flow control, In Proceedings. The 17th Annual International Symposium on Computer Architecture, pages 60-8. IEEE Comput. Sot. Press, 1990. [vE93] Thorsten von Eicken. Active Messages: an E&cient Communication Architecture ~for Multiprocessors. PhD thesis, University of Califclmia at Berkeley, December 1993. 241