Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 25, 465-479 (2009) Deadlock Detection and Recovery for True Fully Adaptive Routing in Regular Wormhole Networks SOOJUNG LEE Department of Computer Education Gyeongin National University of Education Anyang, Kyunggi-do, 430-804 Korea Deadlock detection and recovery-based routing schemes for wormhole networks have gained attraction because unlike deadlock avoidance-based schemes, they do not restrict routing adaptability. In order to alleviate the overhead of running a recovery procedure, the studies on deadlock detection have focused on the accuracy of deadlock detection, trying to reduce the number of false detections. This paper proposes both deadlock detection and recovery schemes. The proposed detection scheme is based on the turn model and designed to declare only one packet per simple cycle of blocked packets as deadlocked. Our recovery scheme adjusts the time-out value flexibly according to the utilization rate of the recovery resources, rather than fixing a single time-out value as in previous schemes. As a consequence, it not only prevents saturation of the recovery resources by deadlocked packets but also reduces congestion of normal buffers at heavy loads. Simulation experiments show that the proposed deadlock detection scheme significantly reduces the number of false deadlock detections over previous schemes for low to moderate time-out thresholds. It is also found that the proposed recovery scheme prevents overloading of the recovery resources, yielding better network performance. Keywords: multicomputer, wormhole routing, deadlock detection, deadlock recovery, adaptive routing 1. INTRODUCTION Direct communication in multicomputer networks has been used for executing tasks in parallel to achieve better throughput. Wormhole routing has been preferred in multicomputer networks, since it requires smaller buffer requirements and the message latency is less sensitive to the distance from the source and destination, leading to lower message latency for the network with little contention [9]. All the algorithms discussed in this paper focus on wormhole routing. In wormhole routing, a packet is split into several flits for transmission. A header flit leads the route and the remaining flits follow in a pipelined fashion. Since a router is provided with a few flit buffers only, a packet may reside in several routers simultaneously. Therefore, wormhole routing is susceptible to deadlock, where each packet in a set of packets requests a channel resource held by another packet in the set in a circular way. Deadlock avoidance has been a traditional approach in handling deadlock problem, where routing is restricted so that it could fundamentally prevent the occurrence of cyclic dependency between channels. For example, the turn model [4, 16] prohibits turns that may form a cycle. However, such design of routing algorithm results in low adaptivity and increased packet latency. A deadlock avoidance scheme by Duato [3] exploits virtual Received June 7, 2007; revised May 14, 2008; accepted July 3, 2008. Communicated by Chung-Ta King. 465 466 SOOJUNG LEE channels which are divided into two classes; one, referred to as the escape channel, is used for a routing free from cyclic dependencies and the other for fully adaptive minimal routing allowing cyclic dependencies among packets. A packet proceeds on the fully adaptive virtual channels until it blocks, where it moves onto the escape channels. As reported in [11], deadlock rarely occurs if sufficient routing freedom and multiple virtual channels are provided, so it is wasteful to limit adaptivity, as in deadlock avoidance approaches. This motivated a new approach, deadlock detection and recovery [5, 9]. In contrast to deadlock avoidance schemes, this approach provides fully adaptive routing by its nature. However, current deadlock detection schemes cannot distinguish between real deadlocked and simply blocked packets, yielding many false deadlock detections. Another problem is that in principle all the packets involved in cyclic dependency can be presumed as deadlocked, although it is sufficient to select only one for recovery. This might lead to unignorable recovery overhead. The fundamental difficulty of deadlock detection and recovery schemes lies in determining the time-out value; there is no single time-out value that satisfies various network conditions. This paper proposes a simple but efficient idea whose objective is to select only one packet in a simple cycle of blocked packets in most cases under a fully adaptive routing. The basic idea is that only the packets making certain turns are eligible for checking for the existence of any potential deadlock. It is found from simulation experiments that the proposed strategy effectively reduces the number of deadlock detections compared to previous schemes, which implies that it also reduces the number of false deadlock detections, since most detected deadlocks are in fact non-existent. Deadlock recovery schemes are in general classified into two groups, regressive and progressive schemes [8, 10]. In order to resolve deadlock, regressive schemes simply kill deadlocked packets and re-inject them into the network after some delay. On the other hand, progressive recovery schemes do not kill but allow deadlocked packets to keep progressing toward their destinations [15]. One of the progressive recovery mechanisms, named Disha [10], provides buffers other than normal flit buffers, named deadlock buffers, which are centralized at routers. Those buffers form a dedicated recovery path for deadlocked packets to follow by preempting network bandwidth from undeadlocked packets when necessary. Another progressive technique proposed in [8] absorbs deadlocked packets at the current node and later reinjects them. In order to minimize performance degradation, it is recommended not to saturate recovery resources such as the centralized buffers suggested in [10]. So, lower deadlock detection rates are preferable. However, we found through simulation experiments that lower deadlock detection rates do not always lead to better network throughput at saturated loads, when recovery resources are provided as in [10]. We observed that utilization of the recovery resources to some extent leads to better performance in a heavilyloaded situation, since it directs blocked packets to those resources, relieving congestion of normal flit buffers. Therefore, we propose a deadlock recovery scheme that flexibly adjusts the time-out according to the utilization rate of the recovery resources, rather than fixing a single value as in previous schemes. The scheme is proved through simulation to effectively relieve congestion of normal flit buffers as well as saturation of the recovery resources by reflecting the network status on the time-out value. The rest of this paper is organized as follows. Section 2 introduces the proposed deadlock detection and recovery schemes. In section 3, performance of the proposed DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS 467 schemes is evaluated through simulation experiments and compared with that of previous schemes. Section 4 concludes this paper. 2. THE PROPOSED DEADLOCK DETECTION AND RECOVERY MECHANISMS 2.1 Deadlock Detection The proposed scheme detects deadlock for n-dimensional direct regular networks with true fully adaptive minimal routing that have directions associated with the channels. The basic idea comes from the observation that deadlock involves packets waiting on each other cyclically. This property is also exploited by previous methods where certain turns are prohibited to avoid deadlock [4, 16]. Regarding deadlock detection, this implies that it is enough to examine only those packets making certain turns for potential deadlock. Inactivity time of the channels requested by those packets is measured to check for potential deadlock. By restricting the eligibility of packets to be examined for deadlock, a significant reduction of the number of false deadlock detections can be achieved. Definition 1 Let REQP be the set of all the next feasible output channels requested by a packet P according to the routing algorithm. A feasible output channel is one that leads a packet closer to its destination. Let OCCP be the set of all channels occupied by a packet P. A blocked packet P is dependent on a packet Q if REQP ∩ OCCQ ≠ ∅. ‰ Definition 2 A blocked packet P is said to be transitively dependent on a packet Q if either of the following conditions is true: (i) there exists a packet R such that P is dependent on R and R is dependent on Q. (ii) there exists a packet R such that P is dependent on R and R is transitively dependent on Q. ‰ A packet is either advancing or blocked. The latter is further classified as follows. Definition 3 A packet P is temporarily blocked if (i) all of its next feasible output channels are occupied by some other packets and (ii) at least one of the packets on which P is dependent is advancing. ‰ Definition 4 A packet P is completely blocked if (i) all of its next feasible output channels are occupied by some other packets and (ii) none of the packets on which P is dependent is advancing. ‰ Note that a deadlocked packet is sure to be completely blocked but the opposite may not be the case. 2.1.1 n-dimensional meshes To identify the types of turns in cycles of blocked packets, we focus on directions of packet movement. For nD meshes, the following notations are used. 468 SOOJUNG LEE Notation 1 Let Di denote dimension i for i = 0, …, n − 1. Then directions Di- and Di+ represent the negative and the positive directions along dimension i, respectively. ‰ Fig. 1 illustrates a cycle of blocked packets changing their directions in 2D meshes. A cycle is pictured in two different representations. Fig. 1 (a) shows a channel dependence graph [12] at a particular point of time, with arcs labeled with packet IDs depicting packet progressions or blocking relation and vertices representing physical channels. Fig. 1 (b) depicts only turns made by packets, so named turn-based graph. A vertex in the turn-based graph represents a direction and a directed edge (d1, d2) indicates that some packet is turning or waiting to turn from direction d1 to direction d2; the edge is labeled with that packet. In the figure, m1, holding channel c1, is waiting to change its direction from D1+ to D0+ and is dependent on m2. In the turn-based graph, m2 and m5 are not shown, since they are not involved in any turns. (a) A channel dependence graph. (b) A turn-based graph. Fig. 1. An illustration of a cycle in 2D meshes with two different representations. In general, a blocked packet in the turn-based graph of a cycle may be transitively dependent on a packet ahead of it in the graph. Note that there is no one-to-one correspondence between the two types of graphs in Fig. 1, since the latter focuses on packet turns only. Hence, a turn-based graph may represent several different network states. It is observed that the turn-based graph representation of a cycle in nD meshes has the following properties. P1. Since 180-degree turns are not allowed in our assumption, (Di-, Di+) and (Di+, Di-) are not included in the graph. P2. The graph includes a cycle such that P2-1. the cycle includes at least four vertices and their associated edges and P2-2. if the cycle includes vertex Di-, it also includes vertex Di+, and vice versa. Now we need to select common types of turns that are included in every cycle, in order to detect deadlock. In [4], it is said that prohibition of n(n − 1) turns is sufficient to prevent any deadlock in nD meshes. However, care must be taken so that the selected turns cover all possible cycles in nD meshes. In our scheme, turns from Di+ to Di+1-, Di+1+, …, Dn-1-, and to Dn-1+, for i = 0, …, n − 2, are selected. Note that the total number of turns selected as such is n(n − 1). Theorem 1 Any cycle in nD meshes includes at least one of the turns from Di+ to Di+1-, Di+1+, …, Dn-1-, and to Dn-1+, for i = 0, …, n − 2. DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS 469 Proof: By contradiction. Consider a m-dimensional cycle, for m = 2, …, n, which excludes all of the specified turns. Fig. 2 illustrates the turn-based graph representing the cycle. The m dimensions are renamed and listed in an ascending order, i1, i2, …, im. As seen in the figure, the graph contains no cycle of m dimensions and it violates Property P2-2. ‰ Fig. 2. A turn-based graph including all edges except those corresponding to the selected turns in nD meshes. To describe the proposed deadlock detection scheme in detail, a bit of a packet header, named turn bit, is allocated to indicate whether the next routing direction of a packet corresponds to one of the selected turns. If the turn bit of a packet is set and all of its next feasible output channels have been inactive for time-out, the packet is presumed as deadlocked. Note that the turn bit of a packet may be set at a router but the packet may be presumed as deadlocked at another router on its way to the destination. Deadlock Detection Scheme for nD meshes: 1. If the next routing direction of a packet corresponds to one of the turns from Di+ to Di+1-, Di+1+, …, Dn-1-, and to Dn-1+, for i = 0, …, n − 2, set its turn bit to true. 2. If a packet has been completely blocked for the time-out interval and its turn bit is set true, the packet is presumed as deadlocked. ‰ 2.1.2 k-ary n-cubes For k-ary n-cubes, in addition to cycles in meshes, there can be a cycle involving wraparound channels. This cycle may not include the selected turns mentioned previously. A simple modification is made to the scheme to detect such deadlock; when the next routing direction of a packet uses a wraparound channel, set its turn bit true. This strategy is more than enough to detect all deadlocks involving wraparound channels. More sophisticated method can be devised to reduce the number of turn bit settings. However, we choose simplicity rather than complexity not to degrade the router performance. Deadlock Detection Scheme for k-ary n-cubes: 1. If the next routing direction of a packet corresponds to one of the turns from Di+ to Di+1-, Di+1+, …, Dn-1-, and to Dn-1+, for i = 0, …, n − 2, set its turn bit to true. 2. Otherwise if the next routing direction of a packet uses a wraparound channel, set its turn bit to true. SOOJUNG LEE 470 3. If a packet has been completely blocked for the time-out interval and its turn bit is set true, the packet is presumed as deadlocked. ‰ 2.2 Deadlock Recovery A time-out value is an important factor affecting the performance of a deadlock handling scheme. Especially for Disha Concurrent [10], prompt deadlock recovery depends on the number of deadlocked packets which is primarily determined by the time-out value, since deadlocked packets are routed through the path of deadlock buffers concurrently. Given too many deadlocked packets, the path is overloaded, degrading performance. To investigate this correlation in detail, we conducted simulations for a wide range of time-out values. Fig. 3 (a) shows the result. It is noted that the deadlock buffer (DB) utilization ranging from 0.05 to 0.11 yields the best throughput. Therefore, we may conjecture that it is possible to improve throughput by managing the time-out flexibly so that deadlock buffer utilization be within some range. To verify our conjecture, we conducted simulation for the same network conditions and varying initial time-out values. As shown in Fig. 3 (b), for each initial time-out value, one experiment is conducted without adjusting the time-out during simulation and another experiment with adjusting the time-out to maintain deadlock buffer utilization in the range of 0.05 and 0.11. It is seen that adjusting the time-out flexibly during simulation yields better throughput regardless of the initial time-out value. 0.5 0.5 Throughput DB Utilization 0.4 0.3 0.3 Rate Throughput 0.4 0.2 0.1 Without Adjusting TO With Adjusting TO 0.2 0.1 0 0 100 200 300 400 500 0 0 100 200 300 400 Time-out Initial Time-out (a) (b) 500 (a) Fixed time-out values during simulation. (b) Throughput improvement by time-out values adjusted flexibly during simulation. Fig. 3. Deadlock buffer utilization and/or throughput vs. time-out value, offered traffic rate = 0.45. The above observation motivates us to develop an idea that adjusts the time-out value flexibly by periodically checking the deadlock buffer usage. Specifically, if the deadlock buffer usage is high, increase the time-out value. Otherwise, consider either of the two situations; the network is lightly loaded or the time-out value is too large in a heavily-loaded network. In the latter case, a shorter time-out would facilitate dissipating congestion as it decreases utilization of normal buffers. To realize this idea of managing the time-out flexibly, named FLEX-TO recovery scheme, it is required to estimate network congestion status and determine the proper rate of deadlock buffer utilization. The time-out needs to be modified only when the network is congested. There have been several approaches developed for measuring and controlling network congestion [1]. We DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS 471 take one of the useful approximations of congestion in the literature that is obtained by measuring busy output channels at each node [8]. When the network is presumed as heavily loaded, our recovery scheme adjusts the time-out value, thus realizing the selected proper utilization rate of deadlock buffers. The scheme utilizes some predetermined values as follows. 1. Set a threshold for output channel utilization rate (THCH). 2. Set a threshold for deadlock buffer utilization rate (THDB). 3. Set a minimum time-out value (TOmin) and a maximum time-out value (TOmax). 4. Set a time-out update period (I). 5. Set a time-out increment/decrement value (VAL). After determining the above parameters, FLEX-TO recovery scheme runs the following steps at each router at every time-out update period. 1. Calculate the current utilization rate of output channels (CURCH). 2. If CURCH < THCH then exit. 3. Calculate the current utilization rate of the deadlock buffers (CURDB). 4. If THDB − CURDB > α for some α > 0 and TO (current time-out value) > TOmin then decrease TO by VAL. 5. Otherwise if CURDB − THDB > α and TO < TOmax then increase TO by VAL. In steps 4 and 5, if α is too small, the scheme is very sensitive to the change of deadlock buffer utilization rate, thus frequently updating the time-out value. On the other hand, a large α rarely changes the time-out value if the network is stable. 3. PERFORMANCE 3.1 Simulation Model Performance comparison is mainly made with a previous deadlock detection scheme in [7] in combination with two deadlock recovery schemes, Disha Concurrent and Disha Sequential [10]. Disha Sequential requires sequential access to the recovery path; only one deadlocked packet at a time can proceed through the recovery path after capturing a token. To our knowledge, the scheme in [7] is most sophisticated and efficient in terms of the number of deadlock detections. In order to view performance difference between various paradigms of routing strategies, we also experimented on those routing schemes most studied in the literature through simulation. Namely, we evaluated the fully adaptive Duato’s avoidance-based scheme [3], the Negative-First routing algorithm, and Compressionless Routing [6], a true fully adaptive regressive recovery routing scheme. Duato’s scheme has been widely studied as a typical example of deadlock avoidance algorithms and shown to exhibit superior performance over other existing avoidance routing algorithms [5, 14]. As a result, it has been practically accepted in real systems such as the Cray T3E [13] and Reliable Router [2]. The Negative-First algorithm is partially adaptive and shown to yield the best results among the Turn Model schemes dis- 472 SOOJUNG LEE cussed previously [4]. Compressionless routing recovers from deadlock by simply killing the packets involved and later reinjecting them into the network after some delay. All simulations are conducted at the flit level. The routing algorithm can use any minimal path to forward a message toward its destination. Performance comparison is made on two important metrics, the ratio of packets presumed as deadlocked (simply, the ratio of deadlocked packets) and throughput. Throughput is measured as normalized accepted traffic in flits per node per cycle. Considering the sizes of current multicomputers and widely-used topologies in the simulation study, we simulated the schemes on 8x8x8 meshes and 8-ary 3-cubes. A node is equipped with one injection and reception channel at the network interface. A physical channel between nodes is shared by three virtual channels of buffer depth of two flits. It is assigned to a virtual channel in a demand-slotted round robin fashion. We used the true fully adaptive routing except for Duato’s scheme [3], which requires one virtual channel for avoiding deadlock in mesh networks, thus two virtual channels remaining for true fully adaptive routing. For 8-ary 3-cubes, the scheme requires one more channel for deadlock avoidance, leaving only one channel for true fully adaptive routing. It is assumed that both routing time and transmission time of a flit across a channel equal one clock cycle. Also it takes one cycle to transfer a flit from an input buffer to an output buffer unless there is congestion. The crossbar switch allows multiple messages to traverse it simultaneously. Each simulation result was obtained after running the program sufficiently long enough for the network to operate in a steady state or become saturated. We discarded the data collected during the initial transient period, i.e., before the network enters into a steady state. We simulated the packet length of 32 flits and the uniform, perfect-shuffle, and bit-reversal distributions of packet destinations. Packets are generated with the injection rate varying from low load to saturation and exponentially distributed, where the same rate is applied to all nodes. For the simulation of Disha Sequential, the token propagates at twice the router clock speed with no impact on router delay, as also assumed in [10]. The central deadlock buffers for Disha Sequential have the size of two flits. For Disha Concurrent, the router is provided with two types of deadlock buffers, both of two-flit size, using the Hamiltonian-path with shortcuts [17]; one in the increasing order for those packets with the destinations of higher ids and the other in the decreasing order for those packets with lower destinations. 3.2 Tuning the Parameters In this section, we select appropriate values for those parameters required by FLEXTO recovery scheme discussed in section 2.2 for best performance. Fig. 4 shows output channel utilization rates vs. throughput for different time-outs for the uniform traffic patterns. Output channel utilization is measured as the ratio of the number of virtual channels occupied by packets. It is noted that the network begins to be saturated when the utilization rate is over 0.5 for 8-ary 3-cubes. For meshes, the rate is approximately 0.4, regardless of the time-out. Similar results are obtained in [8] where nine out of eighteen virtual output channels are busy at saturation regardless of message length for a 8 × 8 × 8 torus. Therefore, we selected 0.5 as the threshold for output virtual channel utilization rate (THCH) to activate FLEX-TO recovery scheme for torus networks. For meshes, the selected threshold rate is 0.4. DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS Virtual channel utilization 0.4 0.2 0 TO=16 TO=64 TO=256 TO=1024 0.6 Virtual channel utilization TO=16 TO=64 TO=256 TO=1024 0.6 473 0.4 0.2 0 0 0.1 0.2 0.3 0.4 0.5 Throughput (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Throughput (flits/node/cycle) (a) 8-ary 3-cubes. (b) 8 × 8 × 8 meshes. Fig. 4. Output channel utilization for varying time-out value of the proposed deadlock detection scheme with Disha Concurrent recovery scheme. Next we determine a proper threshold for deadlock buffer utilization rate (THDB). As mentioned in section 2.2, the utilization rate is calculated as the mean ratio of the number of the deadlock buffers occupied by deadlocked packets per cycle. We used various threshold values ranging from 0.05 to 0.11 for 8-ary 3-cubes, as obtained from the simulation results in section 2.2, and found that throughput differences are almost negligible even at high network load. For meshes, the range of thresholds was chosen from 0.03 to 0.11, which was found to yield better throughput than with the other thresholds. Although any threshold within this range produces almost the same throughput, the thresholds from 0.07 to 0.10 yield slightly higher throughput. Among these values, we selected 0.08 as a threshold for deadlock buffer utilization rate in the results presented next. As discussed in section 2.2, the time-out is adjusted when CURDB (current deadlock buffer utilization rate) differs from the threshold by α. After examining the threshold range yielding better throughput, we chose 0.02 for α which is approximated difference between the threshold (0.08) and each limit value of the range. Now consider the time-out update period (I) and update value (VAL). Intuitively, with a large I, the scheme slowly reflects the status on the time-out, thus rather insensitive to the change of network status. A large VAL will fluctuate the deadlock buffer utilization rate sharply, as well as the time-out. To study their effects on performance, we simulated our schemes for heavy traffic load; recall that FLEX-TO recovery scheme is triggered only when the network load is over the predetermined threshold. The experiments were conducted for (10, 5), (50, 5), and (100, 10), where the first value of the pair indicates I and the second one VAL. It was noted that the scheme reflected the network status most promptly with (10, 5), whereas the scheme tended to make relatively infrequent time-out change with the larger I. It turned out that throughput difference after applying each of these parameter values was minimal. We selected (50, 5) as values for I and VAL, respectively. For TOmin and TOmax, we selected 4 and 1024 cycles, respectively, considering the range of time-out values, 16 to 1024 cycles, used in our experiments. 3.3 Simulation Results This section presents simulation results of several schemes under various network conditions: the scheme in [7] together with Disha Sequential [10] (LPZ-DS), that with Disha Concurrent [10] (LPZ-DC), our proposed deadlock detection scheme with Disha SOOJUNG LEE 474 Concurrent (PRP-DC), our detection scheme with FLEX-TO recovery scheme (PRPFLX), the fully adaptive Duato’s avoidance-based scheme (DUATO) [3], the NegativeFirst routing algorithm (TURN), and Compressionless Routing (COMPL) [6]. The performance of PRP-DC is studied to fairly investigate the efficiency of the proposed deadlock detection scheme in terms of the number of deadlock detections as compared with the scheme in [7] when the same recovery scheme, Disha Concurrent, is applied. We measured the ratio of packets presumed as deadlocked by each deadlock detection scheme for varying packet injection rate (offered traffic) and time-out value (TO). Offered traffic loads are normalized with respect to the network’s maximum wire capacity, defined as all of the network channels transmitting simultaneously. Fig. 5 shows the results for 8 × 8 × 8 meshes for uniform traffic patterns. For PRP-FLX, the specified time-out values are given initially to the simulator and adjusted during simulation according to FLEX-TO recovery scheme. The schemes behave similarly in that the ratios are almost zeros for normal network loads but increase substantially for heavy loads. In general, the ratio is inversely related with the time-out in every scheme except PRP-FLX. Ratio of Deadlocked Packets 0.5 Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0 0 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) 0 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 0 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) (a) (b) (c) (d) Fig. 5. Deadlocked packet ratio in 8 × 8 × 8 meshes for uniform traffic patterns when (a) TO = 16; (b) TO = 64; (c) TO = 256; and (d) TO = 1024 cycles. It is noted that PRP-FLX yields the lowest ratio for TO of 16 cycles, but the highest ratio for the large TO of 1024 cycles at high loads. This is because of the property of FLEX-TO recovery scheme, as it flexibly adjusts the time-out based on the deadlock buffer utilization. That is, for a small time-out in congested networks, the deadlock buffers tend to be heavily loaded, thus having the time-out increased. Consequently, packets would be presumed as deadlocked less frequently than the other schemes. On the other hand, for a large time-out in congested networks, the opposite situation occurs. COMPL performance varies most depending on the time-out, as shown in Fig. 5. The obvious reason is that it adopts the loosest condition for deadlock detection, i.e., simply killing packets that have not progressed for time-out at the sources. Among deadlock recovery schemes, LPZ-DC seems more sensitive to the time-out for saturated networks, as its range of ratios for different time-outs is the largest. Specifically, at saturation (rate of 0.225), the ranges are 0.10, 0.11, and 0.04, for LPZ-DC, LPZ-DS, and PRP-DC, respectively. For highly saturated situation (rate of 0.25), they are 0.37, 0.18, and 0.21 in that order. In terms of the number of deadlock detections, PRP turns out to be more efficient than LPZ, as the ratio for LPZ is as much as 2.5 times higher for TO of DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS 475 16 cycles at saturation. Even for the large TO of 1024 cycles, LPZ detects approximately 1.2 times more deadlocks than PRP. Note in Fig. 5 that LPZ-DS yields lower ratio than LPZ-DC and PRP-DC at highly saturated networks (rate of over 0.25) for relatively short TOs of 16 and 64 cycles. The reason is as follows. DC enables concurrent recovery of deadlocked packets proceeding along the deadlock buffers. On the contrary, DS allows only one packet to use the recovery path, since a packet for DS must wait for the token for recovery. During the waiting time, there is a higher chance that the packet presumed as deadlocked may obtain a next feasible channel and become unblocked, since a network with a fully adaptive routing rarely enters into deadlock. In case of larger TOs of 256 and 1024 cycles, such effect is diminished, since the time-out is sufficiently long enough for blocked packets to proceed, substantially decreasing the number of deadlock detections for LPZ and PRP. Fig. 6 plots the ratio of deadlocks for 8-ary 3-cubes. The schemes behave almost similarly as in meshes. However, the ratios are lower in general than those for meshes, due to the higher adaptiveness property of tori. Note that the ratios for PRP-DC are still lower than those for LPZ-DC in every result, even though PRP enforces loose criteria for deadlock declaration for tori than for meshes. Specifically, the ratios for LPZ-DC are 2.1 to 3.2 times higher than those for PRP-DC for the TO less than 1024 cycles at saturation. Simulation experiments are also conducted with non-uniform traffic patterns for 3D tori. Fig. 7 depicts the deadlocked packet ratios for perfect-shuffle and bit-reversal traffic patterns for the TOs of 16 and 256 cycles. They tend to increase at lower offered traffic loads, but are in general lower than for uniform traffic patterns, except for PRP-FLX. The ratios quickly level off at high loads regardless of time-outs and traffic patterns. For the TO of 256 cycles, LPZ and PRP-DC yield almost no deadlocks. Different from the uniform traffic pattern case for the TO of 16 cycles, PRP-DC results in the lowest ratio for both non-uniform traffic patterns. It is noted that PRP-FLX consistently yields the ratio of 0.1 to 0.15 at heavy loads for all traffic patterns and all time-outs presented. Fig. 8 plots normalized throughput of each scheme in 3D meshes. It is observed that the schemes except TURN and COMPL perform almost the same for low and moderate loads but differ at high loads. In particular, PRP-FLX generally performs better than the other schemes for all time-outs. This is because PRP-FLX manages to allocate the channel bandwidth properly by not overloading the recovery path. Especially, such management takes effect for the TO of 16 cycles, as PRP-FLX significantly outperforms all the other schemes, yielding from 1.9 to 6.68 times higher throughput at highly offered traffic of 0.25. One exception is COMPL which tends to sustain its performance at high loads for low TOs of 16 and 64 cycles. This is mainly because it kills sufficient number of packets for the network to return to normal condition. However, for larger time-outs, COMPL performance degrades more significantly than the other schemes, demonstrating its degree of sensitivity to time-out, which is also addressed in [6, 10]. It is noticed in Fig. 8 that LPZ-DS performs worst at heavy loads. This is attributed to its sequential recovery procedure. That is, although it results in lower ratio of deadlocked packets than LPZ-DC or PRP-DC in some cases as shown in Fig. 5, only one deadlocked packet at a time is recovered, thus taking much longer time for recovery. For instance, the recovery time of a deadlocked packet for LPZ-DS is approximately 1.98 times longer than that for LPZ-DC, for the TO of 64 cycles at the offered traffic of 0.25. Unless the network is significantly heavy, TURN yields the lowest throughput, due to its restriction on routing adaptiveness. SOOJUNG LEE 476 Ratio of Deadlocked Packets Ratio of Deadlocked Packets 0.5 0.5 0.4 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 Ratio of Deadlocked Packets Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0 0 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) (a) (b) (c) (d) Fig. 6. Deadlocked packet ratio in 8-ary 3-cubes for uniform traffic patterns when (a) TO = 16; (b) TO = 64; (c) TO = 256; and (d) TO = 1024 cycles. Ratio of Deadlocked Packets 0.5 0.4 Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL Ratio of Deadlocked Packets 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0.4 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0 0 0 0.1 0.2 0.3 0.4 0.5 0 0 Offered Traffic (flits/node/cycle) 0.1 0.2 0.3 0.4 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX COMPL 0 0 Offered Traffic (flits/node/cycle) 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) (a) (b) (c) (d) Fig. 7. Deadlocked packet ratio in 8-ary 3-cubes. Perfect-shuffle traffic patterns when (a) TO = 16 and (b) TO = 256 cycles. Bit-reversal traffic patterns when (c) TO = 16 and (d) TO = 256 cycles. Accepted Traffic (flits/node/cycle) 0.5 0.4 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN Accepted Traffic (flits/node/cycle) Accepted Traffic (flits/node/cycle) 0.4 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0.4 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN Accepted Traffic (flits/node/cycle) 0.5 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0 0 0 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) 0 0.05 0.1 0.15 0.2 0.25 Offered Traffic (flits/node/cycle) (a) (b) (c) (d) Fig. 8. Throughput in 8 × 8 × 8 meshes for uniform traffic patterns when (a) TO = 16; (b) TO = 64; (c) TO = 256; and (d) TO = 1024 cycles. Interestingly, throughput for the TO of 1024 cycles is lower than that for the TO of 256 cycles for LPZ-DC and PRP-DC at high loads. This is because those schemes redirect blocked packets to the deadlock buffers less frequently for the longer time-out, thus being unable to quickly relieve normal flit buffers from congestion. For the lower TOs of 16 and 64 cycles, PRP-DC outperforms LPZ-DC, although they use the same recovery scheme. This is attributed to the lower ratio of deadlocked packets by PRP-DC, as seen in Fig. 5. If the ratios are too high, deadlocked packets would obviously encounter considerable delay when routed along the deadlock buffers. Fig. 8 shows that DUATO sur- DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS 477 prisingly outperforms all the progressive deadlock recovery schemes except PRP-FLX for all time-outs. Its behavior is similar to that of PRP-FLX, as its throughput tends to drop slightly after saturation. This indicates that efficient use of the deadlock buffers is critical to network performance. Fig. 9 plots the performance of each scheme in terms of throughput in 3D tori. The schemes behave almost similarly as in meshes, although they show higher peak throughput. In general, PRP-FLX continues to outperform all the other schemes, especially for large time-outs and high loads. One of the main differences from the results shown in Fig. 8 is that DUATO performs relatively worse. The obvious reason is that DUATO requires two virtual channels to avoid deadlock for tori, leaving only one channel for fully adaptive routing, whereas it uses two channels for fully adaptive routing for meshes. TURN also performs worse than in meshes because it provides non-minimal routes in tori, as discussed in [6]. Another difference is that LPZ-DS performs comparably to LPZ-DC and PRP-DC for the TO of 1024 cycles. This is due to its ratio of deadlocked packets as shown in Fig. 6 (d), where all the three schemes generate almost no deadlocked packet, obviously leading to virtually no performance difference. In the figure, COMPL performs better than DUATO for the TO less than 1024 cycles, while DUATO is slightly better for the TO of 1024 cycles at high loads over 0.35. This result matches with that in [5, 10]. Moreover, DUATO significantly outperforms TURN for all the configurations, which is also demonstrated in [10]. Accepted Traffic (flits/node/cycle) 0.5 0.4 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN Accepted Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0.5 0.4 Accepted Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0.5 0.4 Accepted Traffic (flits/node/cycle) 0.5 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) (a) (b) (c) (d) Fig. 9. Throughput in 8-ary 3-cubes for uniform traffic patterns when (a) TO = 16; (b) TO = 64; (c) TO = 256; and (d) TO = 1024 cycles. Accepted Traffic (flits/node/cycle) 0.5 0.4 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN Accepted Traffic (flits/node/cycle) Accepted Traffic (flits/node/cycle) 0.5 0.4 0.5 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0.4 LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN Accepted Traffic (flits/node/cycle) 0.5 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0 0 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) LPZ-DS LPZ-DC PRP-DC PRP-FLX DUATO COMPL TURN 0 0 0.1 0.2 0.3 0.4 0.5 Offered Traffic (flits/node/cycle) (a) (b) (c) (d) Fig. 10. Throughput in 8-ary 3-cubes. Perfect-shuffle traffic patterns when (a) TO = 16 and (b) TO = 256 cycles. Bit-reversal traffic patterns when (c) TO = 16 and (d) TO = 256 cycles. 478 SOOJUNG LEE Fig. 10 presents throughput of each scheme for non-uniform traffic patterns for 3D tori. The schemes behave very similarly as in the uniform traffic patterns. TURN shows relatively better performance in perfect-shuffle patterns than in the other patterns. This is also the case for DUATO, since its performance difference from PRP-FLX is much less than in the uniform patterns. In particular, DUATO performs comparably to COMPL for the TO of 256 cycles in the non-uniform patterns, while it is outperformed by COMPL in the uniform patterns. It is noticed that PRP-DC does not degrade significantly for the TO of 16 cycles, but yields performance competitive with PRP-FLX in the figure. In fact, for large time-outs, the progressive deadlock recovery schemes exhibit little difference in throughput. 4. CONCLUSIONS This paper proposed efficient deadlock detection and recovery schemes for true fully adaptive routing in wormhole networks. Deadlock detection is based on the turn model and intended to mark only one packet as deadlocked in a simple cycle. Our deadlock recovery scheme operates by adjusting the time-out value flexibly according to the utilization rate of the recovery resources, rather than fixing a single time-out value as in previous schemes. Simulation experiments were conducted to show that the proposed deadlock detection scheme significantly outperforms previous schemes in terms of the number of packets detected as deadlocked for low to moderate time-out intervals. Performance study also shows that the proposed deadlock recovery scheme yields better throughput over previous ones and deadlock avoidance-based routing schemes. Although simulation results for only 8 × 8 × 8 meshes and tori are presented in this paper, similar results were obtained for larger and smaller networks such as 16 × 16 and 4 × 4 × 4 meshes. The main advantage of the proposed scheme is that it virtually removes the need for determining a best time-out value satisfying various network conditions. REFERENCES 1. E. Baydal, P. Lopez, and J. Duato, “A family of mechanisms for congestion control in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 16, 2005, pp. 772-784. 2. W. J. Dally, L. R. Dennison, D. Harris, K. Kan, and T. Xanthopoulos, “The reliable router: A reliable and high-performance communication substrate for parallel computers,” in Proceedings of the 1st Workshop on Parallel Computer Routing and Communication, 1994, pp. 241-255. 3. J. Duato, “A new theory of deadlock-free adaptive routing in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 4, 1993, pp. 1320-1331. 4. C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” Journal of the ACM, Vol. 41, 1994, pp. 874-902. 5. A. Khonsari, A. Shahrabi, M. Ould-Khaoua, and H. Sarbazi-Azad, “Performance comparison of deadlock recovery and deadlock avoidance routing algorithms in wormhole-switched networks,” IEE Proceedings of Computers and Digital Techniques, Vol. 150, 2003, pp. 97-106. 6. J. Kim, Z. Liu, and A. Chien, “Compressionless routing: Aframework for adaptive DEADLOCK DETECTION AND RECOVERY IN REGULAR WORMHOLE NETWORKS 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 479 and fault-tolerant routing,” IEEE Transactions on Parallel and Distributed Systems, Vol. 8, 1997, pp. 229-244. J. M. Martinez, P. Lopez, and J. Duato, “FC3D: Flow control-based distributed deadlock detection mechanism for true fully adaptive routing in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 15, 2003, pp. 765-779. J. M. Martinez, P. Lopez, and J. Duato, “A cost-effective approach to deadlock handling in wormhole networks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 12, 2001, pp. 716-729. P. Mohapatra, “Wormhole routing techniques for directly connected multicomputer systems,” ACM Computing Surveys, Vol. 30, 1998, pp. 374-410. T. M. Pinkston, “Flexible and efficient routing based on progressive deadlock recovery,” IEEE Transactions on Computers, Vol. 48, 1999, pp. 649-669. T. M. Pinkston and S. Warnakulasuriya, “Characterization of deadlocks in k-ary ncube networks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 10, 1999, pp. 904-921. L. Schwiebert, “Deadlock-free oblivious wormhole routing with cyclic dependencies,” IEEE Transactions on Computers, Vol. 50, 2001, pp. 865-876. S. L. Scott and G. M. Thorson, “The cray T3E network: Adaptive routing in a high performance 3D torus,” in Proceedings of Symposium on Hot Interconnects IV, 1996, pp. 147-156. A. Shahrabi and M. Ould-Khaoua, “On the performance of routing algorithms in wormhole-switched multicomputer networks,” in Proceedings of the 11th International Conference on Parallel and Distributed Systems, 2005, pp. 515-519. Y. H. Song and T. M. Pinkston, “A progressive approach to handling message-dependent deadlock in parallel computer systems,” IEEE Transactions on Parallel and Distributed Systems, Vol. 14, 2003, pp. 259-275. Y. M. Sun, C. H. Yang, Y. C. Chung, and T. Y. Huang, “An efficient deadlock-free tree-based routing algorithm for irregular wormhole-routed networks based on the turn model,” in Proceedings of International Conference on Parallel Processing, 2004, pp. 343-352. S. C. Wang, H. Y. Lin, and S. Y. Kuo, “A simple and efficient deadlock recovery scheme for wormhole routed 2-dimensional meshes,” in Proceedings of Pacific Rim International Symposium on Dependable Computing, 1999, pp. 210-217. Soojung Lee (李秀貞) received the B.S. degree in Mathematics from Ewha University, Seoul, Korea in 1985. She got the M.S. and Ph.D. degrees in Computer Science from Texas A&M University, in 1990 and 1994, respectively. She had been a senior engineer at the Telecommunication Research Center, Samsung Electronics Co. from 1994 through 1998. She is currently a professor in the Department of Computer Education, Gyeongin National University of Education. Her research interests include data mining, information retrieval, distributed computing, computer networks, and computer education.