Dynamic Reconfiguration in Computer Clusters with Irregular Topologies in the Presence of Multiple Node and Link Failures

Published: 01 May 2005

Abstract

Component failures in high-speed computer networks can result in significant topological changes. In such cases, a network reconfiguration algorithm must be executed to restore the connectivity between the network nodes. Most contemporary networks either use static reconfiguration algorithms or stop the user traffic in order to prevent cyclic dependencies in the routing tables. The goal of this paper is to present NetRec, a dynamic network reconfiguration algorithm for tolerating multiple node and link failures in high-speed networks with arbitrary topology. The algorithm updates the routing tables asynchronously and does not require any global knowledge about the network topology. Certain phases of NetRec are executed in parallel, which reduces the reconfiguration time. The algorithm suspends the application traffic only in small regions of the network, and only while the routing tables are being updated. The message complexity of NetRec is analyzed, and the termination, liveness, and safety of the proposed algorithm are proven. Additionally, results are presented from validation of the algorithm in Distant, a distributed network-validation testbed based on the MPI 1.2 features for building arbitrary virtual topologies.

Reviews

Angele M. Hamel

Avresky and Natchev address the problem of network reconfiguration after multiple simultaneous node or link failures. The paper builds on the authors' previous work, which explored network reconfiguration after a single node failure. The algorithm proposed, NetRec, employs a distributed philosophy: each node is presumed to have knowledge only of its immediate neighbors and those nodes two hops away. It is a dynamic algorithm, and seeks to avoid prolonged stopping of user traffic and costly routing table updates, maintenance, and monitoring. It operates by building restoration trees that route around the failure. It then ensures that, in the case of multiple failures, these trees do not intersect. The authors also provide simulation results, discuss the complexity of their algorithm, and prove its correctness and its avoidance of infinite loops and cyclic dependencies. The paper is a natural progression from previous work, and is a solid and relevant contribution.

Online Computing Reviews Service
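
To make the idea of a restoration tree concrete, here is a minimal Python sketch. It is not NetRec itself: it is a centralized toy that assumes the whole adjacency map is available, whereas the review describes nodes with only two-hop knowledge. The function name `restoration_tree`, the `adj` dictionary, and the sample topology are all hypothetical. The sketch only shows a tree that reconnects the surviving neighbors of one failed node without passing through it.

```python
from collections import deque


def restoration_tree(adj, failed):
    """Return {child: parent} links of a tree that reconnects the
    neighbors of `failed` while avoiding `failed` itself.

    Illustration only: a centralized breadth-first search, not the
    distributed protocol described in the paper.
    """
    targets = sorted(adj[failed])
    if not targets:
        return {}
    root = targets[0]
    parent = {root: None}
    queue = deque([root])
    # Breadth-first search over the surviving part of the network.
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj[node]):
            if nxt != failed and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    # Keep only the branches that lead from the root to a neighbor
    # of the failed node.
    tree = {}
    for t in targets:
        if t not in parent:
            raise ValueError(f"{t} is cut off from {root} once {failed} fails")
        while parent[t] is not None and t not in tree:
            tree[t] = parent[t]
            t = parent[t]
    return tree


# Hypothetical five-node topology; node "F" fails.
adj = {
    "A": {"B", "F"},
    "B": {"A", "C", "F"},
    "C": {"B", "D", "F"},
    "D": {"C"},
    "F": {"A", "B", "C"},
}
print(restoration_tree(adj, "F"))  # {'B': 'A', 'C': 'B'}
```

Node D does not appear in the result because it was not adjacent to the failed node, so none of its routes need to change in this toy model. Handling several failures at once, and keeping the resulting trees from intersecting, is the part the paper contributes and is not attempted here.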

Published In

IEEE Transactions on Computers, Volume 54, Issue 5
May 2005
144 pages

Publisher

IEEE Computer Society

United States

Author Tags

  1. Dynamic reconfiguration
  2. clusters of workstations
  3. fault tolerance
  4. irregular topologies
  5. multiple node and link failures

Qualifiers

  • Research-article

Cited By

  • (2017) Transparent lifetime built-in self-testing of networks-on-chip through the selective non-concurrent testing of their communication channels. Proceedings of the 2nd International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, pp. 12-17. DOI: 10.1145/3073763.3073765. Online publication date: 25-Jan-2017
  • (2015) Synergistic use of multiple on-chip networks for ultra-low latency and scalable distributed routing reconfiguration. Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pp. 806-811. DOI: 10.5555/2755753.2755936. Online publication date: 9-Mar-2015
  • (2012) Enhancing Routing Robustness of Unstructured Peer-to-Peer Networks Using Mobile Agents. Journal of Network and Systems Management, 20:3, pp. 309-352. DOI: 10.1007/s10922-011-9203-3. Online publication date: 1-Sep-2012
  • (2011) An abacus turn model for time/space-efficient reconfigurable routing. ACM SIGARCH Computer Architecture News, 39:3, pp. 259-270. DOI: 10.1145/2024723.2000096. Online publication date: 4-Jun-2011
  • (2011) An abacus turn model for time/space-efficient reconfigurable routing. Proceedings of the 38th annual international symposium on Computer architecture, pp. 259-270. DOI: 10.1145/2000064.2000096. Online publication date: 4-Jun-2011
  • (2009) RecTOR. Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp. 1052-1064. DOI: 10.1007/978-3-642-03869-3_97. Online publication date: 23-Aug-2009
  • (2007) A distributed approach to handle topological changes in advanced switching. Proceedings of the 2nd ACM workshop on Performance monitoring and measurement of heterogeneous wireless and wired networks, pp. 37-44. DOI: 10.1145/1298275.1298283. Online publication date: 22-Oct-2007
