research-article

A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm

Published: 01 January 1998

Abstract

Consider a system composed of N nodes that can be faulty or fault-free. The purpose of distributed system-level diagnosis is to have each fault-free node determine the state of all nodes of the system. This paper presents a Hierarchical Adaptive Distributed System-level Diagnosis (Hi-ADSD) algorithm, a fully distributed algorithm that allows every fault-free node to achieve diagnosis in at most $(\log_2 N)^2$ testing rounds. Nodes are mapped into progressively larger logical clusters, so that tests are run in a hierarchical fashion. Each node executes its tests independently of the other nodes, i.e., tests are run asynchronously. All the information that nodes exchange is diagnostic information. The algorithm assumes no link faults and a fully connected network, and imposes no bound on the number of faults. Both the worst-case diagnosis latency and the correctness of the algorithm are formally proved. As an example application, the algorithm was implemented on a 37-node Ethernet LAN, integrated with a network management system based on SNMP (Simple Network Management Protocol). Experimental results of fault and repair diagnosis are presented. This implementation by itself is also a significant contribution, for, although fault management is a key functional area of network management systems, currently deployed applications often implement only rudimentary diagnosis mechanisms. Furthermore, experimental results are given through simulation of the algorithm for large systems of 64 nodes and 512 nodes.
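
As a rough illustration of the latency bound stated in the abstract, the sketch below evaluates $(\log_2 N)^2$ for the two simulated system sizes. The function names are hypothetical and not taken from the paper. The sketch assumes N is a power of two, and it also assumes, beyond what the abstract states, that cluster sizes double at each level (1, 2, 4, ..., N/2).

```python
def worst_case_rounds(n: int) -> int:
    """Upper bound (log2 N)^2 on testing rounds, per the abstract.

    Assumption (not stated in the abstract): N is a power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    log_n = n.bit_length() - 1  # exact log2 for a power of two
    return log_n ** 2


def cluster_sizes(n: int) -> list[int]:
    """Hypothetical cluster sizes, assuming they double at each level.

    This doubling is an illustrative assumption; the abstract only says
    that clusters grow progressively larger.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    log_n = n.bit_length() - 1
    return [2 ** s for s in range(log_n)]


if __name__ == "__main__":
    for n in (64, 512):
        print(n, worst_case_rounds(n), cluster_sizes(n))
```

Under these assumptions the bound is 36 testing rounds for 64 nodes and 81 for 512 nodes, and the assumed cluster sizes sum to N - 1, i.e., every node other than the tester falls in exactly one cluster.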

Published In

IEEE Transactions on Computers, Volume 47, Issue 1
January 1998
143 pages
ISSN: 0018-9340

Publisher

IEEE Computer Society

United States

Cited By

  • (2024) vCubeChain. Ad Hoc Networks, 158:C. DOI: 10.1016/j.adhoc.2024.103461. Online publication date: 1-May-2024
  • (2023) Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems. Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing, pp. 40-49. DOI: 10.1145/3615366.3615420. Online publication date: 16-Oct-2023
  • (2023) The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing, 105:12, pp. 2821-2845. DOI: 10.1007/s00607-023-01211-8. Online publication date: 18-Aug-2023
  • (2016) Network Monitoring with Imperfect Tests. Proceedings of the 2016 workshop on Fostering Latin-American Research in Data Communication Networks, pp. 49-51. DOI: 10.1145/2940116.2940124. Online publication date: 22-Aug-2016
  • (2015) Distributed self fault diagnosis algorithm for large scale wireless sensor networks using modified three sigma edit test. Ad Hoc Networks, 25:PA, pp. 170-184. DOI: 10.1016/j.adhoc.2014.10.006. Online publication date: 1-Feb-2015
  • (2014) VCube. Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp. 17-22. DOI: 10.1109/ScalA.2014.14. Online publication date: 16-Nov-2014
  • (2013) MoDiVHA. Journal of Electronic Testing: Theory and Applications, 29:6, pp. 839-847. DOI: 10.1007/s10836-013-5400-1. Online publication date: 1-Dec-2013
  • (2011) Fault diagnosis for hypercube-like networks. Proceedings of the 2nd international conference on Applied informatics and computing theory, pp. 205-209. DOI: 10.5555/2047895.2047932. Online publication date: 26-Sep-2011
  • (2011) A survey of comparison-based system-level diagnosis. ACM Computing Surveys, 43:3, pp. 1-56. DOI: 10.1145/1922649.1922659. Online publication date: 29-Apr-2011
  • (2010) Distributed testing and diagnosis in a mobile computing environment. Proceedings of the 6th International Wireless Communications and Mobile Computing Conference, pp. 1268-1272. DOI: 10.1145/1815396.1815686. Online publication date: 28-Jun-2010
