Dynamic Fault Tolerance in Distributed Simulation System

Ma, Min; Jin, Shiyao; Ye, Chaoqun; Liu, Xiaojian

doi:10.1007/11758501_102

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3991)

Included in the following conference series:

International Conference on Computational Science

Abstract

Distributed simulation systems are widely used for forecasting, decision-making, and scientific computing, and multi-agent systems and Grids have been used as platforms for simulation. To survive software or hardware failures and to guarantee the success rate of agent migration, such a system must solve the fault tolerance problem. Classic fault tolerance techniques such as checkpointing and redundancy can be applied to distributed simulation systems, but they are not efficient. We present a novel fault tolerance protocol that combines causal message logging with primary-backup technology. The proposed protocol uses an iterative backup location scheme and an adaptive update interval to reduce overhead and to balance the cost of fault tolerance against recovery time. The protocol produces no orphan states and does not require surviving agents to roll back. Most importantly, the recovery scheme can tolerate concurrent failures, even the permanent failure of a single node. The correctness of the protocol is proved, and experiments show that the protocol is efficient.
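
The abstract does not give the protocol itself, so the following is only a minimal sketch of the general idea it builds on: a primary agent that logs incoming messages at a backup and periodically ships a checkpoint, so that the backup can rebuild the state by replay. It is not the authors' protocol. The causal piggybacking of log information, the iterative backup location scheme, and the adaptive tuning of the interval are all omitted, and the names `Primary`, `Backup`, and `update_interval` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup: holds the last checkpoint and messages logged since."""
    checkpoint: int = 0
    log: list = field(default_factory=list)

    def recover(self) -> int:
        # Rebuild the lost state by replaying logged messages on the checkpoint.
        state = self.checkpoint
        for msg in self.log:
            state += msg
        return state


@dataclass
class Primary:
    """Hypothetical simulation agent whose state is a running sum of messages."""
    backup: Backup
    update_interval: int = 3  # messages between checkpoints (assumed, fixed here)
    state: int = 0
    since_update: int = 0

    def deliver(self, msg: int) -> None:
        # Log the message at the backup, then apply it locally.
        self.backup.log.append(msg)
        self.state += msg
        self.since_update += 1
        if self.since_update >= self.update_interval:
            # Ship a fresh checkpoint and truncate the log it makes redundant.
            self.backup.checkpoint = self.state
            self.backup.log.clear()
            self.since_update = 0


backup = Backup()
primary = Primary(backup)
for m in [1, 2, 3, 4, 5]:
    primary.deliver(m)

# Simulate a primary failure: the backup reconstructs the state on its own,
# without asking any surviving agent to roll back.
recovered = backup.recover()
print(recovered, len(backup.log))
```

In this sketch a smaller `update_interval` means more frequent checkpoint traffic but a shorter log to replay after a failure, which is the kind of trade-off between fault tolerance cost and recovery time that the abstract mentions.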

Author information

Authors and Affiliations

School of Computer Science, National University of Defense Technology, Hunan, Changsha, 410073, China
Min Ma, Shiyao Jin, Chaoqun Ye & Xiaojian Liu

Editor information

Editors and Affiliations

Advanced Computing and Emerging Technologies Centre, The School of Systems Engineering, University of Reading, RG6 6AY, Reading, United Kingdom
Vassil N. Alexandrov
Department of Mathematics and Computer Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Geert Dick van Albada
Faculty of Sciences, Section of Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Peter M. A. Sloot
Computer Science Department, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra

About this paper

Cite this paper

Ma, M., Jin, S., Ye, C., Liu, X. (2006). Dynamic Fault Tolerance in Distributed Simulation System. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds) Computational Science – ICCS 2006. ICCS 2006. Lecture Notes in Computer Science, vol 3991. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11758501_102

DOI: https://doi.org/10.1007/11758501_102
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34379-0
Online ISBN: 978-3-540-34380-6
eBook Packages: Computer Science, Computer Science (R0)
