Abstract
Distributed simulation systems are widely used for forecasting, decision-making, and scientific computing. Multi-agent systems and Grids have been used as platforms for simulation. To survive software or hardware failures and guarantee a high success rate during agent migration, a system must solve the fault tolerance problem. Classic fault tolerance techniques such as checkpointing and redundancy can be applied to distributed simulation systems, but they are not efficient. We present a novel fault tolerance protocol that combines causal message logging with primary-backup technology. The proposed protocol uses an iterative backup location scheme and an adaptive update interval to reduce overhead and to balance the cost of fault tolerance against recovery time. The protocol produces no orphan states and does not require surviving agents to roll back. Most importantly, the recovery scheme can tolerate concurrent failures, even the permanent failure of a single node. The correctness of the protocol is proved, and experiments show that the protocol is efficient.
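The general idea of pairing message logging with a backup copy can be illustrated with a minimal sketch. This is not the protocol from the paper: the names `Backup`, `Agent`, `recover`, and `update_interval` are hypothetical, the agent state is reduced to a running integer sum, and messages are logged directly at the backup instead of being piggybacked as in true causal logging. The update interval is fixed here, whereas the paper adapts it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Backup:
    """Hypothetical backup record kept on another node."""
    checkpoint: int = 0                            # last state copied from the primary
    log: List[int] = field(default_factory=list)   # messages received since then


class Agent:
    """Toy primary agent whose state is the sum of delivered messages."""

    def __init__(self, backup: Backup, update_interval: int) -> None:
        self.state = 0
        self.backup = backup
        self.update_interval = update_interval     # assumed fixed in this sketch
        self.since_update = 0

    def deliver(self, msg: int) -> None:
        # Log the message at the backup before applying it,
        # so that it can be replayed if this agent fails.
        self.backup.log.append(msg)
        self.state += msg
        self.since_update += 1
        if self.since_update >= self.update_interval:
            self.update_backup()

    def update_backup(self) -> None:
        # Copy the current state to the backup and truncate the log.
        self.backup.checkpoint = self.state
        self.backup.log.clear()
        self.since_update = 0


def recover(backup: Backup, update_interval: int) -> Agent:
    """Rebuild a failed agent from its backup by replaying logged messages."""
    agent = Agent(backup, update_interval)
    agent.state = backup.checkpoint
    for msg in backup.log:
        agent.state += msg                         # replay without logging again
    agent.since_update = len(backup.log)
    return agent
```

In this sketch a smaller `update_interval` means more frequent state copies but a shorter log to replay after a failure, which is the kind of cost versus recovery-time trade-off the abstract refers to. Surviving agents are not involved in `recover`, so none of them has to roll back.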
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ma, M., Jin, S., Ye, C., Liu, X. (2006). Dynamic Fault Tolerance in Distributed Simulation System. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds) Computational Science – ICCS 2006. ICCS 2006. Lecture Notes in Computer Science, vol 3991. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11758501_102
DOI: https://doi.org/10.1007/11758501_102
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34379-0
Online ISBN: 978-3-540-34380-6
eBook Packages: Computer Science, Computer Science (R0)