Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Nikolic, Jovan; Jubatyrov, Nursultan; Pournaras, Evangelos

arXiv:2007.05261 (cs)

[Submitted on 10 Jul 2020 (v1), last revised 24 Jun 2021 (this version, v3)]

Abstract: Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often face the following self-healing dilemma: fault detection inherits network uncertainties, which makes a remote faulty process indistinguishable from a slow one. When a process is slow but not faulty, fault correction is undesirable because it can trigger new faults that fault tolerance, a more proactive form of system maintenance, could have prevented. When a process is actually faulty, however, fault tolerance alone, without eventually correcting persistent faults, can make systems underperform. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several energy, transport and health applications. This paper contributes a novel, general-purpose model of fault scenarios during system runtime. The modeled scenarios are used to accurately measure and predict inconsistencies generated by the undesirable outcomes of fault correction and fault tolerance, as a means to improve the self-healing of large-scale decentralized systems at the design phase. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds in a prototyped decentralized network of 3000 nodes. Almost 9 million measurements of inconsistencies were collected in a network where each node monitors the health status of another node, while both can defect. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network data aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at the design phase.
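
The dilemma described in the abstract can be illustrated with a minimal sketch of timeout-based fault detection. This is not the paper's model: the `Process` class, the `decide` and `outcome` functions, and the reading of a fault detection threshold as a silence timeout in seconds are all assumptions made for illustration. The point is only that a monitor sees silence, not its cause, so any threshold trades unnecessary corrections of slow processes against delayed corrections of faulty ones.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Hypothetical monitored peer (illustrative, not from the paper)."""
    name: str
    reply_delay: float  # seconds until a reply; math.inf models a faulty process


def decide(silence: float, threshold: float) -> str:
    """The monitor only knows how long the peer has been silent."""
    return "correct" if silence >= threshold else "tolerate"


def outcome(process: Process, threshold: float) -> str:
    """Ground-truth outcome of applying the threshold to one process."""
    if process.reply_delay < threshold:
        # The reply arrives before the monitor gives up waiting.
        return "tolerated: reply arrived in time"
    # Silence reached the threshold, so the monitor triggers correction.
    if math.isinf(process.reply_delay):
        return "corrected: process was faulty"
    return "corrected: process was only slow"


slow = Process("slow", reply_delay=5.0)
faulty = Process("faulty", reply_delay=math.inf)

# After 2 s of silence both peers look identical to the monitor.
print(decide(2.0, threshold=3.0))       # same decision for slow and faulty
print(outcome(slow, threshold=3.0))     # low threshold: unnecessary correction
print(outcome(slow, threshold=10.0))    # high threshold: slow peer is tolerated
print(outcome(faulty, threshold=10.0))  # but the faulty peer stays uncorrected for 10 s
```

Under these assumptions, lowering the threshold corrects faulty processes sooner but also corrects more slow ones, while raising it does the opposite.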

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Cite as:	arXiv:2007.05261 [cs.DC]
	(or arXiv:2007.05261v3 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2007.05261

Submission history

From: Evangelos Pournaras
[v1] Fri, 10 Jul 2020 09:10:00 UTC (623 KB)
[v2] Mon, 12 Apr 2021 17:50:13 UTC (7,380 KB)
[v3] Thu, 24 Jun 2021 16:34:40 UTC (7,379 KB)
