Failure Transparency in Stateful Dataflow Systems (Technical Report)

Veresov, Aleksey; Spenger, Jonas; Carbone, Paris; Haller, Philipp

arXiv:2407.06738 (cs)

[Submitted on 9 Jul 2024 (v1), last revised 18 Oct 2024 (this version, v2)]

Authors: Aleksey Veresov (1), Jonas Spenger (1), Paris Carbone (1 and 2), Philipp Haller (1) ((1) KTH Royal Institute of Technology, (2) RISE Research Institutes of Sweden)

Abstract: Failure transparency enables users to reason about distributed systems at a higher level of abstraction, where complex failure-handling logic is hidden. This is especially true for stateful dataflow systems, which are the backbone of many cloud applications. In particular, this paper focuses on proving failure transparency in Apache Flink, a popular stateful dataflow system. Even though failure transparency is a critical aspect of Apache Flink, to date it has not been formally proven. Showing that the failure transparency mechanism is correct, however, is challenging due to the complexity of the mechanism itself. Nevertheless, this complexity can be effectively hidden behind a failure transparent programming interface. To show that Apache Flink is failure transparent, we model it in small-step operational semantics. Next, we provide a novel definition of failure transparency based on observational explainability, a concept which relates executions according to their observations. Finally, we provide a formal proof of failure transparency for the implementation model; i.e., we prove that the failure-free model correctly abstracts from the failure-related details of the implementation model. We also show liveness of the implementation model under a fair execution assumption. These results are a first step towards a verified stack for stateful dataflow systems.

Comments:	26 pages, 12 figures, 44 pages including references and appendix with proofs, technical report, ECOOP 2024; updated with DOI
Subjects:	Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2407.06738 [cs.PL]
	(or arXiv:2407.06738v2 [cs.PL] for this version)
	https://doi.org/10.48550/arXiv.2407.06738

Submission history

From: Jonas Spenger
[v1] Tue, 9 Jul 2024 10:32:26 UTC (279 KB)
[v2] Fri, 18 Oct 2024 09:32:11 UTC (279 KB)