Open access
Date: 2022
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Modern data analytics and processing systems increasingly rely on rack-scale or cluster-scale deployments to cope with input rates and memory requirements that a single compute node cannot handle. The infrastructure to run these systems has a high cost: gains in efficiency translate into large savings and enable more sophisticated analyses. At the same time, the programming abstractions for data-parallel processing that provide semi-automatic scaling come with a significant performance tradeoff, and the cost of abstraction can often outweigh the gains from horizontal scaling.
Dataflow programming models with epoch-based, fine-grained coordination were developed to have significantly less intrinsic overhead: systems based on these models enable efficient implementations of many large-scale, low-latency data analytics and processing tasks. However, system mechanisms such as dynamic re-scaling, online data re-partitioning, fault tolerance, and index sharing need to be adapted to this more complex execution model, and must introduce minimal overhead to avoid squandering these systems' increased efficiency. These mechanisms are often prerequisites for deploying such systems in the real world.
This thesis describes how to adapt the distributed dataflow programming model so that low-overhead, predictable index sharing, re-scaling, re-partitioning, fault tolerance, and resource management can be implemented as optional libraries written against a core dataflow system that needs only to provide dataflow primitives. It then demonstrates how to build these mechanisms with acceptable throughput overhead and a predictable, bounded latency cost, making them suitable for interactive applications.
We present a new programming abstraction for data-parallel dataflow systems: a coordination primitive that a dataflow program can use to precisely signal fine-grained coordination information. Building on this abstraction, we design and implement a data index sharing mechanism inspired by DBMSs but adapted to low-coordination dataflow systems, as well as a fault-tolerance and dynamic re-scaling protocol with predictable performance. To help ensure the correctness of these mechanisms and of the data processing tasks, we also formalize and verify the core coordination protocol of a state-of-the-art stream processor that supports our new coordination primitive.
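The epoch-based coordination the abstract refers to can be pictured with a toy progress tracker: each record belongs to an epoch, and downstream operators learn that an epoch is complete when the "frontier" advances past it. This is a simplified illustration only; the class and method names here are invented for the sketch and are not the thesis's actual primitive or protocol.

```python
from collections import Counter

class ProgressTracker:
    """Toy epoch-based progress tracker: counts records in flight per
    epoch and reports the frontier, i.e. the earliest epoch that may
    still produce data. Epochs before the frontier are complete."""

    def __init__(self):
        self.outstanding = Counter()  # epoch -> records in flight

    def produce(self, epoch, n=1):
        # A worker announces n new records for this epoch.
        self.outstanding[epoch] += n

    def consume(self, epoch, n=1):
        # A worker retires n records of this epoch.
        self.outstanding[epoch] -= n
        if self.outstanding[epoch] == 0:
            del self.outstanding[epoch]

    def frontier(self):
        # None means all epochs are complete.
        return min(self.outstanding) if self.outstanding else None

tracker = ProgressTracker()
tracker.produce(0, 2)          # two records in epoch 0
tracker.produce(1, 1)          # one record in epoch 1
tracker.consume(0)
print(tracker.frontier())      # 0: epoch 0 still has a record in flight
tracker.consume(0)
print(tracker.frontier())      # 1: epoch 0 closed; it can be finalized
```

A real system distributes this bookkeeping across workers and channels, which is exactly where the coordination protocol that the thesis formalizes and verifies comes in.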
Permanent link: https://doi.org/10.3929/ethz-b-000606757
Publication status: published
External links: Search print copy at ETH Library
Publisher: ETH Zurich
Organisational unit: 03757 - Roscoe, Timothy