StreamBed: Capacity Planning for Stream Processing

Authors:

Guillaume Rosinosky,

Donatien Schmitz,

Etienne Rivière

DEBS '24: Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems

Pages 90 - 102

https://doi.org/10.1145/3629104.3666034

Published: 22 July 2024

Abstract

We present StreamBed, a capacity planning system for stream processing. StreamBed predicts, ahead of any production deployment, the resources that a query will require to process an incoming data rate sustainably, and the appropriate configuration of these resources. For this purpose, StreamBed builds a capacity planning model by piloting a series of runs of the target query in a small-scale, controlled testbed. We implement StreamBed for Apache Flink. Our evaluation with large-scale queries of the Nexmark benchmark demonstrates that StreamBed can accurately predict capacity requirements for jobs spanning more than 1,000 cores using a model built with a 48-core testbed.
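The idea of extrapolating from a small testbed to a large deployment can be illustrated with a minimal sketch. It is not StreamBed's actual model, which the abstract does not detail: it assumes a plain linear relationship between cores and maximum sustainable rate, and the measurements, the function names `fit_line` and `cores_for_rate`, and the target rate are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def cores_for_rate(target_rate, slope, intercept):
    """Smallest whole number of cores whose predicted rate meets target_rate."""
    if slope <= 0:
        raise ValueError("model does not predict throughput growing with cores")
    return max(1, math.ceil((target_rate - intercept) / slope))


# Made-up testbed measurements (NOT taken from the paper):
# cores used in a run -> maximum sustainable input rate in events/s.
testbed_cores = [8, 16, 24, 32, 48]
testbed_rates = [80_000, 160_000, 240_000, 320_000, 480_000]

# Build the model on the small testbed, then extrapolate to a large target rate.
slope, intercept = fit_line(testbed_cores, testbed_rates)
needed = cores_for_rate(12_345_000, slope, intercept)
print(f"predicted cores for 12,345,000 events/s: {needed}")
```

Real queries rarely scale perfectly linearly, so a practical model would need to capture sublinear behaviour and the resource configuration as well; the sketch only shows the extrapolation step.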

Published In

DEBS '24: Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems

June 2024

239 pages

ISBN: 9798400704437

DOI: 10.1145/3629104

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

SWP Wallonia

Conference

DEBS '24

DEBS '24: The 18th ACM International Conference on Distributed and Event-based Systems

June 24 - 28, 2024

Villeurbanne, France

Acceptance Rates

DEBS '24 Paper Acceptance Rate 15 of 30 submissions, 50%;

Overall Acceptance Rate 145 of 583 submissions, 25%
