DOI: 10.1145/3590140.3629113
research-article

Pravega: A Tiered Storage System for Data Streams

Published: 27 November 2023 Publication History

Abstract

The growing popularity of the data stream abstraction brings challenging new requirements for data ingestion and storage. Many organizations expect to retain data streams for extended periods and to store that data cost-effectively. It is also crucial to reconcile seemingly conflicting properties, such as data durability and consistency on the one hand and high performance on the other. Furthermore, data streams should not only support a high degree of parallelism, but also adapt to fluctuating workloads with little or no administrator intervention. To our knowledge, no storage system for data streams fully meets all of these requirements.
In this paper, we present Pravega, a distributed, tiered storage system for data streams. Pravega streams are unbounded by design and cost-effective, as the system automatically moves data to a long-term storage tier (e.g., S3, NFS) and transparently manages it for the user. Pravega guarantees no duplicate or missing events, as well as per-routing-key event ordering, while providing high-performance streaming I/O and historical reads. As a unique feature, Pravega streams are elastic: they can automatically change their degree of parallelism based on the ingestion workload. We compared the performance of Pravega with that of Apache Kafka and Apache Pulsar on AWS. Our results show that Pravega can deliver performance improvements over both in many scenarios.
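
To make the two properties named in the abstract more concrete (per-routing-key ordering and elastic parallelism), the following is a minimal toy model, not Pravega's implementation or client API. It assumes that routing keys are hashed to a position in [0, 1), that each parallel unit (called a segment here) owns a contiguous range of that key space, and that scaling up splits a range at its midpoint. The names `ToyStream`, `Segment`, and `key_position`, the use of SHA-256, and the midpoint split are all hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Segment:
    low: float   # inclusive lower bound of the owned key-space range
    high: float  # exclusive upper bound of the owned key-space range
    events: list = field(default_factory=list)  # append-only (key, payload) log


def key_position(routing_key: str) -> float:
    """Map a routing key to a stable position in [0, 1) (assumed hashing scheme)."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class ToyStream:
    """Toy stream whose parallelism can grow by splitting key-space ranges."""

    def __init__(self, num_segments: int = 1):
        self.active = [
            Segment(i / num_segments, (i + 1) / num_segments)
            for i in range(num_segments)
        ]
        self.sealed = []  # predecessors, in the order they were sealed

    def write(self, routing_key: str, payload) -> None:
        # Every event with the same key lands in the same active segment,
        # so appends for that key are totally ordered.
        pos = key_position(routing_key)
        seg = next(s for s in self.active if s.low <= pos < s.high)
        seg.events.append((routing_key, payload))

    def split(self, index: int) -> None:
        # Scale up: seal one segment and replace it with two successors
        # that together cover the same key-space range.
        old = self.active.pop(index)
        mid = (old.low + old.high) / 2
        self.sealed.append(old)
        self.active[index:index] = [Segment(old.low, mid), Segment(mid, old.high)]

    def read_key(self, routing_key: str) -> list:
        # Read sealed predecessors before active successors so that
        # per-key order survives a scaling event.
        pos = key_position(routing_key)
        payloads = []
        for seg in self.sealed + self.active:
            if seg.low <= pos < seg.high:
                payloads.extend(p for k, p in seg.events if k == routing_key)
        return payloads


if __name__ == "__main__":
    stream = ToyStream(num_segments=2)
    for i in range(3):
        stream.write("sensor-a", i)
    stream.split(0)  # increase parallelism mid-stream
    for i in range(3, 6):
        stream.write("sensor-a", i)
    print(stream.read_key("sensor-a"))
```

The design point the sketch tries to capture is that ordering is only promised within a routing key, which is what leaves the system free to change the number of parallel units over time. The toy model decides when to split by hand; the abstract states that Pravega does this automatically based on the ingestion workload.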



Published In

Middleware '23: Proceedings of the 24th International Middleware Conference
November 2023
334 pages
ISBN: 9798400701771
DOI: 10.1145/3590140
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • IFIP: International Federation for Information Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Author Tags

  1. data streams
  2. distributed storage
  3. performance
  4. storage tiering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%

Article Metrics

  • Downloads (last 12 months): 304
  • Downloads (last 6 weeks): 28

Reflects downloads up to 26 Jan 2025

Cited By

  • (2024) "Back to the Byte": Towards Byte-oriented Semantics for Streaming Storage. Proceedings of the 25th International Middleware Conference Industrial Track, 43-49. DOI: 10.1145/3700824.3701099. Online publication date: 2-Dec-2024
  • (2024) StreamSense: Policy-driven Semantic Video Search in Streaming Systems. Proceedings of the 25th International Middleware Conference Industrial Track, 29-35. DOI: 10.1145/3700824.3701097. Online publication date: 2-Dec-2024
  • (2024) Vortex: A Stream-oriented Storage Engine For Big Data Analytics. Companion of the 2024 International Conference on Management of Data, 175-187. DOI: 10.1145/3626246.3653396. Online publication date: 9-Jun-2024
  • (2024) HUILLY: A Non-Blocking Ingestion Buffer for Timestepped Simulation Analytics. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 113-118. DOI: 10.1109/CCGrid59990.2024.00022. Online publication date: 6-May-2024
