DOI: 10.1145/3524860.3539782
tutorial
Open access

A unifying model for distributed data-intensive systems

Published: 15 July 2022

Abstract

Modern applications handle increasingly large volumes of data, generated at an unprecedented and constantly growing rate. They introduce challenges that are radically transforming the research fields that gravitate around data management and processing, leading to a proliferation of distributed data-intensive systems. Each such system comes with its own assumptions, data and processing model, design choices, implementation strategies, and guarantees. Yet the problems data-intensive systems face and the solutions they propose frequently overlap.
This tutorial presents a unifying model for data-intensive systems that dissects them into core building blocks, enabling a precise and unambiguous description and a detailed comparison. From the model, we derive a list of classification criteria and we use them to build a taxonomy of state-of-the-art systems. The tutorial offers a global view of the vast research field of data-intensive systems, highlighting interesting observations on the current state of things, and suggesting promising research directions.

Cited By

  • (2024) Rest security framework for event streaming bus architecture. International Journal of Information Technology 16:5 (3033-3047). DOI: 10.1007/s41870-024-01836-8. Online publication date: 13-Apr-2024
  • (2024) Emerging Concepts Using Blockchain and Big Data. Artificial Intelligence, Data Science and Applications (487-492). DOI: 10.1007/978-3-031-48573-2_70. Online publication date: 30-Jan-2024
  • (2022) A brief survey on big data: technologies, terminologies and data-intensive applications. Journal of Big Data 9:1. DOI: 10.1186/s40537-022-00659-3. Online publication date: 17-Nov-2022

Published In

DEBS '22: Proceedings of the 16th ACM International Conference on Distributed and Event-Based Systems
June 2022
210 pages
ISBN: 9781450393089
DOI: 10.1145/3524860
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data
  2. data management
  3. data processing
  4. data-intensive systems
  5. distributed database
  6. distributed systems
  7. model
  8. survey

Qualifiers

  • Tutorial

Conference

DEBS '22

Acceptance Rates

DEBS '22 Paper Acceptance Rate 10 of 19 submissions, 53%;
Overall Acceptance Rate 145 of 583 submissions, 25%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 83
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 04 Oct 2024
