DOI: 10.1145/3315507.3330199
research-article
Open access

Arc: an IR for batch and stream programming

Published: 23 June 2019

Abstract

In big data analytics, there are currently many data programming models and respective frontends, such as relational tables, graphs, tensors, and streams. This has led to a plethora of runtimes that typically focus on the efficient execution of just a single frontend. Today this fragmentation manifests itself in highly complex pipelines that bundle multiple runtimes to support the necessary models. Hence, joint optimization and execution of such pipelines across these frontend-bound runtimes is infeasible. We propose Arc as the first unified Intermediate Representation (IR) for data analytics that incorporates stream semantics, based on a modern specification of streams, windows, and stream aggregation, to combine batch and stream computation models. Arc extends Weld, an IR for batch computation, and adds support for partitioned, out-of-order stream and window operators, which are the most fundamental building blocks in contemporary data streaming.
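As a rough illustration of what a partitioned, out-of-order window operator computes, the following Python sketch keeps a per-key sum over fixed-size event-time windows and emits a window once a watermark passes its end. It is not Arc or Weld syntax; the class name `TumblingWindowSum`, its methods, and the choice of a tumbling window with a sum aggregate are all assumptions made for this example.

```python
from collections import defaultdict


class TumblingWindowSum:
    """Per-key sums over fixed-size event-time windows (hypothetical helper)."""

    def __init__(self, size):
        self.size = size                # window length in event-time units
        self.open = defaultdict(int)    # (key, window_start) -> running sum
        self.watermark = float("-inf")  # event time up to which input is assumed complete

    def on_event(self, key, timestamp, value):
        """Fold one event into its window, regardless of arrival order.

        Returns False if the event's window was already emitted (late event).
        """
        start = timestamp - timestamp % self.size
        if start + self.size <= self.watermark:
            return False
        self.open[(key, start)] += value
        return True

    def on_watermark(self, watermark):
        """Advance event time and emit every window that ends at or before it."""
        self.watermark = max(self.watermark, watermark)
        closed = sorted(k for k in self.open if k[1] + self.size <= self.watermark)
        return [(key, start, self.open.pop((key, start))) for key, start in closed]
```

Events for the same key may arrive in any timestamp order before the watermark; an event whose window has already been emitted is rejected rather than silently merged, which is one of several possible lateness policies.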

Published In

DBPL 2019: Proceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages
June 2019
84 pages
ISBN:9781450367189
DOI:10.1145/3315507
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data analytics
  2. intermediate representation
  3. stream processing

Qualifiers

  • Research-article

Funding Sources

  • Stiftelsen för Strategisk Forskning

Conference

PLDI '19
Acceptance Rates

Overall Acceptance Rate 10 of 15 submissions, 67%
