research-article

Shared arrangements: practical inter-query sharing for streaming dataflows

Authors: Frank McSherry, Andrea Lattuada, Malte Schwarzkopf, and Timothy Roscoe

Proceedings of the VLDB Endowment, Volume 13, Issue 10

Pages 1793 - 1806

https://doi.org/10.14778/3401960.3401974

Published: 01 June 2020

Abstract

Current systems for data-parallel, incremental processing and view maintenance over high-rate streams isolate the execution of independent queries. This creates unwanted redundancy and overhead in the presence of concurrent incrementally maintained queries: each query must independently maintain the same indexed state over the same input streams, and new queries must build this state from scratch before they can begin to emit their first results.

This paper introduces shared arrangements: indexed views of maintained state that allow concurrent queries to reuse the same in-memory state without compromising data-parallel performance and scaling. We implement shared arrangements in a modern stream processor and show order-of-magnitude improvements in query response time and resource consumption for incremental, interactive queries against high-throughput streams, while also significantly improving performance in other domains including business analytics, graph processing, and program analysis.
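
The sharing idea in the abstract can be pictured with a toy, single-threaded sketch. This is not the paper's implementation or the API of any real stream processor: the names `SharedIndex` and `Query` and the `(key, value, diff)` update shape are assumptions made only for illustration. The index is maintained once, and each query reads it instead of building a private copy, so a query that attaches later starts from state that already exists.

```python
from collections import defaultdict


class SharedIndex:
    """Toy keyed index over (key, value, diff) updates (hypothetical)."""

    def __init__(self):
        # key -> value -> accumulated multiplicity
        self._state = defaultdict(lambda: defaultdict(int))
        self.updates_applied = 0

    def apply(self, key, value, diff):
        """Apply one update; diff is +1 for an insert, -1 for a delete."""
        values = self._state[key]
        values[value] += diff
        if values[value] == 0:
            del values[value]      # drop values whose multiplicity cancels out
        if not values:
            del self._state[key]   # drop keys that no longer hold any value
        self.updates_applied += 1

    def lookup(self, key):
        """Return the distinct values currently stored under key, sorted."""
        return sorted(self._state.get(key, {}))


class Query:
    """A query that reads a shared index instead of maintaining its own."""

    def __init__(self, index, keys):
        self.index = index
        self.keys = keys

    def result(self):
        return {k: self.index.lookup(k) for k in self.keys}


# One index over an edge stream, maintained once.
edges = SharedIndex()
for src, dst in [(1, 2), (1, 3), (2, 3)]:
    edges.apply(src, dst, +1)

q1 = Query(edges, keys=[1])
q2 = Query(edges, keys=[1, 2])  # attaches later and reuses the existing state
edges.apply(1, 2, -1)           # a single update is visible to both queries

print(q1.result())              # {1: [3]}
print(q2.result())              # {1: [3], 2: [3]}
print(edges.updates_applied)    # 4
```

In this sketch each update is applied once no matter how many queries are attached. A real system would also have to handle data-parallel workers and logical timestamps, which the sketch leaves out.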

Index Terms

Shared arrangements: practical inter-query sharing for streaming dataflows
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 10, June 2020, 193 pages

ISSN: 2150-8097

Editors: Magdalena Balazinska (University of Washington) and Xiaofang Zhou (University of Queensland, Australia)

Publisher: VLDB Endowment

Citations

Cited By

Sahu S, Salihoglu S (2024). Optimizing Differential Computation for Large-Scale Graph Processing. Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), pages 1-9. Online publication date: 14-Jun-2024.
https://dl.acm.org/doi/10.1145/3661304.3661900
Wooders S, Mo X, Narang A, Lin K, Stoica I, Hellerstein J, Crooks N, Gonzalez J (2023). RALF: Accuracy-Aware Scheduling for Feature Store Maintenance. Proceedings of the VLDB Endowment, 17:3, pages 563-576. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.14778/3632093.3632116
Budiu M, Chajed T, McSherry F, Ryzhyk L, Tannen V (2023). DBSP: Automatic Incremental View Maintenance for Rich Query Languages. Proceedings of the VLDB Endowment, 16:7, pages 1601-1614. Online publication date: 8-May-2023.
https://dl.acm.org/doi/10.14778/3587136.3587137
An S, Cao Y (2022). Making Cache Monotonic and Consistent. Proceedings of the VLDB Endowment, 16:4, pages 891-904. Online publication date: 1-Dec-2022.
https://dl.acm.org/doi/10.14778/3574245.3574271
Abeysinghe S, He Q, Rompf T, Ives Z, Bonifati A, El Abbadi A (2022). Efficient Incrementialization of Correlated Nested Aggregate Queries using Relative Partial Aggregate Indexes (RPAI). Proceedings of the 2022 International Conference on Management of Data, pages 136-149. Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3517889
Tang D, Shang Z, Ma W, Elmore A, Krishnan S, Li G, Li Z, Idreos S, Srivastava D (2021). Resource-efficient Shared Query Execution via Exploiting Time Slackness. Proceedings of the 2021 International Conference on Management of Data, pages 1797-1810. Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3457282
Jiang X, Xu C, Yin X, Zhao Z, Gupta R, Barbalace A, Bhatotia P, Alvisi L, Cadar C (2021). Tripoline. Proceedings of the Sixteenth European Conference on Computer Systems, pages 17-32. Online publication date: 21-Apr-2021.
https://dl.acm.org/doi/10.1145/3447786.3456226
