research-article

Shared arrangements: practical inter-query sharing for streaming dataflows

Published: 01 June 2020

    Abstract

    Current systems for data-parallel, incremental processing and view maintenance over high-rate streams isolate the execution of independent queries. This creates unwanted redundancy and overhead in the presence of concurrent incrementally maintained queries: each query must independently maintain the same indexed state over the same input streams, and new queries must build this state from scratch before they can begin to emit their first results.
    This paper introduces shared arrangements: indexed views of maintained state that allow concurrent queries to reuse the same in-memory state without compromising data-parallel performance and scaling. We implement shared arrangements in a modern stream processor and show order-of-magnitude improvements in query response time and resource consumption for incremental, interactive queries against high-throughput streams, while also significantly improving performance in other domains including business analytics, graph processing, and program analysis.
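The sharing idea described above can be illustrated with a toy, single-threaded sketch. It is not the paper's implementation: the class `SharedArrangement`, its methods, and the two example queries are hypothetical names chosen for illustration, and the sketch assumes updates arrive as (key, value, diff) triples. It shows only the core point that one incrementally maintained index can serve several queries, so a newly installed query does not rebuild state from scratch.

```python
from collections import defaultdict


class SharedArrangement:
    """Toy in-memory index over (key, value, diff) updates.

    Hypothetical illustration only: one index is maintained once
    and read by any number of queries.
    """

    def __init__(self):
        # key -> {value -> multiplicity}
        self.index = defaultdict(lambda: defaultdict(int))
        self.updates_applied = 0

    def apply(self, key, value, diff):
        """Apply one incremental update (diff > 0 inserts, diff < 0 retracts)."""
        self.updates_applied += 1
        counts = self.index[key]
        counts[value] += diff
        if counts[value] == 0:
            del counts[value]
        if not counts:
            del self.index[key]

    def lookup(self, key):
        """Return a copy of the values currently indexed under `key`."""
        return dict(self.index.get(key, {}))


def out_degree(edges, node):
    """Query 1: number of outgoing edges, read from the shared index."""
    return sum(edges.lookup(node).values())


def two_hop(edges, node):
    """Query 2: nodes reachable in exactly two steps, read from the same index."""
    reached = set()
    for mid in edges.lookup(node):
        reached.update(edges.lookup(mid))
    return reached


# Maintain the index once for the input stream of edge updates.
edges = SharedArrangement()
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    edges.apply(src, dst, +1)

# A retraction is applied once; both queries observe it.
edges.apply(1, 3, -1)

print(out_degree(edges, 1), two_hop(edges, 1), edges.updates_applied)
```

Both queries read the same index, so each update is applied once no matter how many queries are installed. The sketch leaves out data-parallel sharding, logical timestamps, and compaction.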

      Published In

      Proceedings of the VLDB Endowment  Volume 13, Issue 10
      June 2020
      193 pages
      ISSN:2150-8097

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 June 2020
      Published in PVLDB Volume 13, Issue 10

      Qualifiers

      • Research-article

      Cited By

      • (2024) Optimizing Differential Computation for Large-Scale Graph Processing. Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (1-9). DOI: 10.1145/3661304.3661900. Online publication date: 14-Jun-2024
      • (2023) RALF: Accuracy-Aware Scheduling for Feature Store Maintenance. Proceedings of the VLDB Endowment 17:3 (563-576). DOI: 10.14778/3632093.3632116. Online publication date: 1-Nov-2023
      • (2023) DBSP: Automatic Incremental View Maintenance for Rich Query Languages. Proceedings of the VLDB Endowment 16:7 (1601-1614). DOI: 10.14778/3587136.3587137. Online publication date: 8-May-2023
      • (2022) Making Cache Monotonic and Consistent. Proceedings of the VLDB Endowment 16:4 (891-904). DOI: 10.14778/3574245.3574271. Online publication date: 1-Dec-2022
      • (2022) Efficient Incrementialization of Correlated Nested Aggregate Queries using Relative Partial Aggregate Indexes (RPAI). Proceedings of the 2022 International Conference on Management of Data (136-149). DOI: 10.1145/3514221.3517889. Online publication date: 10-Jun-2022
      • (2021) Resource-efficient Shared Query Execution via Exploiting Time Slackness. Proceedings of the 2021 International Conference on Management of Data (1797-1810). DOI: 10.1145/3448016.3457282. Online publication date: 9-Jun-2021
      • (2021) Tripoline. Proceedings of the Sixteenth European Conference on Computer Systems (17-32). DOI: 10.1145/3447786.3456226. Online publication date: 21-Apr-2021
