research-article
Open access

Keep Your Distributed Data Warehouse Consistent at a Minimal Cost

Published: 20 June 2023

Abstract

Large data warehouses store interdependent tables that are updated independently in response to business logic changes or late arrival of critical data. To keep the warehouse consistent, changes to upstream tables need to be propagated to downstream tables in a timely fashion. However, a naive change propagation algorithm can cause many unnecessary updates or recalculations of downstream tables, which drives up the cost of data warehouse management.
In this paper, we describe our solution that can ensure the eventual consistency of the data warehouse while avoiding unnecessary table updates. We also show that the optimal trade-off between computational cost reduction and meeting data freshness constraints can be found by solving a dynamic programming problem.
The proposed solution is currently in production to manage the YouTube Data Warehouse and has reduced update requests by 25% by eliminating non-trivial duplicates. These requests would have been carried out by large batch jobs over big data. Eliminating them has led to a proportionate reduction in computing resources.
One key advantage of our approach is that it can be used in a heterogeneous, distributed data warehouse environment where the operator software may not have complete control over the query processors. This is because our approach only relies on having dependency information for tables and can operate on the post-state of data sources.
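
The contrast between naive change propagation and a deduplicated update plan can be illustrated with a small sketch. This is not the paper's algorithm: it only assumes, as the abstract does, that dependency information between tables is available. The table names and the functions `naive_requests` and `deduplicated_plan` are hypothetical, and the dependency graph is assumed to be acyclic.

```python
from collections import deque

# Hypothetical dependency map: upstream table -> tables derived from it.
DEPS = {
    "raw_views": ["daily_views", "daily_revenue"],
    "raw_ads": ["daily_revenue"],
    "daily_views": ["dashboard"],
    "daily_revenue": ["dashboard"],
}
CHANGED = ["raw_views", "raw_ads"]  # upstream tables that received changes


def naive_requests(deps, changed):
    """Naive propagation: every update re-triggers each direct child."""
    requests = []
    queue = deque(changed)
    while queue:
        table = queue.popleft()
        for child in deps.get(table, ()):
            requests.append(child)  # one update request per incoming trigger
            queue.append(child)
    return requests


def deduplicated_plan(deps, changed):
    """One update per affected table, with upstream tables updated first."""
    # Collect every table reachable from the changed sources.
    affected, stack = set(), list(changed)
    while stack:
        table = stack.pop()
        for child in deps.get(table, ()):
            if child not in affected:
                affected.add(child)
                stack.append(child)

    # Kahn's algorithm restricted to the affected tables.
    indegree = {table: 0 for table in affected}
    for table in affected:
        for child in deps.get(table, ()):
            indegree[child] += 1
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        table = ready.popleft()
        order.append(table)
        for child in deps.get(table, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order


if __name__ == "__main__":
    print(len(naive_requests(DEPS, CHANGED)))   # number of naive requests
    print(deduplicated_plan(DEPS, CHANGED))     # one entry per affected table
```

In this toy graph the naive strategy issues six update requests, three of them for `dashboard`, while the deduplicated plan issues three. The sketch ignores data freshness constraints, which the paper handles separately through its dynamic programming formulation.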


    Published In

    Proceedings of the ACM on Management of Data  Volume 1, Issue 2
    PACMMOD
    June 2023
    2310 pages
    EISSN:2836-6573
    DOI:10.1145/3605748
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. consistency management
    2. data warehouse

    Qualifiers

    • Research-article

    Cited By

    • (2024) A Counting-based Approach for Efficient k-Clique Densest Subgraph Discovery. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654922. Online publication date: 30-May-2024
    • (2024) A Similarity-based Approach for Efficient Large Quasi-clique Detection. Proceedings of the ACM Web Conference 2024 (401-409). DOI: 10.1145/3589334.3645374. Online publication date: 13-May-2024
    • (2023) Accelerating directed densest subgraph queries with software and hardware approaches. The VLDB Journal — The International Journal on Very Large Data Bases 33:1 (207-230). DOI: 10.1007/s00778-023-00805-0. Online publication date: 31-Jul-2023
    • (2023) A Survey of Privacy Preserving Subgraph Matching Methods. Artificial Intelligence Security and Privacy (98-113). DOI: 10.1007/978-981-99-9785-5_8. Online publication date: 3-Dec-2023
