Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

What's the Difference? Incremental Processing with Change Queries in Snowflake

Published: 20 June 2023 Publication History
  • Get Citation Alerts
  • Abstract

    Incremental algorithms are the heart and soul of stream processing. Low latency results depend on the ability to react to the subset of changes in a dataset over time rather than reprocessing the entirety of a dataset as it evolves. But while the SQL language is well suited for representing streams of changes (via tables) and their application to tables over time (via DML), it entirely lacks a method to query the changes to a table or view in the first place.
    In this paper, we present CHANGES queries and STREAM objects, Snowflake's primitives for querying and consuming incremental changes to table objects over time. CHANGES queries and STREAMs have been in use within Snowflake for three years, and see broad adoption across our customers. We describe the semantics of these primitives, discuss the implementation challenges, present an analysis of their usage at Snowflake, and contrast with other offerings.

    References

    [1]
    Daniel J. Abadi, Donald Carney, Ugur cC etintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stanley B. Zdonik. 2003. Aurora: a new model and architecture for data stream management. The VLDB Journal, Vol. 12, 2 (2003), 120--139. https://doi.org/10.1007/s00778-003-0095-z
    [2]
    Asaf Adi and Opher Etzion. 2004. Amit -- the situation manager. The VLDB Journal, Vol. 13, 2 (2004), 177--203. https://doi.org/10.1007/s00778-003-0108-y
    [3]
    Gul Agha. 1986. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, MA, USA.
    [4]
    Tyler Akidau, Alex Balikov, Kaya Bekirouglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Proceedings of the VLDB Endowment, Vol. 6, 11 (aug 2013), 1033--1044. https://doi.org/10.14778/2536222.2536229
    [5]
    Tyler Akidau, Slava Chernyak, and Reuven Lax. 2018. Streaming systems: the what, where, when, and how of large-scale data processing. O'Reilly Media, Inc.
    [6]
    Amazon Web Services, Inc. 2022a. Amazon Kinsesis. https://aws.amazon.com/kinesis/ (accessed: November 2022).
    [7]
    Amazon Web Services, Inc. 2022b. Change data capture for DynamoDB Streams. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html (accessed: November 2022).
    [8]
    Apache Beam. 2022. Beam SQL overview. https://beam.apache.org/documentation/dsls/sql/overview/ (accessed: November 2022).
    [9]
    Apache Flink. 2022. Flink -- SQL. https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/sql/overview/ (accessed: November 2022).
    [10]
    Apache Hadoop. 2022. Hadoop. https://hadoop.apache.org (accessed: November 2022).
    [11]
    Apache Hudi. 2022a. Apache Hudi Technical Specification -- Data Model. https://hudi.apache.org/tech-specs/#data-model (accessed: November 2022).
    [12]
    Apache Hudi. 2022b. Hudi -- Spark Guide. https://hudi.apache.org/docs/quick-start-guide/ (accessed: November 2022).
    [13]
    Apache Iceberg. 2022a. Iceberg -- Spark Queries. https://iceberg.apache.org/docs/1.0.0/spark-queries/ (accessed: November 2022).
    [14]
    Apache Iceberg. 2022b. Iceberg Java API -- Update Operations. https://iceberg.apache.org/docs/1.0.0/api/#update-operations (accessed: November 2022).
    [15]
    Apache Samza. 2022. Samza SQL. https://samza.apache.org/learn/documentation/1.6.0/api/samza-sql.html (accessed: November 2022).
    [16]
    Apache Storm. 2022. Storm SQL integration. https://storm.apache.org/releases/2.4.0/storm-sql.html (accessed: November 2022).
    [17]
    Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Itaru Nishizawa, Justin Rosenstein, and Jennifer Widom. 2003. STREAM: The Stanford Stream Data Manager. In Proceedings of the 2003 International Conference on Management of Data, SIGMOD 2003, San Diego, California, USA, June 9--12, 2003, Alon Y. Halevy, Zachary G. Ives, and AnHai Doan (Eds.). ACM, 665. https://doi.org/10.1145/872757.872854
    [18]
    Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2006. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, Vol. 15, 2 (2006), 121--142. https://doi.org/10.1007/s00778-004-0147-z
    [19]
    Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018, Houston, TX, USA, June 10--15, 2019, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 601--613. https://doi.org/10.1145/3183713.3190664
    [20]
    Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, and Kenneth L. Knowles. 2019. One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 1757--1772. https://doi.org/10.1145/3299869.3314040
    [21]
    Edmon Begoli, Jesú s Camacho-Rodr'i guez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018, Houston, TX, USA, June 10--15, 2019, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 221--230. https://doi.org/10.1145/3183713.3190662
    [22]
    Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink#8482;: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull., Vol. 38, 4 (2015), 28--38. http://sites.computer.org/debull/A15dec/p28.pdf
    [23]
    Vinoth Chandar. 2020. Apache Hudi -- Design And Architecture. https://cwiki.apache.org/confluence/display/HUDI/DesignAndArchitecture (accessed: November 2022).
    [24]
    Gianpaolo Cugola and Alessandro Margara. 2010. TESLA: a formally defined event specification language. In Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, DEBS 2010, Cambridge, United Kingdom, July 12--15, 2010, Jean Bacon, Peter R. Pietzuch, Joe Sventek, and Ugur cC etintemel (Eds.). ACM, 50--61. https://doi.org/10.1145/1827418.1827427
    [25]
    Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. Comput. Surveys, Vol. 44, 3 (2012), 15:1--15:62. https://doi.org/10.1145/2187671.2187677
    [26]
    Gianpaolo Cugola and Alessandro Margara. 2013. Deployment strategies for distributed complex event processing. Computing, Vol. 95, 2 (2013), 129--156. https://doi.org/10.1007/s00607-012-0217--9
    [27]
    Databricks, Inc. 2022. Use Delta Lake change data feed on Databricks. https://docs.databricks.com/delta/delta-change-data-feed.html (accessed: November 2022).
    [28]
    Debezium Community. 2022a. Debezium connector for MongoDB. https://debezium.io/documentation/reference/2.0/connectors/mongodb.html (accessed: November 2022).
    [29]
    Debezium Community. 2022b. Debezium connector for MySQL. https://debezium.io/documentation/reference/2.0/connectors/mysql.html (accessed: November 2022).
    [30]
    Debezium Community. 2022c. Debezium connector for PostgreSQL. https://debezium.io/documentation/reference/2.0/connectors/postgresql.html (accessed: November 2022).
    [31]
    Alan J. Demers, Johannes Gehrke, Biswanath Panda, Mirek Riedewald, Varun Sharma, and Walker M. White. 2007. Cayuga: A General Purpose Event Monitoring System. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research, CIDR 2007, Asilomar, CA, USA, January 7--10, 2007. www.cidrdb.org, 412--422. http://cidrdb.org/cidr2007/papers/cidr07p47.pdf
    [32]
    Google LLC. 2022. Work with change history. https://cloud.google.com/bigquery/docs/change-history (accessed: November 2022).
    [33]
    Goetz Graefe. 1995. The cascades framework for query optimization. IEEE Data Eng. Bull., Vol. 18, 3 (1995), 19--29.
    [34]
    Ashish Gupta, Inderpal Singh Mumick, and Venkatramanan Siva Subrahmanian. 1993. Maintaining views incrementally. ACM SIGMOD Record, Vol. 22, 2 (1993), 157--166. https://doi.org/10.1145/170035.170066
    [35]
    IBM. 2022. Temporal tables and data versioning. https://www.ibm.com/docs/en/db2-for-zos/11?topic=tables-temporal-data-versioning (accessed: November 2022).
    [36]
    InterNational Committee for Information Technology Standards. 2022. DM32.2 Task Group on Database. http://www.incits.org/committees/dm32 (accessed: November 2022).
    [37]
    Hojjat Jafarpour and Rohan Desai. 2019. KSQL: Streaming SQL Engine for Apache Kafka. In Proceedings of the 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26--29, 2019, Melanie Herschel, Helena Galhardas, Berthold Reinwald, Irini Fundulaki, Carsten Binnig, and Zoi Kaoudi (Eds.). OpenProceedings.org, 524--533. https://doi.org/10.5441/002/edbt.2019.48
    [38]
    Jay Kreps. 2016. Introducing Kafka Streams: Stream Processing Made Simple. https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/ (accessed: November 2022).
    [39]
    Krishna G. Kulkarni and Jan-Eike Michels. 2012. Temporal features in SQL: 2011. SIGMOD Rec., Vol. 41, 3 (2012), 34--43. https://doi.org/10.1145/2380776.2380786
    [40]
    MariaDB Foundation. 2022. MariaDB -- Temporal Tables. https://mariadb.com/kb/en/temporal-tables/ (accessed: November 2022).
    [41]
    Materialize, Inc. 2022. SUBSCRIBE. https://materialize.com/docs/sql/subscribe/ (accessed: November 2022).
    [42]
    Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In Proceedings of the Sixth Biennial Conference on Innovative Data Systems Research, CIDR 2013, Asilomar, CA, USA, January 6--9, 2013. www.cidrdb.org. http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf
    [43]
    Meroxa, Inc. 2022a. Meroxa -- MongoDB. https://docs.meroxa.com/platform/resources/mongodb (accessed: November 2022).
    [44]
    Meroxa, Inc. 2022b. Meroxa -- MySQL. https://docs.meroxa.com/platform/resources/mysql/setup/ (accessed: November 2022).
    [45]
    Meroxa, Inc. 2022c. Meroxa -- PostgreSQL -- Logical Replication. https://docs.meroxa.com/platform/resources/postgresql/connection-types/logical-replication (accessed: November 2022).
    [46]
    Microsoft, Inc. 2022a. Change feed in Azure Cosmos DB. https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed (accessed: November 2022).
    [47]
    Microsoft, Inc. 2022b. Temporal tables. https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16 (accessed: November 2022).
    [48]
    MongoDB, Inc. 2022. Replica Set Oplog. https://www.mongodb.com/docs/manual/core/replica-set-oplog/ (accessed: November 2022).
    [49]
    Derek Gordon Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Mart'i n Abadi. 2013. Naiad: a timely dataflow system. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP 2013, Farmington, PA, USA, November 3--6, 2013, Michael Kaminsky and Mike Dahlin (Eds.). ACM, 439--455. https://doi.org/10.1145/2517349.2522738
    [50]
    Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H. Campbell. 2017. Stateful Scalable Stream Processing at LinkedIn. Proceedings of the VLDB Endowment, Vol. 10, 12 (2017), 1634--1645. https://doi.org/10.14778/3137765.3137770
    [51]
    Oracle. 2022. MySQL :: MySQL 8.0 Reference Manual :: 5.4.4 The Binary Log. https://dev.mysql.com/doc/refman/8.0/en/binary-log.html (accessed: November 2022).
    [52]
    Oracle, Inc. 2022. Using Oracle Flashback Technology. https://docs.oracle.com/cd/E11882_01/appdev.112/e41502/adfns_flashback.htm#ADFNS1008 (accessed: November 2022).
    [53]
    PostgreSQLs Global Development Group. 2022. Write-Ahead Logging (WAL). https://www.postgresql.org/docs/current/wal-intro.html (accessed: November 2022).
    [54]
    SAP. 2022. Temporal Tables. https://help.sap.com/docs/HANA_SERVICE_CF/6a504812672d48ba865f4f4b268a881e/cf3523ab01834f5e84a32164c1fd597a.html?q=temporal (accessed: November 2022).
    [55]
    Matthias J. Sax, Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag. 2018. Streams and Tables: Two Sides of the Same Coin. In Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics, BIRTE 2018, Rio de Janeiro, Brazil, August 27, 2018. Association for Computing Machinery, New York, NY, USA, Article 1, 10 pages. https://doi.org/10.1145/3242153.3242155
    [56]
    Snowflake Computing, Inc. 2022a. CHANGES. https://docs.snowflake.com/en/sql-reference/constructs/changes.html (accessed: November 2022).
    [57]
    Snowflake Computing, Inc. 2022b. Introduction to Streams. https://docs.snowflake.com/en/user-guide/streams-intro.html (accessed: November 2022).
    [58]
    Snowflake Computing, Inc. 2022c. Introduction To Tasks. https://docs.snowflake.com/en/user-guide/tasks-intro.html (accessed: November 2022).
    [59]
    Snowflake Computing, Inc. 2023. MERGE. https://docs.snowflake.com/en/sql-reference/sql/merge (accessed: March 2023).
    [60]
    Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy V. Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22--27, 2014, Curtis E. Dyreson, Feifei Li, and M. Tamer Ö zsu (Eds.). ACM, 147--156. https://doi.org/10.1145/2588555.2595641
    [61]
    Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. 2021. Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD 2021, Virtual Event, China, June 20--25, 2021. Association for Computing Machinery, New York, NY, USA, 2602--2613. https://doi.org/10.1145/3448016.3457556
    [62]
    Zuozhi Wang, Kai Zeng, Botong Huang, Wei Chen, Xiaozong Cui, Bo Wang, Ji Liu, Liya Fan, Dachuan Qu, Zhenyu Hou, et al. 2020. Tempura: A General Cost Based Optimizer Framework for Incremental Data Processing (Extended Version). arXiv preprint arXiv:2009.13631 (2020).
    [63]
    Burak Yavuz and Prakash Chockalingam. 2019. Introducing Delta Time Travel for Large Scale Data Lakes. https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html (accessed: November 2022).
    [64]
    Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP 2013, Farmington, PA, USA, November 3--6, 2013. Association for Computing Machinery, New York, NY, USA, 423--438. https://doi.org/10.1145/2517349.2522737
    [65]
    Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. 2021. FoundationDB: A Distributed Unbundled Transactional Key Value Store. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD 2021, Virtual Event, China, June 20--25, 2021. Association for Computing Machinery, New York, NY, USA, 2653--2666. https://doi.org/10.1145/3448016.3457559

    Cited By

    View all
    • (2024)Enhancing Regular Expression Processing through Field-Programmable Gate Array-Based Multi-Character Non-Deterministic Finite AutomataElectronics10.3390/electronics1309163513:9(1635)Online publication date: 24-Apr-2024
    • (2024)Simplicity and Elegance in Stream Processing: a 5-Year OdysseyProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3674954(4-5)Online publication date: 24-Jun-2024
    • (2024)Continuous Data Ingestion and Transformation in SnowflakeProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3672430(195-198)Online publication date: 24-Jun-2024
    • Show More Cited By

    Index Terms

    1. What's the Difference? Incremental Processing with Change Queries in Snowflake

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image Proceedings of the ACM on Management of Data
      Proceedings of the ACM on Management of Data  Volume 1, Issue 2
      PACMMOD
      June 2023
      2310 pages
      EISSN:2836-6573
      DOI:10.1145/3605748
      Issue’s Table of Contents
      This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 20 June 2023
      Published in PACMMOD Volume 1, Issue 2

      Author Tags

      1. CDC
      2. IVM
      3. changelogs
      4. query differentiation
      5. streaming SQL

      Qualifiers

      • Research-article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)3,719
      • Downloads (Last 6 weeks)219
      Reflects downloads up to 10 Aug 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Enhancing Regular Expression Processing through Field-Programmable Gate Array-Based Multi-Character Non-Deterministic Finite AutomataElectronics10.3390/electronics1309163513:9(1635)Online publication date: 24-Apr-2024
      • (2024)Simplicity and Elegance in Stream Processing: a 5-Year OdysseyProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3674954(4-5)Online publication date: 24-Jun-2024
      • (2024)Continuous Data Ingestion and Transformation in SnowflakeProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3672430(195-198)Online publication date: 24-Jun-2024
      • (2024)OpenIVM: a SQL-to-SQL Compiler for Incremental ComputationsCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654743(516-519)Online publication date: 9-Jun-2024

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Get Access

      Login options

      Full Access

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media