Tempura: a general cost-based optimizer framework for incremental data processing (Journal Version)

  • Regular Paper
  • The VLDB Journal

Abstract

Incremental processing is widely adopted in many applications, ranging from incremental view maintenance and stream computing to the recently emerging progressive data warehouse and intermittent query processing. Despite the many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper, we develop a novel cost-based optimizer framework, called Tempura, for optimizing incremental data processing. We propose an incremental query planning model called TIP based on the concept of time-varying relations, which can formally model incremental processing in its most general form. We give a full specification of Tempura, which not only unifies various existing techniques to generate an optimal incremental plan, but also allows developers to add their own rewrite rules. We study how to explore the plan space and search for an optimal incremental plan. We evaluate Tempura in various incremental processing scenarios to show its effectiveness and efficiency.

Notes

  1. Note that Final also needs to filter out empty groups with zero contributing tuples. We omit this detail for simplicity.

  2. Here, we do not assume that o_id is the primary key of returns. For example, returns could contain multiple records for a single returned order because of different costs such as shipping cost, product damage, and inventory carrying cost.
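
The point made in Note 1 can be illustrated with a minimal sketch, assuming a sum aggregate that is maintained from signed deltas. The function names `apply_delta` and `finalize`, the (sum, count) state layout, and the sample data are hypothetical and are not taken from the paper; the sketch only shows why a final step may need to drop groups whose contributing tuples have all been retracted.

```python
from collections import defaultdict


def apply_delta(state, delta):
    """Fold a batch of signed changes into per-group partial aggregates.

    state maps a group key to [running_sum, contributing_tuple_count].
    delta is an iterable of (key, value, sign), where sign is +1 for an
    inserted tuple and -1 for a retracted one (an assumed encoding).
    """
    for key, value, sign in delta:
        entry = state[key]
        entry[0] += sign * value
        entry[1] += sign
    return state


def finalize(state):
    """Emit final sums, skipping groups with zero contributing tuples."""
    return {key: total for key, (total, count) in state.items() if count > 0}


state = defaultdict(lambda: [0, 0])
apply_delta(state, [("a", 10, +1), ("b", 5, +1)])   # first batch
apply_delta(state, [("b", 5, -1), ("a", 2, +1)])    # second batch retracts b
print(finalize(state))  # {'a': 12}
```

Without the count, group "b" would be indistinguishable from a group whose values legitimately sum to zero, which is why the sketch tracks the number of contributing tuples separately from the sum.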

Author information

Corresponding author

Correspondence to Botong Huang.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Z., Zeng, K., Huang, B. et al. Tempura: a general cost-based optimizer framework for incremental data processing (Journal Version). The VLDB Journal 32, 1315–1342 (2023). https://doi.org/10.1007/s00778-023-00785-1

  • DOI: https://doi.org/10.1007/s00778-023-00785-1