Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Saving Money for Analytical Workloads in the Cloud

Published: 30 August 2024 Publication History

Abstract

As users migrate their analytical workloads to cloud databases, it is becoming just as important to reduce monetary costs as it is to optimize query runtime. In the cloud, a query is billed based on either its compute time or the amount of data it processes. We observe that analytical queries are either compute- or IO-bound and each query type executes cheaper in a different pricing model. We exploit this opportunity and propose methods to build cheaper execution plans across pricing models that complete within user-defined runtime constraints. We implement these methods and produce execution plans spanning multiple pricing models that reduce the monetary cost for workloads by as much as 56%. We reduce individual query costs by as much as 90%. The prices chosen by cloud vendors for cloud services also impact savings opportunities. To study this effect, we simulate our proposed methods with different cloud prices and observe that multi-cloud savings are robust to changes in cloud vendor prices. These results indicate the massive opportunity to save money by executing workloads across multiple pricing models.

References

[1]
Josep Aguilar-Saborit, Raghu Ramakrishnan, Krish Srinivasan, Kevin Bocksrocker, Ioannis Alagiannis, Mahadevan Sankara, Moe Shafiei, Jose Blakeley, Girish Dasarathy, Sumeet Dash, Lazar Davidovic, Maja Damjanic, Slobodan Djunic, Nemanja Djurkic, Charles Feddersen, Cesar Galindo-Legaria, Alan Halverson, Milana Kovacevic, Nikola Kicovic, Goran Lukic, Djordje Maksimovic, Ana Manic, Nikola Markovic, Bosko Mihic, Ugljesa Milic, Marko Milojevic, Tapas Nayak, Milan Potocnik, Milos Radic, Bozidar Radivojevic, Srikumar Rangarajan, Milan Ruzic, Milan Simic, Marko Sosic, Igor Stanko, Maja Stikic, Sasa Stanojkov, Vukasin Stefanovic, Milos Sukovic, Aleksandar Tomic, Dragan Tomic, Steve Toscano, Djordje Trifunovic, Veljko Vasic, Tomer Verona, Aleksandar Vujic, Nikola Vujic, Marko Vukovic, and Marko Zivanovic. 2020. POLARIS: the distributed SQL engine in azure synapse. Proceedings of the VLDB Endowment 13, 12 (aug 2020), 3204--3216.
[2]
Amazon Athena [n.d.]. Amazon Athena - Serverless Interactive Query Service - Amazon Web Services. Retrieved 2023-11-21 from https://aws.amazon.com/athena/
[3]
Apache Arrow [n.d.]. Apache Arrow. Retrieved 2024-02-13 from https://arrow.apache.org
[4]
Apache Parquet [n.d.]. Apache Parquet. Retrieved 2022-12-17 from https://parquet.apache.org/
[5]
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD '15). Association for Computing Machinery, New York, NY, USA, 1383--1394.
[6]
Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J. Green, Monish Gupta, Sebastian Hillig, Eric Hotinger, Yan Leshinksy, Jintian Liang, Michael McCreedy, Fabian Nagel, Ippokratis Pandis, Panos Parchas, Rahul Pathak, Orestis Polychroniou, Foyzur Rahman, Gaurav Saxena, Gokul Soundararajan, Sriram Subramanian, and Doug Terry. 2022. Amazon Redshift Re-invented. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 2205--2217.
[7]
Auto-MV [n.d.]. Automated materialized views - Amazon Redshift. Retrieved 2023-04-02 from https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-mv.html
[8]
AWS Batch [n.d.]. What is Batch Processing? - Batch Processing Systems Explained - AWS. Retrieved 2023-04-02 from https://aws.amazon.com/what-is/batch-processing/
[9]
AWS Batch Processing [n.d.]. Batch data processing - Data Analytics Lens. Retrieved 2024-01-07 from https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/batch-data-processing.html
[10]
Azure Synapse [n.d.]. Azure Synapse Analytics | Microsoft Azure. Retrieved 2023-11-21 from https://azure.microsoft.com/en-us/products/synapse-analytics/
[11]
Tiemo Bang, Conor Power, Siavash Ameli, Natacha Crooks, and Joseph M. Hellerstein. 2024. Optimizing the cloud? Don't train models. Build oracles!. In 14th Conference on Innovative Data Systems Research, CIDR 2024, Chaminade, HI, USA, January 14--17, 2024. www.cidrdb.org. https://www.cidrdb.org/cidr2024/papers/p47-bang.pdf
[12]
BatchExecuteStatement [n.d.]. BatchExecuteStatement. Retrieved 2024-02-07 from https://docs.aws.amazon.com/redshift-data/latest/APIReference/API_BatchExecuteStatement.html
[13]
Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 221--230.
[14]
BigQuery [n.d.]. What is BigQuery? Retrieved 2023-11-21 from https://cloud.google.com/bigquery/docs/introduction
[15]
BigQuery analytics [n.d.]. Overview of BigQuery analytics. Retrieved 2024-01-07 from https://cloud.google.com/bigquery/docs/query-overview
[16]
BigQuery Optimizing [n.d.]. Introduction to optimizing query performance | BigQuery. Retrieved 2023-04-14 from https://cloud.google.com/bigquery/docs/best-practices-performance-overview
[17]
BigQuery Pricing Change [n.d.]. Introducing new BigQuery pricing editions. Retrieved 2023-04-09 from https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions
[18]
BigQuery Reliability [n.d.]. Understand reliability | BigQuery. Retrieved 2024-02-11 from https://cloud.google.com/bigquery/docs/reliability-intro#real-time_analytics
[19]
Ranjan Burman, Amit Nayak, Bosco Albuquerque, and Nita Shah. 2022. Best practices to optimize your Amazon Redshift and MicroStrategy deployment | AWS Big Data Blog. Retrieved 2023-04-14 from https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-your-amazon-redshift-and-microstrategy-deployment/
[20]
Angelos Charalambidis, Antonis Troumpoukis, and Stasinos Konstantopoulos. 2015. SemaGrow: optimizing federated SPARQL queries. In Proceedings of the 11th International Conference on Semantic Systems (Vienna, Austria) (SEMANTICS '15). Association for Computing Machinery, New York, NY, USA, 121--128.
[21]
Sarah Chasins, Alvin Cheung, Natacha Crooks, Ali Ghodsi, Ken Goldberg, Joseph E. Gonzalez, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph, Michael W. Mahoney, Aditya Parameswaran, David Patterson, Raluca Ada Popa, Koushik Sen, Scott Shenker, Dawn Song, and Ion Stoica. 2022. The Sky Above The Clouds. arXiv:2205.07147 [cs.DC] https://arxiv.org/abs/2205.07147
[22]
Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. 1999. On Random Sampling over Joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, Pennsylvania, USA) (SIGMOD '99). Association for Computing Machinery, New York, NY, USA, 263--274.
[23]
Rada Chirkova, Alon Y. Halevy, and Dan Suciu. 2002. A formal perspective on the view selection problem. The VLDB Journal The International Journal on Very Large Data Bases 11, 3 (Nov. 2002), 216--237.
[24]
CloudZero. [n.d.]. Meet Our Customers | CloudZero. Retrieved 2023-11-20 from https://www.cloudzero.com/customers
[25]
Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 215--226.
[26]
Shaul Dar, Michael J. Franklin, Björn Þór Jónsson, Divesh Srivastava, and Michael Tan. 1996. Semantic Data Caching and Replacement. In Proceedings of the 22th International Conference on Very Large Data Bases (VLDB '96). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 330--341.
[27]
Dashboard [n.d.]. Real-World Examples of Business Intelligence (BI) Dashboards. Retrieved 2023-04-09 from https://www.tableau.com/learn/articles/business-intelligence-dashboards-examples
[28]
Databricks Data Lakes [n.d.]. Introduction to Data Lakes. Retrieved 2023-04-14 from https://www.databricks.com/discover/data-lakes
[29]
Bappaditya Datta. 2022. Data warehouse and business intelligence technology consolidation using AWS. Retrieved 2024-01-07 from https://aws.amazon.com/blogs/architecture/data-warehouse-and-business-intelligence-technology-consolidation-using-aws/
[30]
Yefim Dinitz. 2006. Dinitz' Algorithm: The Original Version and Even's Version. Springer Berlin Heidelberg, Berlin, Heidelberg, 218--240.
[31]
Dominik Durner, Badrish Chandramouli, and Yinan Li. 2021. Crystal: a unified cache storage system for analytical databases. Proceedings of the VLDB Endowment 14, 11 (July 2021), 2432--2444.
[32]
Jack Edmonds and Richard M Karp. 1972. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM) 19, 2 (1972), 248--264.
[33]
Stephan Ewen, Holger Kache, Volker Markl, and Vijayshankar Raman. 2006. Progressive Query Optimization for Federated Queries. In Advances in Database Technology - EDBT 2006, Yannis Ioannidis, Marc H. Scholl, Joachim W. Schmidt, Florian Matthes, Mike Hatzopoulos, Klemens Boehm, Alfons Kemper, Torsten Grust, and Christian Boehm (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 847--864.
[34]
Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael Jordan, and David Patterson. 2009. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In 2009 IEEE 25th International Conference on Data Engineering. 592--603.
[35]
Kevin Goff. [n.d.]. The Baker's Dozen: 13 Tips for Better Extract/Transform/Load (ETL) Practices in Data Warehousing (Part 1 of 2). Retrieved 2023-04-09 from https://www.codemag.com/article/1709051/The-Baker%E2%80%99s-Dozen-13-Tips-for-Better-Extract-Transform-Load-ETL-Practices-in-Data-Warehousing-Part-1-of-2
[36]
Google BigQuery Storage Pricing [n.d.]. Pricing | BigQuery: Cloud Data Warehouse. Retrieved 2023-04-14 from https://cloud.google.com/bigquery/pricing
[37]
Google ETL [n.d.]. What is ETL? Retrieved 2023-04-09 from https://cloud.google.com/learn/what-is-etl
[38]
T. Heller, S. O. Krumke, and K. H. Küfer. 2021. The Reward-Penalty-Selection Problem. arXiv:2106.14601 [cs.CC] https://arxiv.org/abs/2106.14601
[39]
Randall Hunt. 2017. S3 Select and Glacier Select - Retrieving Subsets of Objects | AWS News Blog. Retrieved 2023-11-21 from https://aws.amazon.com/blogs/aws/s3-glacier-select/
[40]
IBM Serial Batch [n.d.]. What is a workload? Retrieved 2024-02-07 from https://www.ibm.com/topics/workload
[41]
Vanja Josifovski, Peter Schwarz, Laura Haas, and Eileen Lin. 2002. Garlic: a new flavor of federated query processing for DB2. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, Wisconsin) (SIGMOD '02). Association for Computing Machinery, New York, NY, USA, 524--532.
[42]
Kate Keahey, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, Jacob Colleran, Haryadi S. Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, François Halbach, Alex Rocha, and Joe Stubbs. 2020. Lessons learned from the Chameleon testbed. In Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC'20). USENIX Association, Article 15, 15 pages.
[43]
Donald Kossmann, Michael J. Franklin, Gerhard Drasch, and Wig Ag. 2000. Cache investment: integrating query optimization and distributed data placement. ACM Transactions on Database Systems 25, 4 (Dec. 2000), 517--558.
[44]
Yannis Kotidis and Nick Roussopoulos. 1999. DynaMat: a dynamic view management system for data warehouses. ACM SIGMOD Record 28, 2 (June 1999), 371--382.
[45]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (Nov. 2015), 204--215.
[46]
Viktor Leis and Maximilian Kuschewski. 2021. Towards cost-optimal query processing in the cloud. Proceedings of the VLDB Endowment 14, 9 (May 2021), 1606--1612.
[47]
Rundong Li, Ningfang Mi, Mirek Riedewald, Yizhou Sun, and Yi Yao. 2019. Abstract cost models for distributed data-intensive computations. Distributed and Parallel Databases 37, 3 (Sept. 2019), 411--439.
[48]
Xi Liang, Stavros Sintos, Zechao Shang, and Sanjay Krishnan. 2021. Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD '21). Association for Computing Machinery, New York, NY, USA, 1129--1141.
[49]
Guy Lohman. 2014. Is Query Optimization a "Solved" Problem? - ACM SIGMOD Blog. Retrieved 2022-12-19 from http://wp.sigmod.org/?p=1075
[50]
John Martinez. 2021. 50 Years Of ETL: Can SQL For ETL Be Replaced? Retrieved 2023-04-09 from https://www.datanami.com/2021/05/06/50-years-of-etl-can-sql-for-etl-be-replaced/
[51]
McKinsey [n.d.]. Cloud cost-optimization simulator | McKinsey. Retrieved 2023-11-20 from https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/cloud-cost-optimization-simulator
[52]
Thomas P. Nadeau and Toby J. Teorey. 2002. Achieving scalability in OLAP materialized view selection. In Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (McLean, Virginia, USA) (DOLAP '02). Association for Computing Machinery, New York, NY, USA, 28--34.
[53]
Raghunath Othayoth Nambiar and Meikel Poess. 2006. The Making of TPC-DS. In VLDB (Seoul, Korea) (VLDB '06), Vol. 6. 1049--1058.
[54]
Luis L. Perez and Christopher M. Jermaine. 2014. History-aware query optimization with materialized intermediate views. In 2014 IEEE 30th International Conference on Data Engineering. IEEE, Chicago, IL, USA, 520--531.
[55]
Matthew Perron, Zeyuan Shang, Tim Kraska, and Michael Stonebraker. 2019. How I Learned to Stop Worrying and Love Re-optimization. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 1758--1761.
[56]
Orestis Polychroniou, Wangda Zhang, and Kenneth A. Ross. 2018. Distributed Joins and Data Placement for Minimal Network Traffic. ACM Transactions on Database Systems 43, 3 (Nov. 2018), 1--45.
[57]
Avijit Prasad. 2022. Data lakes - Azure Architecture Center. Retrieved 2023-04-14 from https://learn.microsoft.com/en-us/azure/architecture/data-guide/scenarios/data-lake
[58]
Presto [n.d.]. Presto: Free, Open-Source SQL Query Engine for any Data. Retrieved 2023-11-21 from http://prestodb.github.io/
[59]
Redshift Spectrum [n.d.]. Spectrum performance caching and performance | AWS re:Post. Retrieved 2023-04-02 from https://repost.aws/questions/QUaVNX2NJ0REm95-dhZY4O5A/spectrum-performance-caching-and-performance
[60]
Matt Scaer, Manish Vazirani, and Tarun Chaudhary. 2020. Top 10 performance tuning techniques for Amazon Redshift | AWS Big Data Blog. Retrieved 2023-04-14 from https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
[61]
Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, and Christopher Berner. 2019. Presto: SQL on Everything. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, Macao, Macao, 1802--1813.
[62]
Pathik Sharma and James Fu. [n.d.]. Cost optimization best practices for Big-Query. Retrieved 2023-04-14 from https://cloud.google.com/blog/products/data-analytics/cost-optimization-best-practices-for-bigquery
[63]
Snappy [n.d.]. snappy. Retrieved 2022-12-29 from http://google.github.io/snappy/
[64]
Snowflake ETL [n.d.]. The Pitfalls of ETL Processing. Retrieved 2023-04-09 from https://www.snowflake.com/guides/pitfalls-etl-processing
[65]
Gábor Szárnyas, Jack Waudby, Benjamin A. Steer, Dávid Szakállas, Altan Birler, Mingxi Wu, Yuchen Zhang, and Peter Boncz. 2022. The LDBC Social Network Benchmark: Business Intelligence Workload. Proc. VLDB Endow. 16, 4 (dec 2022), 877--890.
[66]
Junjay Tan, Thanaa Ghanem, Matthew Perron, Xiangyao Yu, Michael Stonebraker, David DeWitt, Marco Serafini, Ashraf Aboulnaga, and Tim Kraska. 2019. Choosing a cloud DBMS: architectures and tradeoffs. Proc. VLDB Endow. 12, 12 (aug 2019), 2170--2182.
[67]
Zoiner Tejada. [n.d.]. Online analytical processing (OLAP) - Azure Architecture Center. Retrieved 2024-01-07 from https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/online-analytical-processing
[68]
The Duckbill Group [n.d.]. Duckbill. Retrieved 2023-11-20 from https://www.duckbillgroup.com/
[69]
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010). IEEE, Long Beach, CA, USA, 996--1005.
[70]
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010). IEEE, Long Beach, CA, USA, 996--1005.
[71]
TPC-DS [n.d.]. TPC Benchmark DS. Retrieved 2024-02-13 from https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v3.2.0.pdf
[72]
What is OLAP? [n.d.]. What is Online Analytical Processing? - Online Analytical Processing Explained - AWS. Retrieved 2024-01-07 from https://aws.amazon.com/what-is/olap/
[73]
Sarah Wooders, Shu Liu, Paras Jain, Xiangxi Mo, Joseph E. Gonzalez, Vincent Liu, and Ion Stoica. 2024. Cloudcast: High-Throughput, Cost-Aware Overlay Multicast in the Cloud. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, CA, 281--296. https://www.usenix.org/conference/nsdi24/presentation/wooders
[74]
Fuhui Wu, Qingbo Wu, and Yusong Tan. 2015. Workflow scheduling in cloud: a survey. The Journal of Supercomputing 71, 9 (Sept. 2015), 3373--3418.
[75]
Wentao Wu, Yun Chi, Shenghuo Zhu, Junichi Tatemura, Hakan Hacigümüs, and Jeffrey F. Naughton. 2013. Predicting query execution time: Are optimizer cost models really unusable?. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 1081--1092.
[76]
Zhanghao Wu, Wei-Lin Chiang, Ziming Mao, Zongheng Yang, Eric Friedman, Scott Shenker, and Ion Stoica. 2024. Can't Be Late: Optimizing Spot Instance Savings under Deadlines. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, Santa Clara, CA, 185--203. https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao
[77]
Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M. Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. Proc. VLDB Endow. 13, 3 (nov 2019), 279--292.
[78]
Xiangyao Yu, Matt Youill, Matthew Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2020. PushdownDB: Accelerating a DBMS Using S3 Computation. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 1802--1805.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 17, Issue 11
July 2024
1039 pages
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 30 August 2024
Published in PVLDB Volume 17, Issue 11

Check for updates

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 53
    Total Downloads
  • Downloads (Last 12 months)53
  • Downloads (Last 6 weeks)13
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media