FlexPushdownDB: hybrid pushdown and caching in a cloud DBMS

Published: 01 July 2021

Abstract

Modern cloud databases adopt a storage-disaggregation architecture that separates the management of computation and storage. A major bottleneck in such an architecture is the network connecting the computation and storage layers. Two solutions have been explored to mitigate the bottleneck: caching and computation pushdown. While both techniques can significantly reduce network traffic, existing DBMSs consider them as orthogonal techniques and support only one or the other, leaving potential performance benefits unexploited.
In this paper we present FlexPushdownDB (FPDB), an OLAP cloud DBMS prototype that supports fine-grained hybrid query execution to combine the benefits of caching and computation pushdown in a storage-disaggregation architecture. We build a hybrid query executor based on a new concept called separable operators to combine the data from the cache and results from the pushdown processing. We also propose a novel Weighted-LFU cache replacement policy that takes into account the cost of pushdown computation. Our experimental evaluation on the Star Schema Benchmark shows that the hybrid execution outperforms both the conventional caching-only architecture and pushdown-only architecture by 2.2X. In the hybrid architecture, our experiments show that Weighted-LFU can outperform the baseline LFU by 37%.
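As a rough illustration of the two ideas named above, the following is a minimal sketch and not FPDB's implementation. The abstract does not give the weight formula or the operator interface, so everything here is an assumption: the names `WeightedLFUCache`, `record_access`, `admit`, `hybrid_filter_scan`, and `pushdown` are hypothetical, segments are assumed to be equal-sized, the weight is assumed to be a caller-supplied estimate of pushdown cost, and a filter is assumed to qualify as a separable operator because it can be evaluated per segment and the partial results concatenated.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class WeightedLFUCache:
    """Hypothetical weighted-LFU cache over equal-sized segments."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity               # max number of cached segments
        self.score: Dict[str, float] = {}      # accumulated weight per segment
        self.data: Dict[str, List[Row]] = {}   # cached segment contents

    def record_access(self, key: str, pushdown_cost: float) -> None:
        # Plain LFU would add 1 here; this variant adds an (assumed)
        # estimate of what serving the access via pushdown would cost.
        self.score[key] = self.score.get(key, 0.0) + pushdown_cost

    def get(self, key: str) -> Optional[List[Row]]:
        return self.data.get(key)

    def admit(self, key: str, rows: List[Row]) -> bool:
        """Try to cache a segment; evict the lowest-scoring one if full."""
        if self.capacity <= 0:
            return False
        if key not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda k: self.score.get(k, 0.0))
            if self.score.get(victim, 0.0) >= self.score.get(key, 0.0):
                return False                   # newcomer is not worth more
            del self.data[victim]
        self.data[key] = rows
        return True


def hybrid_filter_scan(
    segments: Iterable[str],
    cache: WeightedLFUCache,
    predicate: Predicate,
    pushdown: Callable[[str, Predicate], List[Row]],
) -> List[Row]:
    """Filter every segment, using the cache on a hit and pushdown on a miss."""
    result: List[Row] = []
    for key in segments:
        cached = cache.get(key)
        if cached is not None:
            # Cache hit: evaluate the predicate locally on cached rows.
            result.extend(row for row in cached if predicate(row))
        else:
            # Cache miss: the storage layer filters and returns only matches.
            result.extend(pushdown(key, predicate))
    return result
```

Under these assumptions, a segment that is accessed rarely but is expensive to serve through pushdown can displace a segment that is accessed more often but is cheap to push down, which a count-only LFU would not do.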

Published In

Proceedings of the VLDB Endowment  Volume 14, Issue 11
July 2021
732 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 July 2021
Published in PVLDB Volume 14, Issue 11

Qualifiers

  • Research-article
