Abstract
The design of the storage engine and the data model directly affects a database system's query performance, so database users must choose a storage engine and design a data model to suit their workload. Under a hybrid workload, however, the query set changes dynamically, and the best-performing storage structure changes with it. Motivated by this, we propose an automatic storage structure selection system based on a learned cost model, which dynamically selects an optimised storage structure for the database under hybrid workloads. The system uses a machine learning method to build a cost model for the storage engine, together with a column-oriented data layout generation algorithm. Experimental results show that the proposed system can choose the optimal combination of storage engine and data model for the current workload, greatly improving performance over the default storage structure. The system is also designed to be compatible with different storage engines, for easy use in practical applications.
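The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate names, the query kinds, and the cost table are all hypothetical, and `toy_cost_model` merely stands in for a learned cost model. The sketch only shows the general idea of scoring each (storage engine, data layout) candidate against the current workload and keeping the cheapest one.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types, not taken from the paper:
# a candidate is an (engine, layout) pair, and a workload is a list of
# (query_kind, relative_frequency) pairs.
Candidate = Tuple[str, str]
Workload = List[Tuple[str, float]]
CostModel = Callable[[Candidate, str], float]


def toy_cost_model(candidate: Candidate, query_kind: str) -> float:
    """Stand-in for a learned cost model; the costs below are made up."""
    _engine, layout = candidate
    costs: Dict[Tuple[str, str], float] = {
        ("row", "point_lookup"): 1.0,
        ("row", "scan"): 10.0,
        ("column", "point_lookup"): 4.0,
        ("column", "scan"): 2.0,
    }
    return costs[(layout, query_kind)]


def workload_cost(candidate: Candidate, workload: Workload, model: CostModel) -> float:
    """Frequency-weighted sum of the predicted per-query costs."""
    return sum(freq * model(candidate, kind) for kind, freq in workload)


def select_structure(
    candidates: List[Candidate], workload: Workload, model: CostModel
) -> Candidate:
    """Return the candidate with the lowest predicted workload cost."""
    return min(candidates, key=lambda c: workload_cost(c, workload, model))


# Hypothetical candidates and workloads for illustration only.
candidates = [("engine_a", "row"), ("engine_b", "column")]
oltp_heavy = [("point_lookup", 0.9), ("scan", 0.1)]
olap_heavy = [("point_lookup", 0.1), ("scan", 0.9)]

best_for_oltp = select_structure(candidates, oltp_heavy, toy_cost_model)
best_for_olap = select_structure(candidates, olap_heavy, toy_cost_model)
```

Re-running `select_structure` whenever the observed query mix shifts is one simple way to mirror the dynamic selection the abstract describes; how often to re-evaluate, and the cost of migrating between structures, are left out of this sketch.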
Acknowledgements
This paper was supported by NSFC Grant (62232005, 62202126, U1866602).
About this article
Cite this article
Wang, H., Wei, Y. & Yan, H. Automatic single table storage structure selection for hybrid workload. Knowl Inf Syst 65, 4713–4739 (2023). https://doi.org/10.1007/s10115-023-01913-7
DOI: https://doi.org/10.1007/s10115-023-01913-7