research-article

SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads

Authors:

Farzaneh Zirak,

Farhana Choudhury, and

Renata Borovica-Gajic

Proceedings of the VLDB Endowment, Volume 17, Issue 8

Pages 2064 - 2076

https://doi.org/10.14778/3659437.3659458

Published: 31 May 2024

Abstract

Prefetching is a crucial technique that traditional databases use to improve interactivity, particularly during data exploration. Data exploration is a query processing paradigm in which users search for insights buried in the data, often without knowing exactly what they are looking for. Data exploration tools face several challenges, chief among them the need for interactivity when no a priori knowledge is available to guide system tuning. State-of-the-art prefetchers are designed only for navigational workloads, where the number of possible actions is limited. Prefetchers that work with SQL-based workloads, on the other hand, rely mainly on logical data addresses rather than data semantics. They fail to predict complex access patterns when the database is large, which yields an extensive address space, or when data is frequently co-accessed. In this paper, we propose SeLeP, a semantic prefetcher that makes prefetching decisions for both types of workloads based on an encoding of the data values contained in the accessed blocks. Following the popular approach of using machine learning to learn hidden patterns automatically, we formulate the prefetching task as a time-series forecasting problem and use an encoder-decoder LSTM architecture to learn the data access pattern. Our extensive experiments across real-life exploratory workloads demonstrate that SeLeP improves the hit ratio by up to 40% and reduces I/O time by up to 45% compared to the state of the art, attaining a 96% hit ratio and an 84% I/O reduction on average.
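To make the time-series formulation concrete, the following minimal Python sketch shows one way a block-access trace could be turned into supervised (input, target) pairs for a sequence model. It is an illustration under stated assumptions, not the authors' implementation: the names `encode_block` and `make_windows`, the `lookback` and `horizon` parameters, and the per-column mean used as a block encoding are all hypothetical stand-ins. The abstract does not specify how SeLeP encodes block values or how its training windows are built.

```python
from statistics import mean


def encode_block(rows):
    """Toy stand-in for a semantic encoding (an assumption, not SeLeP's
    encoder): the per-column mean of the values stored in a block."""
    return [mean(col) for col in zip(*rows)]


def make_windows(trace, blocks, lookback, horizon):
    """Frame a block-access trace as a forecasting dataset.

    Each input is the encodings of `lookback` consecutively accessed
    blocks; each target is the ids of the next `horizon` blocks.
    """
    encoded = [encode_block(blocks[b]) for b in trace]
    pairs = []
    for i in range(len(trace) - lookback - horizon + 1):
        x = encoded[i:i + lookback]
        y = trace[i + lookback:i + lookback + horizon]
        pairs.append((x, y))
    return pairs


# Hypothetical data: three blocks, each holding two rows of two columns.
blocks = {
    0: [(1.0, 10.0), (3.0, 20.0)],
    1: [(5.0, 30.0), (7.0, 50.0)],
    2: [(9.0, 70.0), (11.0, 90.0)],
}
trace = [0, 1, 2, 1, 0]  # order in which a workload touched the blocks

pairs = make_windows(trace, blocks, lookback=2, horizon=1)
for x, y in pairs:
    print(x, "->", y)
```

Pairs of this shape are what an encoder-decoder model would consume: the encoder reads the input window and the decoder predicts the blocks to prefetch. Because the inputs are derived from block contents rather than block addresses, two blocks with similar values look similar to the model regardless of where they are stored.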


Published In

Proceedings of the VLDB Endowment Volume 17, Issue 8

April 2024

335 pages

ISSN:2150-8097

Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)

Publisher

VLDB Endowment

Badges

Artifacts Available / v1.1
