research-article
Open access

Disco: A Compact Index for LSM-trees

Published: 11 February 2025

Abstract

Many key-value stores and database systems use log-structured merge-trees (LSM-trees) as their storage engines because of their excellent write performance. However, the read performance of LSM-trees is suboptimal due to overlapping sorted runs. Most existing efforts rely on filters to reduce unnecessary I/Os, but filters fundamentally do not help locate items and often become the bottleneck of the system. We identify the lack of an efficient index as the root cause of subpar read performance in LSM-trees. In this paper, we propose Disco, a compact index for LSM-trees. Disco indexes all the keys in an LSM-tree, so a query does not have to search every run of the LSM-tree. It records compact key representations to minimize the number of key comparisons, and thereby the cache misses and I/Os, for both point and range queries. Disco guarantees that both point queries and seeks issue at most one I/O to the underlying runs, achieving an I/O efficiency close to that of a B+-tree. Disco builds on REMIX's pioneering multi-run index design, adding compact key representations that improve read performance. The representations are compact, so the cost of persisting Disco to disk is small. Moreover, while a traditional LSM-tree has to choose a more aggressive compaction policy, which slows down writes, to obtain better read performance, a Disco-indexed LSM-tree can employ a write-efficient policy and still have good read performance. Experimental results show that Disco can save I/Os and improve point and range query performance by up to 220% over RocksDB while maintaining efficient writes.
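
To make the motivation concrete, the following is a minimal toy sketch, not Disco's actual data structure: it contrasts a point lookup that probes overlapping runs one by one with a lookup through a single index spanning all runs. The run contents and the names `get_per_run`, `build_global_index`, and `get_indexed` are hypothetical, and the toy index stores full keys, whereas the abstract states that Disco records compact key representations.

```python
from bisect import bisect_left

# Hypothetical toy data: three overlapping sorted runs, newest run first.
runs = [
    [("b", "b2"), ("k", "k2")],
    [("a", "a1"), ("k", "k1"), ("z", "z1")],
    [("b", "b0"), ("m", "m0")],
]

def get_per_run(runs, key):
    """Baseline lookup: binary-search each run, newest first.
    Returns (value, number_of_runs_probed)."""
    probes = 0
    for run in runs:
        probes += 1
        keys = [k for k, _ in run]
        i = bisect_left(keys, key)
        if i < len(run) and run[i][0] == key:
            return run[i][1], probes
    return None, probes

def build_global_index(runs):
    """Toy multi-run index: one sorted list of (key, run_id, position),
    keeping only the newest version of each key. Full keys are stored
    here for simplicity (an assumption of this sketch)."""
    newest = {}
    for run_id, run in enumerate(runs):
        for pos, (k, _) in enumerate(run):
            newest.setdefault(k, (run_id, pos))  # first seen = newest run
    return sorted((k, r, p) for k, (r, p) in newest.items())

def get_indexed(runs, index, key):
    """Indexed lookup: search the index, then touch at most one run.
    Returns (value, number_of_runs_touched)."""
    keys = [k for k, _, _ in index]
    i = bisect_left(keys, key)
    if i < len(index) and index[i][0] == key:
        _, run_id, pos = index[i]
        return runs[run_id][pos][1], 1
    return None, 0

index = build_global_index(runs)
per_run_result = get_per_run(runs, "m")       # key lives only in the oldest run
indexed_result = get_indexed(runs, index, "m")
```

In this toy setup the per-run lookup for `"m"` probes all three runs, while the indexed lookup touches exactly one; a lookup for a missing key touches no run at all through the index. Keeping such an index small enough to be practical is the part the sketch does not attempt.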


Published In

Proceedings of the ACM on Management of Data  Volume 3, Issue 1
SIGMOD
February 2025
2261 pages
EISSN:2836-6573
DOI:10.1145/3717614
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. LSM-tree
  2. key-value store
  3. storage system

Qualifiers

  • Research-article
