Optimal Bloom Filters and Adaptive Merging for LSM-Trees

Published: 08 December 2018

Abstract

    In this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult-to-tune tradeoff among these metrics. We trace the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false-positive rates across the LSM-tree’s different levels.
    We present Monkey, an LSM-tree-based key-value store that strikes the optimal balance between the costs of updates and lookups for any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false-positive rates of the Bloom filters across all levels of the LSM-tree. In contrast to state-of-the-art key-value stores, which assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false-positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically, using an implementation on top of RocksDB, that Monkey reduces lookup latency by an increasing margin as the data volume grows (50--80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best tradeoff among lookup cost, update cost, and main memory footprint, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.
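
The allocation idea in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes the textbook Bloom filter approximation (false-positive rate `exp(-(bits/entries) * ln(2)^2)` with an optimal number of hash functions), one filter per level, and levels whose entry counts grow by a fixed size ratio. All function names and parameter values are hypothetical. Under these assumptions, minimizing the sum of false-positive rates for a fixed memory budget gives each level a rate proportional to its number of entries (capped at 1), which the sketch finds by bisection and compares against a uniform bits-per-element allocation.

```python
import math

LN2_SQ = math.log(2) ** 2  # constant in the textbook Bloom filter FPR approximation


def bloom_fpr(bits, entries):
    """Approximate false-positive rate of a Bloom filter with optimal hash count."""
    return math.exp(-(bits / entries) * LN2_SQ)


def level_sizes(levels, size_ratio, largest):
    """Entries per level, assuming each level holds size_ratio times more than the one above."""
    return [largest / size_ratio ** (levels - 1 - i) for i in range(levels)]


def uniform_fprs(sizes, total_bits):
    """Baseline: the same number of bits per element for every level's filter."""
    bits_per_entry = total_bits / sum(sizes)
    return [bloom_fpr(bits_per_entry * n, n) for n in sizes]


def tuned_fprs(sizes, total_bits):
    """Minimize the sum of FPRs under the memory budget.

    The Lagrangian condition for this model gives p_i = min(1, scale * n_i);
    we bisect on `scale` until the implied memory matches the budget.
    """
    def bits_needed(scale):
        return sum(-n * math.log(min(1.0, scale * n)) / LN2_SQ for n in sizes)

    lo, hi = 1e-30, 1.0 / min(sizes)  # hi makes every p_i = 1, i.e. zero bits
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since scale spans many orders
        if bits_needed(mid) > total_bits:
            lo = mid  # too much memory required: allow higher false-positive rates
        else:
            hi = mid
    return [min(1.0, hi * n) for n in sizes]


if __name__ == "__main__":
    sizes = level_sizes(levels=5, size_ratio=10, largest=1e8)  # hypothetical tree shape
    budget = 5 * sum(sizes)  # hypothetical budget: 5 bits per entry overall
    print("uniform sum of FPRs:", sum(uniform_fprs(sizes, budget)))
    print("tuned sum of FPRs:  ", sum(tuned_fprs(sizes, budget)))
```

In this model the smaller levels receive more bits per entry and the largest level receives fewer, so the summed false-positive rate (the quantity the abstract ties to worst-case lookup cost) is lower than under the uniform allocation for the same memory budget.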

    Published In

    ACM Transactions on Database Systems  Volume 43, Issue 4
    Best of SIGMOD 2017 Papers
    December 2018
    173 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/3298792
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 December 2018
    Accepted: 01 September 2018
    Revised: 01 September 2018
    Received: 01 December 2017
    Published in TODS Volume 43, Issue 4

    Author Tags

    1. Bloom filters
    2. LSM-tree
    3. NoSQL
    4. key-value stores
    5. system design

    Qualifiers

    • Research-article
    • Research
    • Refereed
