
SNARF: a learning-enhanced range filter

Published: 01 April 2022

Abstract

We present Sparse Numerical Array-Based Range Filters (SNARF), a learned range filter that efficiently supports range queries over numerical data. SNARF builds a model of the data distribution and uses it to map the keys into a bit array, which is stored in compressed form. Together, the model and the compressed bit array constitute SNARF and are used to answer membership queries.
We evaluate SNARF on multiple synthetic and real-world datasets, both as a stand-alone filter and integrated into RocksDB. For range queries, SNARF provides up to 50x better false positive rate than state-of-the-art range filters, such as SuRF and Rosetta, with the same space usage. In RocksDB, SNARF serves as a replacement filter that screens requests before they access on-disk data structures. There, SNARF can improve the execution time of the system by up to 10x compared to SuRF and Rosetta for certain read-only workloads.
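
The idea of mapping keys through a learned distribution model into a bit array can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the model or the compression scheme, so the sketch assumes a piecewise-linear CDF estimate built from sampled keys and keeps the set bit positions in a plain sorted list instead of a compressed bit array. The names ToyRangeFilter, bits_per_key, and sample_every are hypothetical. Because the assumed model is monotone non-decreasing, every key inside a queried range maps to a position between the positions of the range endpoints, so the sketch never returns a false negative.

```python
import bisect


class ToyRangeFilter:
    """Illustrative learned range filter (assumptions: linear-spline model,
    uncompressed sorted list of set bit positions)."""

    def __init__(self, keys, bits_per_key=8, sample_every=4):
        keys = sorted(set(keys))
        if not keys:
            raise ValueError("need at least one key")
        n = len(keys)
        self.num_bits = n * bits_per_key
        # Model: keep every sample_every-th key as a spline knot,
        # always including the largest key.
        idx = list(range(0, n, sample_every))
        if idx[-1] != n - 1:
            idx.append(n - 1)
        self.knot_keys = [keys[i] for i in idx]
        self.knot_cdf = [i / n for i in idx]
        # "Bit array": sorted positions of the bits that are set.
        self.set_bits = sorted({self._position(k) for k in keys})

    def _cdf(self, x):
        # Monotone non-decreasing estimate of the fraction of keys below x.
        if x <= self.knot_keys[0]:
            return self.knot_cdf[0]
        if x >= self.knot_keys[-1]:
            return self.knot_cdf[-1]
        j = bisect.bisect_right(self.knot_keys, x)
        x0, x1 = self.knot_keys[j - 1], self.knot_keys[j]
        y0, y1 = self.knot_cdf[j - 1], self.knot_cdf[j]
        # Clamp to y1 so floating-point rounding cannot break monotonicity.
        return min(y1, y0 + (y1 - y0) * (x - x0) / (x1 - x0))

    def _position(self, x):
        return min(int(self._cdf(x) * self.num_bits), self.num_bits - 1)

    def may_contain_range(self, lo, hi):
        """False means no key lies in [lo, hi]; True means one might."""
        if lo > hi:
            return False
        p_lo, p_hi = self._position(lo), self._position(hi)
        i = bisect.bisect_left(self.set_bits, p_lo)
        return i < len(self.set_bits) and self.set_bits[i] <= p_hi


keys = [10, 20, 30, 1000, 2000, 3000, 50000, 50010]
f = ToyRangeFilter(keys)
print(f.may_contain_range(900, 1100))      # range holding key 1000
print(f.may_contain_range(10000, 20000))   # range holding no key
```

A larger bits_per_key spreads the keys over more bit positions, which lowers the chance that an empty range overlaps a set bit at the cost of more space. A real filter would also compress the bit array, which this sketch omits.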

Published In

Proceedings of the VLDB Endowment  Volume 15, Issue 8
April 2022
220 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article
