Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3035918.3035963acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

A General-Purpose Counting Filter: Making Every Bit Count

Published: 09 May 2017 Publication History

Abstract

Approximate Membership Query (AMQ) data structures, such as the Bloom filter, quotient filter, and cuckoo filter, have found numerous applications in databases, storage systems, networks, computational biology, and other domains. However, many applications must work around limitations in the capabilities or performance of current AMQs, making these applications more complex and less performant. For example, many current AMQs cannot delete or count the number of occurrences of each input item, take up large amounts of space, are slow, cannot be resized or merged, or have poor locality of reference and hence perform poorly when stored on SSD or disk. This paper proposes a new general-purpose AMQ, the counting quotient filter (CQF). The CQF supports approximate membership testing and counting the occurrences of items in a data set. This general-purpose AMQ is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access. The paper reports on the structure's performance on both manufactured and application-generated data sets.
In our experiments, the CQF performs in-memory inserts and queries up to an order-of magnitude faster than the original quotient filter, several times faster than a Bloom filter, and similarly to the cuckoo filter, even though none of these other data structures support counting. On SSD, the CQF outperforms all structures by a factor of at least 2 because the CQF has good data locality.
The CQF achieves these performance gains by restructuring the metadata bits of the quotient filter to obtain fast lookups at high load factors (i.e., even when the data structure is almost full). As a result, the CQF offers good lookup performance even up to a load factor of 95%. Counting is essentially free in the CQF in the sense that the structure is comparable or more space efficient even than non-counting data structures (e.g., Bloom, quotient, and cuckoo filters).
The paper also shows how to speed up CQF operations by using new x86 bit-manipulation instructions introduced in Intel's Haswell line of processors. The restructured metadata transforms many quotient filter metadata operations into rank-and-select bit-vector operations. Thus, our efficient implementations of rank and select may be useful for other rank-and-select-based data structures.

References

[1]
F. vesca genome read dataset. ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA020/SRA020125/SRX030576/SRR072006.fastq.bz2. {Online; accessed 19-February-2016}.
[2]
P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):26, 2013.
[3]
P. S. Almeida, C. Baquero, N. Preguiça, and D. Hutchison. Scalable Bloom filters. Journal of Information Processing Letters, 101(6):255--261, 2007.
[4]
S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. J. Carey, M. Dreseler, and C. Li. Storage management in AsterixDB. Proceedings of the VLDB Endowment, 7(10):841--852, 2014.
[5]
M. A. Bender, M. Farach-Colton, R. Johnson, R. Kaner, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok. Don't thrash: How to cache your hash on flash. Proceedings of the VLDB Endowment, 5(11), 2012.
[6]
M. A. Bender, M. Farach-Colton, and M. A. Mosteiro. Insertion sort is O(n log n). Theory of Computing Systems, 39(3):391--397, 2006. Special Issue on FUN '04.
[7]
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970.
[8]
F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An improved construction for counting Bloom filters. In European Symposium on Algorithms (ESA), pages 684--695. Springer, 2006.
[9]
A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485--509, 2004.
[10]
M. Canim, G. A. Mihaila, B. Bhattacharjee, C. A. Lang, and K. A. Ross. Buffered Bloom filters on solid state storage. In Proceedings of the International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS), pages 1--8, 2010.
[11]
S. Cohen and Y. Matias. Spectral Bloom filters. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 241--252, 2003.
[12]
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005.
[13]
B. Corominas-Murtra and R. V. Solé. Universality of Zipf's law. Physical Review E, 82(1):011102, 2010.
[14]
B. Debnath, S. Sengupta, J. Li, D. J. Lilja, and D. H. Du. BloomFlash: Bloom filter on flash-based storage. In Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS), pages 635--644, 2011.
[15]
B. K. Debnath, S. Sengupta, and J. Li. Chunkstash: Speeding up inline storage deduplication using flash memory. In Proceedings of the USENIX Annual Technical Conference (ATC), 2010.
[16]
B. Fan. Cuckoo filter source code in C++. https://github.com/efficient/cuckoofilter, 2014. {Online; accessed 19-July-2014}.
[17]
B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo filter: Practically better than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, pages 75--88, 2014.
[18]
L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3):281--293, 2000.
[19]
A. Geil. Quotient filters: Approximate membership queries on the GPU. http://on-demand.gputechconf.com/gtc/2016/presentation/s6464-afton-geil-quoetient-filters.pdf, 2016.
[20]
E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel de Bruijn graph construction and traversal for de novo genome assembly. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 437--448, 2014.
[21]
R. González, S. Grabowski, V. Mäkinen, and G. Navarro. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA), pages 27--38, 2005.
[22]
S. P. Karl Anderson. Firehose. http://firehose.sandia.gov/, 2013. {Online; accessed 19-Dec-2015}.
[23]
G. Lu, B. Debnath, and D. H. Du. A forest-structured Bloom filter with flash memory. In Proceedings of the 27th Symposium on Mass Storage Systems and Technologies (MSST), pages 1--6, 2011.
[24]
P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics, 12(1):1, 2011.
[25]
P. O'Neil, E. Cheng, D. Gawlic, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351--385, 1996.
[26]
A. Partow. C++Bloom filter library. https://code.google.com/p/bloom/. {Online; accessed 19-July-2014}.
[27]
F. Putze, P. Sanders, and J. Singler. Cache-, hash-and space-efficient bloom filters. In International Workshop on Experimental and Efficient Algorithms, pages 108--121, 2007.
[28]
Y. Qiao, T. Li, and S. Chen. Fast Bloom filters and their generalization. IEEE Transactions on Parallel and Distributed Systems (TPDS), 25(1):93--103, 2014.
[29]
R. S. Roy, D. Bhattacharya, and A. Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30:1950--1957, 2014.
[30]
S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and practice of Bloom filters for distributed systems. IEEE Communications Surveys & Tutorials, 14(1):131--155, 2012.
[31]
M. Vallentin. Counting Bloom filter source code in C++. https://github.com/mavam/libbf, 2014. {Online; accessed 19-July-2015}.
[32]
P. Wang, G. Sun, S. Jiang, J. Ouyang, S. Lin, C. Zhang, and J. Cong. An efficient design and implementation of LSM-tree based key-value store on open-channel SSD. In Proceedings of the 9th European Conference on Computer Systems (EuroSys), pages 16:1--16:14, 2014.
[33]
Q. Zhang, J. Pell, R. Canino-Koning, A. C. Howe, and C. T. Brown. These are not the k-mers you are looking for: Efficient online k-mer counting using a probabilistic data structure. PLoS One, 9(7):e101271, 2014.
[34]
B. Zhu, K. Li, and R. H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST), pages 1--14, 2008.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Publication History

Published: 09 May 2017

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. bloom filters and hashing
  2. computational biology
  3. network monitoring
  4. sketching and sampling

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS'17
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,007
  • Downloads (Last 6 weeks)103
Reflects downloads up to 18 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Đánh giá các thuật toán lọc hiệu quả trong xử lý dữ liệu lớnCTU Journal of Science10.22144/ctujos.2024.42660:5Online publication date: 24-Oct-2024
  • (2024)Aleph Filter: To Infinity in Constant TimeProceedings of the VLDB Endowment10.14778/3681954.368202717:11(3644-3656)Online publication date: 1-Jul-2024
  • (2024)Optimizing Collections of Bloom Filters within a Space BudgetProceedings of the VLDB Endowment10.14778/3681954.368202017:11(3551-3564)Online publication date: 1-Jul-2024
  • (2024)Memento Filter: A Fast, Dynamic, and Robust Range FilterProceedings of the ACM on Management of Data10.1145/36988202:6(1-27)Online publication date: 20-Dec-2024
  • (2024)Adaptive Quotient FiltersProceedings of the ACM on Management of Data10.1145/36771282:4(1-28)Online publication date: 30-Sep-2024
  • (2024)Wormhole Filters: Caching Your Hash on Persistent MemoryProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3629590(456-471)Online publication date: 22-Apr-2024
  • (2024)Beyond Bloom: A Tutorial on Future Feature-Rich FiltersCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654681(636-644)Online publication date: 9-Jun-2024
  • (2024)Space Lower Bounds for Dynamic Filters and Value-Dynamic RetrievalProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649649(1153-1164)Online publication date: 10-Jun-2024
  • (2024)Bamboo Filters: Make Resizing Smooth and AdaptiveIEEE/ACM Transactions on Networking10.1109/TNET.2024.340399732:5(3776-3791)Online publication date: Oct-2024
  • (2024)Cardinality Estimation Adaptive Cuckoo Filters (CE-ACF): Approximate Membership Check and Distinct Query Count for High-Speed Network MonitoringIEEE/ACM Transactions on Networking10.1109/TNET.2023.330230632:2(959-970)Online publication date: Apr-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media