
On the Evolutionary of Bloom Filter False Positives - An Information Theoretical Approach to Optimizing Bloom Filter Parameters

Published: 01 July 2023

Abstract

How to calculate the false positive probability of the widely used Bloom Filter (BF), from which the optimal number of hash functions $k$ is conventionally derived, remains an elusive fundamental issue. Bloom gave a false positive formula in 1970. In 2008, Bose et al. pointed out that Bloom's formula is flawed; in 2010, Christensen et al. pointed out that Bose's formula is also flawed and gave another formula. Although Christensen's formula is perfectly accurate, it is time-consuming to evaluate and cannot be used to calculate the optimal value of $k$. Based on the observation that, for a BF with $m$ bits and $n$ elements, the false positive probability is smallest if and only if the entropy is largest, we propose the first approach to calculating the optimal $k$ without any false positive formula. Furthermore, we propose a new and more accurate upper bound for the false positive probability. When the size of a Bloom Filter becomes infinitely large, our upper bound becomes equal to the lower bound, which is Bloom's formula; this deepens our understanding of that formula. In addition, we derive bounds on the correct rate of Counting Bloom Filters (CBFs) by applying our proposed BF formulas to them.
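
The relationship between entropy and false positives stated in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' method or their new bound: it uses Bloom's classic formula (which the abstract notes is not exact) and a per-bit entropy computed under the simplifying assumption that bits are set independently. The function names and the search limit `k_max` are hypothetical choices made for this illustration.

```python
import math


def zero_bit_probability(m: int, n: int, k: int) -> float:
    """Probability that a given bit is still 0 after inserting n elements
    with k hash functions into m bits (assumes uniform, independent hashes)."""
    return (1.0 - 1.0 / m) ** (k * n)


def bloom_fpp(m: int, n: int, k: int) -> float:
    """Bloom's classic false positive estimate; an approximation only."""
    return (1.0 - zero_bit_probability(m, n, k)) ** k


def bit_entropy(m: int, n: int, k: int) -> float:
    """Shannon entropy (in bits) of a single filter bit, treating bits as
    independent. This is an illustrative simplification."""
    p0 = zero_bit_probability(m, n, k)
    if p0 <= 0.0 or p0 >= 1.0:
        return 0.0
    return -(p0 * math.log2(p0) + (1.0 - p0) * math.log2(1.0 - p0))


def best_k_by_fpp(m: int, n: int, k_max: int = 64) -> int:
    """Integer k in [1, k_max] that minimizes Bloom's estimate."""
    return min(range(1, k_max + 1), key=lambda k: bloom_fpp(m, n, k))


def best_k_by_entropy(m: int, n: int, k_max: int = 64) -> int:
    """Integer k in [1, k_max] that maximizes the per-bit entropy,
    found without evaluating any false positive formula."""
    return max(range(1, k_max + 1), key=lambda k: bit_entropy(m, n, k))


if __name__ == "__main__":
    m, n = 10_000, 1_000
    print("k minimizing Bloom's estimate:", best_k_by_fpp(m, n))
    print("k maximizing per-bit entropy: ", best_k_by_entropy(m, n))
    print("textbook (m/n) ln 2:          ", (m / n) * math.log(2))
```

For $m = 10000$ and $n = 1000$, both searches select the same integer $k$, which is also the nearest integer to the textbook value $(m/n)\ln 2$. The entropy-based search reaches that value by looking only at how evenly the bits are split between 0 and 1.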

References

[1]
B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.
[2]
M. Mitzenmacher, P. Reviriego, and S. Pontarelli, “OMASS: One memory access set separation,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 7, pp. 1940–1943, Jul. 2016.
[3]
L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: A scalable wide-area web cache sharing protocol,” IEEE ACM Trans. Netw., vol. 8, no. 3, pp. 281–293, Jun. 2000.
[4]
M. Mitzenmacher, “Distributed, compressed bloom filter web cache server,” US Patent 6,920,477, 2005.
[5]
M. Yoon, “Aging bloom filter with two active buffers for dynamic sets,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 1, pp. 134–138, Jan. 2010.
[6]
F. Ye, H. Luo, S. Lu, and L. Zhang, “Statistical en-route filtering of injected false data in sensor networks,” IEEE J. Sel. Areas Commun., vol. 23, no. 4, pp. 839–850, Apr. 2006.
[7]
M. Luk, G. Mezzour, A. Perrig, and V. Gligor, “MiniSec: A secure sensor network communication architecture,” in Proc. IEEE ACM 6th Int. Symp. Inf. Proces. Sensor Netw., 2007, pp. 479–488.
[8]
T. Chen, D. Guo, Y. He, H. Chen, X. Liu, and X. Luo, “A bloom filters based dissemination protocol in wireless sensor networks,” Ad Hoc Netw., vol. 11, no. 4, pp. 1359–1371, 2013.
[9]
M. Yu, A. Fabrikant, and J. Rexford, “Buffalo: Bloom filter forwarding architecture for large organizations,” in Proc. ACM Conf. Emerg. Netw. Exp. Technol., 2009, pp. 313–324.
[10]
D. Li, H. Cui, Y. Hu, Y. Xia, and X. Wang, “Scalable data center multicast using multi-class bloom filter,” in Proc. IEEE Int. Conf. Netw. Protoc., 2011, pp. 266–275.
[11]
D. Li, Y. Li, J. Wu, S. Su, and J. Yu, “ESM: Efficient and scalable data center multicast routing,” IEEE ACM Trans. Netw., vol. 20, no. 3, pp. 944–955, Jun. 2012.
[12]
A. Papadopoulos and D. Katsaros, “A-tree: Distributed indexing of multidimensional data for cloud computing environments,” in Proc. IEEE Int. Conf. Cloud Comput. Technol. Sci., 2011, pp. 407–414.
[13]
V. Roussev, L. Wang, G. Richard, and L. Marziale, “A cloud computing platform for large-scale forensic computing,” in Proc. IFIP Adv. Inf. Commun. Technol., 2009, pp. 201–214.
[14]
S. Xiong et al., “kBF: Towards approximate and bloom filter based key-value storage for cloud computing systems,” IEEE Trans. Cloud Comput., vol. 5, no. 1, pp. 85–98, Jan.–Mar. 2014.
[15]
S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor, “Longest prefix matching using bloom filters,” ACM SIGCOMM Comput. Commun. Rev., vol. 33, no. 4, pp. 201–212, 2003.
[16]
F. Bonomi, M. Mitzenmacher, R. Panigrah, S. Singh, and G. Varghese, “Beyond bloom filters: From approximate membership checks to approximate state machines,” ACM SIGCOMM Comput. Commun. Rev., vol. 36, no. 4, pp. 315–326, 2006.
[17]
Y. Wang et al., “NameFilter: Achieving fast name lookup with low memory cost via applying two-stage bloom filters,” in Proc. IEEE Conf. Comput. Commun., 2013, pp. 95–99.
[18]
B. Debnath, S. Sengupta, and J. Li, “Flashstore: High throughput persistent key-value store,” Proc. VLDB Endow., vol. 3, no. 1-2, pp. 1414–1425, 2010.
[19]
RocksDB - a persistent key-value store for fast storage environments. [Online]. Available: http://rocksdb.org/
[20]
Y. Li, C. Tian, F. Guo, C. Li, and Y. Xu, “ElasticBF: Elastic bloom filter with hotness awareness for boosting read performance in large key-value stores,” in Proc. USENIX Annu. Tech. Conf., 2019, pp. 739–752.
[21]
N. Dayan, M. Athanassoulis, and S. Idreos, “Optimal bloom filters and adaptive merging for LSM-trees,” ACM Trans. Database Syst., vol. 43, no. 4, pp. 1–48, 2018.
[22]
S. Luo, S. Chatterjee, R. Ketsetsidis, N. Dayan, W. Qin, and S. Idreos, “Rosetta: A robust space-time optimized range filter for key-value stores,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 2071–2086.
[23]
Y. Chai, Y. Chai, X. Wang, H. Wei, and Y. Wang, “Adaptive lower-level driven compaction to optimize LSM-tree key-value stores,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 6, pp. 2595–2609, Jun. 2022.
[24]
R. Schnell, T. Bachteler, and J. Reiher, “Privacy-preserving record linkage using bloom filters,” BMC Med. Inform. Decis. Mak., vol. 9, no. 1, pp. 1–11, 2009.
[25]
E. A. Durham, M. Kantarcioglu, Y. Xue, C. Toth, M. Kuzu, and B. Malin, “Composite bloom filters for secure record linkage,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 12, pp. 2956–2968, Dec. 2014.
[26]
D. Vatsalan and P. Christen, “Scalable privacy-preserving record linkage for multiple databases,” in Proc. ACM Int. Conf. Inf. Knowl. Manag., 2014, pp. 1795–1798.
[27]
R. Schnell and C. Borgs, “Randomized response and balanced bloom filters for privacy preserving record linkage,” in Proc. IEEE Int. Conf. Data Min. Workshops, 2016, pp. 218–224.
[28]
P. Christen, R. Schnell, D. Vatsalan, and T. Ranbaduge, “Efficient cryptanalysis of bloom filters for privacy-preserving record linkage,” in Proc. Pacific-Asia Conf. Knowl. Discov. Data Min., 2017, pp. 628–640.
[29]
P. Christen, T. Ranbaduge, D. Vatsalan, and R. Schnell, “Precise and fast cryptanalysis for bloom filter based privacy-preserving record linkage,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 11, pp. 2164–2177, Nov. 2019.
[30]
S. Jiang et al., “Privacy-preserving and efficient multi-keyword search over encrypted data on blockchain,” in Proc. IEEE Int. Conf. Blockchain, 2019, pp. 405–410.
[31]
T. Wang, W. Zhu, Q. Ma, Z. Shen, and Z. Shao, “Abacus: Address-partitioned bloom filter on address checking for uniqueness in IoT blockchain,” in Proc. IEEE ACM Int. Conf. Comput. Des. Dig. Tech. Pap., 2020, pp. 1–7.
[32]
J. Han, M. Song, H. Eom, and Y. Son, “An efficient multi-signature wallet in blockchain using bloom filter,” in Proc. ACM Symp. Appl. Comput., 2021, pp. 273–281.
[33]
B. Debnath, S. Sengupta, J. Li, D. J. Lilja, and D. H. Du, “Bloomflash: Bloom filter on flash-based storage,” in Proc. IEEE Int. Conf. Distrib. Comput. Syst., 2011, pp. 635–644.
[34]
O. Papapetrou, E. Ioannou, and D. Skoutas, “Efficient discovery of frequent subgraph patterns in uncertain graph databases,” in Proc. IEEE Adv. Database Technol., 2011, pp. 355–366.
[35]
H. Lang, T. Mühlbauer, F. Funke, P. A. Boncz, T. Neumann, and A. Kemper, “Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 311–326.
[36]
K. Cheng, L. Xiang, and M. Iwaihara, “Time-decaying bloom filters for data streams with skewed distributions,” in Proc. IEEE Int. Workshop Res. Issues Data Eng., 2005, pp. 63–69.
[37]
F. Deng and D. Rafiei, “Approximately detecting duplicates for streaming data using stable bloom filters,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 25–36.
[38]
L. Qiu, Y. Li, and X. Wu, “Preserving privacy in association rule mining with bloom filters,” J. Intell. Inf. Syst., vol. 29, no. 3, pp. 253–278, 2007.
[39]
Y. Tian, T. Zou, F. Ozcan, R. Goncalves, and H. Pirahesh, “Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses,” in Proc. Int. Conf. Extending Database Technol., 2015, pp. 373–384.
[40]
H. Dai, M. Shahzad, A. X. Liu, and Y. Zhong, “Finding persistent items in data streams,” Proc. VLDB Endow., vol. 10, no. 4, pp. 289–300, 2016.
[41]
Y. Peng, J. Guo, F. Li, W. Qian, and A. Zhou, “Persistent bloom filter: Membership testing for the entire history,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2018, pp. 1037–1052.
[42]
J. Li et al., “WavingSketch: An unbiased and generic sketch for finding top-K items in data streams,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2020, pp. 1574–1584.
[43]
M. M. Cisse, N. Usunier, T. Artieres, and P. Gallinari, “Robust bloom filters for large multilabel classification tasks,” in Proc. Adv. Neural Inf. Proces. Syst., 2013, pp. 1851–1859.
[44]
M. Mitzenmacher, “A model for learned bloom filters and optimizing by sandwiching,” in Proc. Adv. Neural Inf. Proces. Syst., 2018, pp. 462–471.
[45]
Q. Liu, L. Zheng, Y. Shen, and L. Chen, “Stable learned bloom filters for data streams,” Proc. VLDB Endow., vol. 13, no. 12, pp. 2355–2367, 2020.
[46]
J. R. Anderson, Q. Huang, W. Krichene, S. Rendle, and L. Zhang, “Superbloom: Bloom filter meets transformer,” CoRR, 2020. [Online]. Available: https://arxiv.org/abs/2002.04723
[47]
R. Patgiri, A. Biswas, and S. Nayak, “deepBF: Malicious URL detection using learned bloom filter and evolutionary deep learning,” CoRR, 2021. [Online]. Available: https://arxiv.org/abs/2103.12544
[48]
D. Guo, Y. Liu, X. Li, and P. Yang, “False negative problem of counting bloom filter,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 5, pp. 651–664, May 2010.
[49]
P. Bose et al., “On the false-positive rate of bloom filters,” Inf. Process. Lett., vol. 108, no. 4, pp. 210–213, 2008.
[50]
K. Christensen, A. Roginsky, and M. Jimeno, “A new analysis of the false positive rate of a bloom filter,” Inf. Process. Lett., vol. 110, no. 21, pp. 944–949, 2010.
[51]
Our open source website. [Online]. Available: https://github.com/pkufzc/Bloom-Error-TKDE
[52]
F. Grandi, “The γ-transform: A new approach to the study of a discrete and finite random variable,” Int. J. Math. Models Appl. Sci, vol. 9, pp. 624–635, 2015.
[53]
F. Grandi, “On the analysis of bloom filters,” Inf. Process. Lett., vol. 129, pp. 35–39, 2018.
[54]
C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
[55]
L. Brillouin, “Science and information theory,” Dover Pub., vol. 2, 2004, Art. no.
[56]
The caida anonymized 2016 internet traces, 2016. [Online]. Available: http://www.caida.org/data/overview/
[57]
The data center dataset, 2010. [Online]. Available: http://pages.cs.wisc.edu/tbenson/IMC10_Data.html
[58]
T. Benson, A. Akella, and D. A. Maltz, “Network traffic characteristics of data centers in the wild,” in Proc. ACM SIGCOMM Internet Meas. Conf., 2010, pp. 267–280.
[59]
The network dataset internet traces, 2014. [Online]. Available: http://snap.stanford.edu/data/
[60]
D. M. Powers, “Applications and explanations of Zipf's law,” in Proc. J. Conf. New Methods Lang. Process. Comput. Nat. Lang. Learn., 1998, pp. 151–160.
[61]
A. Rousskov and D. Wessels, “High-performance benchmarking with web polygraph,” Softw. Pract. Exper., vol. 34, no. 2, pp. 187–211, 2004.
[62]
Hash website, 1997. [Online]. Available: http://burtleburtle.net/bob/hash/evahash.html
[63]
C. Henke, C. Schmoll, and T. Zseby, “Empirical evaluation of hash functions for multipoint measurements,” ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 3, pp. 39–50, 2008.

Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 7
July 2023
1090 pages

Publisher

IEEE Educational Activities Department

United States

Qualifiers

  • Research-article
