Cardinality estimation and dynamic length adaptation for Bloom filters

Published: 01 December 2010

Abstract

Bloom filters are extensively used in distributed applications, especially in distributed databases and distributed information systems, to reduce network requirements and to increase performance. In this work, we propose two novel Bloom filter features that are important for distributed databases and information systems. First, we present a new approach to encode a Bloom filter such that its length can be adapted to the cardinality of the set it represents, with negligible overhead with respect to computation and false positive probability. The proposed encoding allows for significant network savings in distributed databases, as it enables the participating nodes to optimize the length of each Bloom filter before sending it over the network, for example, when executing Bloom joins. Second, we show how to estimate the number of distinct elements in a Bloom filter, for situations where the represented set is not materialized. These situations frequently arise in distributed databases, where estimating the cardinality of the represented sets is necessary for constructing an efficient query plan. The estimation is highly accurate and comes with tight probabilistic bounds. For both features we provide a thorough probabilistic analysis and extensive experimental evaluation which confirm the effectiveness of our approaches.
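
The abstract does not spell out either construction, so the following is only a rough sketch under stated assumptions. It uses the textbook fill-ratio estimator n ≈ -(m/k) · ln(1 - X/m), where m is the filter length in bits, k the number of hash functions, and X the number of set bits, together with a simple "folding" step that halves a filter by OR-ing its two halves. Neither is claimed to be the encoding or the estimator proposed in the paper. The names `BloomFilter`, `estimate_cardinality`, and `folded` are hypothetical, and the salted SHA-256 hashing is an assumption made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter; m (bits) and k (hash functions) are illustrative."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k positions from salted SHA-256 digests. This hashing scheme
        # is an assumption of the sketch, not taken from the paper.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

    def estimate_cardinality(self) -> float:
        # Fill-ratio estimator: n ~ -(m / k) * ln(1 - X / m),
        # where X is the number of set bits.
        x = sum(self.bits)
        if x == self.m:
            return math.inf  # a saturated filter carries no usable estimate
        return -(self.m / self.k) * math.log(1 - x / self.m)

    def folded(self) -> "BloomFilter":
        # Halve the length by OR-ing the two halves. Lookups stay correct
        # because (h % m) % (m // 2) == h % (m // 2) whenever m is even.
        if self.m % 2:
            raise ValueError("length must be even to fold")
        half = self.m // 2
        small = BloomFilter(half, self.k)
        small.bits = [a or b for a, b in zip(self.bits[:half], self.bits[half:])]
        return small


if __name__ == "__main__":
    bf = BloomFilter(m=4096, k=3)
    for i in range(500):
        bf.add(f"key-{i}")
    print("estimated distinct elements:", round(bf.estimate_cardinality()))
    shorter = bf.folded()
    print("folded length:", shorter.m, "still contains key-0:", "key-0" in shorter)
```

In a Bloom join setting, a node could build the filter at a generous length, estimate the cardinality from the set bits alone, and fold the filter before sending it if the estimate shows the length is larger than needed. Folding preserves the no-false-negatives property but raises the false positive probability, so how far to fold is a trade-off the sender has to choose.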

Published In

Distributed and Parallel Databases, Volume 28, Issue 2-3
December 2010
125 pages

Publisher

Kluwer Academic Publishers
United States

Author Tags

1. Bloom filters
2. Distributed databases
3. Distributed information systems

Cited By

• (2024) TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set Sizes. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 387-404. DOI: 10.1145/3627703.3650066. Online publication date: 22-Apr-2024
• (2024) Privacy-preserving WiFi-based crowd monitoring. Transactions on Emerging Telecommunications Technologies 35:3. DOI: 10.1002/ett.4956. Online publication date: 11-Mar-2024
• (2023) A Case for Partitioned Bloom Filters. IEEE Transactions on Computers 72:6, pp. 1681-1691. DOI: 10.1109/TC.2022.3218995. Online publication date: 1-Jun-2023
• (2022) ProbGraph. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1-17. DOI: 10.5555/3571885.3571942. Online publication date: 13-Nov-2022
• (2022) Anonymized Counting of Nonstationary Wi-Fi Devices When Monitoring Crowds. Proceedings of the 25th International ACM Conference on Modeling Analysis and Simulation of Wireless and Mobile Systems, pp. 213-222. DOI: 10.1145/3551659.3559042. Online publication date: 24-Oct-2022
• (2022) Challenges in Automated Measurement of Pedestrian Dynamics. Distributed Applications and Interoperable Systems, pp. 187-199. DOI: 10.1007/978-3-031-16092-9_12. Online publication date: 13-Jun-2022
• (2020) SEIZE: Runtime Inspection for Parallel Dataflow Systems. IEEE Transactions on Parallel and Distributed Systems 32:4, pp. 842-854. DOI: 10.1109/TPDS.2020.3035170. Online publication date: 13-Nov-2020
• (2019) Bloom Hopping. IEEE Transactions on Mobile Computing 18:3, pp. 534-545. DOI: 10.1109/TMC.2018.2840123. Online publication date: 1-Mar-2019
• (2019) Non-parametric Class Completeness Estimators for Collaborative Knowledge Graphs—The Case of Wikidata. The Semantic Web – ISWC 2019, pp. 453-469. DOI: 10.1007/978-3-030-30793-6_26. Online publication date: 26-Oct-2019
• (2019) Decentralized Indexing over a Network of RDF Peers. The Semantic Web – ISWC 2019, pp. 3-20. DOI: 10.1007/978-3-030-30793-6_1. Online publication date: 26-Oct-2019
