DOI: 10.1145/2247596.2247624

Towards "intelligent compression" in streams: a biased reservoir sampling based Bloom filter approach

Published: 27 March 2012

Abstract

With the explosion of information stored worldwide, data-intensive computing has emerged as a central area of research. Efficiently managing and processing this massive, exponentially growing volume of data from diverse sources, such as telecommunication call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), and climate warning systems, has become a necessity. Removing redundancy from such huge (multi-billion-record) datasets saves resources and computation in downstream processing and constitutes an important area of study. "Intelligent compression", or deduplication, in streaming scenarios, that is, precisely identifying and eliminating duplicates from an unbounded data stream, is an even greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) [13] address this problem to a certain extent. However, SBF suffers from a high false negative rate and a slow convergence rate, making it unsuitable for applications that can tolerate only a low false negative rate.
In this paper, we present a novel reservoir sampling based Bloom filter (RSBF) technique, which combines the concepts of reservoir sampling and Bloom filters for approximate detection of duplicates in data streams. Using detailed theoretical analysis, we prove analytical bounds on its false positive rate, false negative rate, and convergence rate under low memory requirements. We show that RSBF outperforms SBF in terms of false negative rate and convergence rate while consuming the same amount of memory. Using empirical analysis on real-world datasets (3 million records) and synthetic datasets with around 1 billion records, we demonstrate up to a 2× improvement in false negative rate with better convergence rates compared to SBF, while maintaining comparable false positive rates. To the best of our knowledge, this is the first attempt to integrate the reservoir sampling method with Bloom filters for deduplication in streaming scenarios.
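
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal Python sketch of each in isolation: a plain Bloom filter used to drop apparent duplicates from a stream, and uniform reservoir sampling (Vitter's Algorithm R [40]). It is not the RSBF algorithm itself, whose details the abstract does not give. The class and function names, the SHA-256 double-hashing scheme, and the one-byte-per-bit storage are all illustrative assumptions.

```python
import hashlib
import random


class BloomFilter:
    """Plain Bloom filter: k hash positions over an m-slot bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of a SHA-256 digest
        # (double hashing). This hashing scheme is an illustrative choice.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def dedup_stream(stream, bloom):
    """Yield only the items the filter has not (apparently) seen before."""
    for item in stream:
        if item not in bloom:
            bloom.add(item)
            yield item


def reservoir_sample(stream, k, rng):
    """Vitter's Algorithm R: a uniform random sample of k stream items."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # keep item with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    unique = list(dedup_stream(["a", "b", "a", "c", "b"], BloomFilter(1 << 16, 3)))
    print(unique)
    print(reservoir_sample(range(1000), 5, random.Random(0)))
```

A plain filter of this kind never clears bits, so on an unbounded stream it gradually fills up and its false positive rate rises. Approaches such as SBF and RSBF address that setting by also evicting information, which is what introduces the false negatives discussed in the abstract.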

References

[1]
A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, R. J. Lorch, M. Theimer, and R. Wattenhofer. Farsite: Federated, available, and reliable storage for an incompletely trusted environment. In OSDI, 2002.
[2]
C. Aggarwal and P. Yu. Data Streams: Models and Algorithms. Springer, 2007.
[3]
C. C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In VLDB, 2006.
[4]
B. Zhu, K. Li, and R. H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST, pages 269--282, 2008.
[5]
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20--29, 1996.
[6]
B. Babcock, M. Datar, and R. Motwani. Sampling from moving window over streaming data. In SODA, 2002.
[7]
F. Baboescu and G. Varghese. Scalable packet classification. In ACM SIGCOMM, pages 199--210, 2001.
[8]
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proc. SIGKDD, pages 39--48, 2003.
[9]
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422--426, 1970.
[10]
A. Z. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet Mathematics, 1(4):485--509, 2003.
[11]
A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. on Information Systems, 20(2):171--191, 2002.
[12]
J. Conrad, X. Guo, and C. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In CIKM, pages 443--452, 2003.
[13]
F. Deng and D. Rafiei. Approximately detecting duplicates for streaming data using stable bloom filters. In SIGMOD, 2006.
[14]
S. Dharmapurikar, P. Krishnamurthy, and D. Taylor. Longest prefix matching using bloom filters. In ACM SIGCOMM, pages 201--212, 2003.
[15]
F. Douglis, J. Lavoie, J. M. Tracey, and P. Kulkarni. Redundancy elimination within large collections of files. In USENIX, pages 59--72, 2004.
[16]
L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide area web cache sharing protocol. In IEEE/ACM Transactions on Networking, pages 281--293, 2000.
[17]
W. Feng, D. Kandlur, D. Sahu, and K. Shin. Stochastic fair blue: A queue management algorithm for enforcing fairness. In IEEE INFOCOM, pages 1520--1529, 2001.
[18]
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for database applications. J. Comput. Syst. Sci., 31(2):182--209, 1985.
[19]
H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice Hall, 1999.
[20]
V. K. Garg, A. Narang, and S. Bhattacherjee. Real-time memory efficient data redundancy removal algorithm. In CIKM, pages 1259--1268, 2010.
[21]
J. Gehrke, F. Korn, and J. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD, pages 13--24, 2001.
[22]
P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In ACM SIGMOD, pages 331--342, 1998.
[23]
P. Gupta and N. McKeown. Packet classification on multiple fields. In SIGCOMM, pages 147--160, 1999.
[24]
A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. In World Wide Web, volume 2, 1999.
[25]
T. Hofmann. Optimizing distributed joins using bloom filters. Distributed Computing and Internet technology (Springer/LNCS), 5375:145--156, 2009.
[26]
Y. Hua and B. Xiao. A multi-attribute data structure with parallel bloom filters for network services. In International Conference on High Performance Computing, pages 277--288, 2006.
[27]
N. Jain, M. Dahlin, and R. Tewari. Taper: Tiered approach for eliminating redundancy in replica synchronization. In FAST, 2005.
[28]
A. Kumar, J. Xu, J. Wang, O. Spatschek, and L. Li. Space-code bloom filter for efficient per-flow traffic measurement. In IEEE INFOCOM, pages 1762--1773, 2004.
[29]
D. Lee and J. Hull. Duplicate detection in symbolically compressed documents. In ICDAR, pages 305--308, 1999.
[30]
M. Little, N. Speirs, and S. Shrivastava. Using bloom filters to speed-up name lookup in distributed systems. The Computer Journal (Oxford University Press), 45(6):645--652, 2002.
[31]
A. Metwally, D. Agrawal, and A. E. Abbadi. Duplicate detection in click streams. In WWW, 2005.
[32]
M. Mitzenmacher. Compressed bloom filters. In IEEE/ACM Transactions on Networking, pages 604--612, 2002.
[33]
P. Gibbons. Distinct sampling for highly accurate answers to distinct value queries and event reports. In VLDB, 2001.
[34]
F. Putze, P. Sanders, and J. Singler. Cache-, hash-, and space-efficient bloom filters. ACM Journal of Experimental Algorithmics, 14, 2009.
[35]
S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In FAST, pages 89--101, 2002.
[36]
M. Reiter, V. Anupam, and A. Mayer. Detecting hit-shaving in click-through payment schemes. In USENIX, pages 155--166, 1998.
[37]
S. Cohen and Y. Matias. Spectral bloom filters. In ACM SIGMOD, 2003.
[38]
H. Shen and Y. Zhang. Improved approximate detection of duplicates for data streams over sliding windows. J. of Computer Science and Technology, 23(6), 2008.
[39]
N. Tolia, M. Kozuch, M. Satyanarayanan, B. Karp, T. C. Bressoud, and A. Perrig. Opportunistic use of content addressable storage for distributed file systems. In USENIX, pages 127--140, 2003.
[40]
J. S. Vitter. Random sampling with a reservoir. ACM Trans. on Mathematical Software, 11(1):37--57, March 1985.
[41]
M. Weis and F. Naumann. Dogmatrix tracks down duplicates in xml. In Proc. ACM SIGMOD, pages 431--442, 2005.

        Published In

        EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
        March 2012
        643 pages
        ISBN:9781450307901
        DOI:10.1145/2247596
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Bloom filter
        2. data streams
        3. deduplication
        4. reservoir sampling

        Qualifiers

        • Research-article

        Conference

        EDBT '12

        Acceptance Rates

        Overall Acceptance Rate 7 of 10 submissions, 70%

        Article Metrics

        • Downloads (last 12 months): 4
        • Downloads (last 6 weeks): 1
        Reflects downloads up to 12 Jan 2025

        Cited By

        • (2021) A Stateful Bloom Filter for Per-Flow State Monitoring. IEEE Transactions on Network Science and Engineering, 8:2 (1399-1413). DOI: 10.1109/TNSE.2021.3057459. Online publication date: 1-Apr-2021
        • (2019) Ship Spatiotemporal Key Feature Point Online Extraction Based on AIS Multi-Sensor Data Using an Improved Sliding Window Algorithm. Sensors, 19:12 (2706). DOI: 10.3390/s19122706. Online publication date: 16-Jun-2019
        • (2015) Bloofi. Information Systems, 54:C (311-324). DOI: 10.1016/j.is.2015.01.002. Online publication date: 1-Dec-2015
        • (2014) Advanced Algorithms for Efficient Approximate Duplicate Detection in Data Streams Using Bloom Filters. Large Scale and Big Data (409-434). DOI: 10.1201/b17112-14. Online publication date: 12-Jun-2014
        • (2013) Streaming quotient filter. Proceedings of the VLDB Endowment, 6:8 (589-600). DOI: 10.14778/2536354.2536359. Online publication date: 1-Jun-2013
        • (2013) Bloofi. Proceedings of the 2nd International Workshop on Cloud Intelligence (1-8). DOI: 10.1145/2501928.2501931. Online publication date: 26-Aug-2013
        • (2013) Near-optimal approximate membership query over time-decaying windows. 2013 Proceedings IEEE INFOCOM (1447-1455). DOI: 10.1109/INFCOM.2013.6566939. Online publication date: Apr-2013
        • (2013) Reducing the HPC-datastorage footprint with MAFISC--Multidimensional Adaptive Filtering Improved Scientific data Compression. Computer Science - Research and Development, 28:2-3 (231-239). DOI: 10.1007/s00450-012-0222-4. Online publication date: 1-May-2013
