Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1871437.1871596acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Real-time memory efficient data redundancy removal algorithm

Published: 26 October 2010 Publication History

Abstract

Data intensive computing has become a central theme in research community and industry. There is an ever growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc. Removing redundancy in the data is an important problem as it helps in resource and compute efficiency for downstream processing of the massive (1 billion to 10 billion records) datasets. In application domains such as IR, stock markets, telecom and others, there is a strong need for real-time data redundancy removal (referred to as DRR) of enormous amounts of data flowing at the rate of 1 GB/s or more. Real-time scalable data redundancy removal on massive datasets is a challenging problem. We present the design of a novel parallel data redundancy removal algorithm for both in-memory and disk-based execution. We also develop queueing theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500 million records, our parallel algorithm can perform complete de-duplication in 255s, on 16 core Intel Xeon 5570 architecture, with in-memory execution. This gives a throughput of 2M records/s. For 6 billion records, our parallel algorithm can perform complete de-duplication in less than 4.5 hours, using 6 cores of Intel Xeon 5570, with disk-based execution. This gives a throughput of around 370K records/s. To the best of our knowledge, this is the highest real-time throughput for data redundancy removal on such massive datasets. We also demonstrate the scalability of our algorithm with increasing number of cores and data.

References

[1]
F. Baboescu and G. Varghese. Scalable packet clasification. In ACM SIGCOMM, pages 199--210, 2001.
[2]
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422--426, 1970.
[3]
A. Z. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet Mathematics, 1(4):485--509, 2003.
[4]
Y. Chen, A. Kumar, and J. Xu. A new design of bloom filter for packet inspection speedup. In GLOBECOMM, pages 1--5, 2007.
[5]
S. Dharmapurikar, P. Krishnamurthy, T. S. Sproull, and J. W. Lockwood. Deep packet inspection using parallel bloom filters. IEEE Micro, 24(1):52--61, 2004.
[6]
S. Dharmapurikar, P. Krishnamurthy, and D. Taylor. Longest prefix matching using bloom filters. In ACM SIGCOMM, pages 201--212, 2003.
[7]
L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide area web cache sharing protocol. In IEEE/ACM Transactions on Networking, pages 281--293, 2000.
[8]
W. Feng, D. Kandlur, D. Sahu, and K. Shin. Stochastic fair blue: A queue management algorithm for enforcing fairness. In IEEE INFOCOM, pages 1520--1529, 2001.
[9]
T. Hofmann. Optimizing distributed joins using bloom filters. Distributed Computing and Internet technology (Springer / LNCS), 5375:145--156, 2009.
[10]
Y. Hua and B. Xiao. A multi-attribute data structure with parallel bloom filters for network services. In International Conference on High Performance Computing, pages 277--288, 2006.
[11]
L. Kleinrock. Queueing Systems, Volume I: Theory. Wiley Interscience, New York, NY, USA, 1975.
[12]
A. Kumar, J. Xu, J. Wang, O. Spatschek, and L. Li. Space-code bloom filter for efficient per-flow traffic measurement. In IEEE INFOCOM, pages 1762--1773, 2004.
[13]
M. Little, N. Speirs, and S. Shrivastava. Using bloom filters to speed-up name lookup in distributed systems. The Computer Journal (Oxford University Press), 45(6):645--652, 2002.
[14]
M. Mitzenmacher. Compressed bloom filters. In IEEE/ACM Transactions on Networking, pages 604--612, 2002.
[15]
F. Putze, P. Sanders, and J. Singler. Cache-, hash-, and space-efficient bloom filters. ACM Journal of Experimental Algorithmics, 14, 2009.
[16]
S. M. Ross. Introduction to Probability Models. Academic Press, tenth edition, 2009.
[17]
C. Saar and M. Yossi. Spectral bloom filters. In ACM SIGMOD, 2003.

Cited By

View all
  • (2012)Towards "intelligent compression" in streamsProceedings of the 15th International Conference on Extending Database Technology10.1145/2247596.2247624(228-238)Online publication date: 27-Mar-2012

Index Terms

  1. Real-time memory efficient data redundancy removal algorithm

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
    October 2010
    2036 pages
    ISBN:9781450300995
    DOI:10.1145/1871437
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 October 2010

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. bloom filter
    2. data and knowledge mining
    3. data de-duplication
    4. high performance computing
    5. queueing theory

    Qualifiers

    • Research-article

    Conference

    CIKM '10

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    CIKM '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)8
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 11 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2012)Towards "intelligent compression" in streamsProceedings of the 15th International Conference on Extending Database Technology10.1145/2247596.2247624(228-238)Online publication date: 27-Mar-2012

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media