research-article

Real-time memory efficient data redundancy removal algorithm

Authors:

Souvik Bhattacherjee

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 1259 - 1268

https://doi.org/10.1145/1871437.1871596

Published: 26 October 2010

Abstract

Data-intensive computing has become a central theme in both the research community and industry. There is an ever-growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), and climate warning systems. Removing redundancy in the data is an important problem because it improves resource and compute efficiency for downstream processing of massive datasets (1 billion to 10 billion records). In application domains such as IR, stock markets, and telecom, there is a strong need for real-time data redundancy removal (DRR) on enormous amounts of data flowing at 1 GB/s or more. Real-time, scalable data redundancy removal on massive datasets is a challenging problem. We present the design of a novel parallel data redundancy removal algorithm for both in-memory and disk-based execution. We also develop a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500 million records, our parallel algorithm performs complete de-duplication in 255 s on a 16-core Intel Xeon 5570 architecture with in-memory execution, a throughput of about 2M records/s. For 6 billion records, it performs complete de-duplication in less than 4.5 hours using 6 cores of an Intel Xeon 5570 with disk-based execution, a throughput of around 370K records/s. To the best of our knowledge, this is the highest real-time throughput for data redundancy removal on such massive datasets. We also demonstrate the scalability of our algorithm with an increasing number of cores and increasing data size.
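The two throughput figures quoted in the abstract follow directly from the record counts and running times it reports. The short Python check below is only an arithmetic sketch: the helper name `throughput` is hypothetical and not part of the paper, and the disk-based figure treats "less than 4.5 hours" as exactly 4.5 hours, so it is a lower bound.

```python
def throughput(records: float, seconds: float) -> float:
    """Average number of records processed per second."""
    return records / seconds


# In-memory run: 500 million records in 255 s (16 cores).
in_memory = throughput(500e6, 255)

# Disk-based run: 6 billion records in 4.5 hours (6 cores).
# The abstract says "less than 4.5 hours", so this is a lower bound.
disk_based = throughput(6e9, 4.5 * 3600)

print(f"in-memory:  {in_memory / 1e6:.2f}M records/s")   # 1.96M records/s
print(f"disk-based: {disk_based / 1e3:.0f}K records/s")  # 370K records/s
```

The results, roughly 1.96M and 370K records/s, agree with the rounded values of 2M and 370K records/s given in the abstract.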

Cited By

Dutta S., Bhattacherjee S., Narang A. (2012). Towards "intelligent compression" in streams. Proceedings of the 15th International Conference on Extending Database Technology, pp. 228-238. doi:10.1145/2247596.2247624. Online publication date: 27-Mar-2012.
https://dl.acm.org/doi/10.1145/2247596.2247624

Index Terms

Real-time memory efficient data redundancy removal algorithm
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

October 2010

2036 pages

ISBN: 9781450300995

DOI: 10.1145/1871437

General Chair: Jimmy Huang (York University, Canada)

Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '10: International Conference on Information and Knowledge Management

October 26 - 30, 2010

Toronto, ON, Canada

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

Total Citations: 1

Total Downloads: 377

Downloads (last 12 months): 8

Downloads (last 6 weeks): 1

Reflects downloads up to 11 Jan 2025
