DOI: 10.1145/1944862.1944877
research-article

High throughput data redundancy removal algorithm with scalable performance

Published: 24 January 2011

Abstract

The ever-growing need to process and analyze massive amounts of data from diverse sources, such as telecom call data records, telescope imagery, web pages, stock markets, medical records and other domains, has triggered worldwide research in data-intensive computing. A key requirement is removing redundancy from data, as this improves the compute efficiency of downstream data processing. These application domains need high-throughput deduplication of huge volumes of data flowing at rates of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm. We also present a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500M records, our parallel algorithm performs complete deduplication in 255 s on a 16-core Intel Xeon 5570 architecture, which gives a throughput of around 2M records/s. For 2048-byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm on both multi-core Power6 and Intel Xeon 5570 architectures.
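The throughput figures quoted in the abstract can be checked with simple arithmetic. The sketch below is only a sanity check on the stated numbers, not part of the paper: the function names are hypothetical, and because the abstract does not say whether "GB" is decimal or binary, the decimal convention (10^9 bytes) is assumed by default.

```python
# Figures quoted in the abstract.
RECORDS = 500_000_000     # 500M records
ELAPSED_S = 255           # seconds for complete deduplication
RECORD_BYTES = 2048       # record size used for the GB/s figure
BYTE_RATE_GB_S = 0.81     # reported byte throughput


def records_per_second(records, seconds):
    """Record throughput from a record count and an elapsed time."""
    return records / seconds


def records_per_second_from_bytes(gb_per_s, record_bytes, bytes_per_gb=10**9):
    """Record throughput implied by a byte throughput.

    bytes_per_gb is an assumption: 10**9 (decimal) by default,
    pass 2**30 for the binary convention.
    """
    return gb_per_s * bytes_per_gb / record_bytes


rate = records_per_second(RECORDS, ELAPSED_S)
print(f"{rate / 1e6:.2f}M records/s")

implied = records_per_second_from_bytes(BYTE_RATE_GB_S, RECORD_BYTES)
print(f"{implied / 1e6:.2f}M records/s at {RECORD_BYTES}-byte records")
```

The first figure comes out just under 2M records/s, consistent with the "around 2M records/s" in the abstract. The second suggests that the 0.81 GB/s result corresponds to roughly 0.4M records/s under either GB convention, so the two throughput numbers appear to describe different record sizes; the abstract does not state the record size used in the 500M-record run.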

Published In

HiPEAC '11: Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
January 2011
226 pages
ISBN: 9781450302418
DOI: 10.1145/1944862

Sponsors

  • HiPEAC: HiPEAC Network of Excellence

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bloom filter
  2. data redundancy removal
  3. deduplication
  4. multicore architecture
  5. parallel algorithms
  6. queueing theory
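The author tags pair Bloom filters with deduplication, but this page does not describe how the paper combines them. As a general illustration only, the following is a minimal single-threaded sketch of Bloom-filter-based duplicate removal. The class name, the parameters `num_bits` and `num_hashes`, and the use of SHA-256 with double hashing are all assumptions made for this example, not the authors' parallel algorithm.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hash choice are assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_absent(self, item: bytes) -> bool:
        """Set the item's bits; return True if any bit was previously unset.

        True means the item was definitely not seen before. False means it
        was probably seen before (false positives are possible).
        """
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def deduplicate(records, num_bits=1 << 20, num_hashes=7):
    """Keep the first occurrence of each record, in input order."""
    bloom = BloomFilter(num_bits, num_hashes)
    return [r for r in records if bloom.add_if_absent(r)]


print(deduplicate([b"a", b"b", b"a", b"c", b"b"]))
```

A Bloom filter never misses a true duplicate, but it can occasionally report a new record as already seen, so a filter-only approach like this one may drop a small fraction of unique records. The false positive rate depends on the filter size, the number of hash functions, and the number of distinct records inserted.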

Qualifiers

  • Research-article

Conference

HiPEAC '11
Sponsor:
  • HiPEAC

Cited By

  • (2021) Introduction to data deduplication approaches. Data Deduplication Approaches, pp. 1-15. DOI: 10.1016/B978-0-12-823395-5.00019-7. Online publication date: 2021
  • (2013) G-Paradex. Revised Selected Papers of the 10th International Symposium on Advanced Parallel Processing Technologies - Volume 8299, pp. 91-103. DOI: 10.1007/978-3-642-45293-2_7. Online publication date: 27-Aug-2013
  • (2012) Grabfast: A CUDA based GPU accelerated fast short sequence alignment algorithm. 2012 19th International Conference on High Performance Computing, pp. 1-10. DOI: 10.1109/HiPC.2012.6507502. Online publication date: Dec-2012
  • (2012) Distributed hierarchical co-clustering and collaborative filtering algorithm. 2012 19th International Conference on High Performance Computing, pp. 1-10. DOI: 10.1109/HiPC.2012.6507497. Online publication date: Dec-2012
  • (2012) E-DAID. Proceedings of the 2012 International Conference on Communication Systems and Network Technologies, pp. 438-442. DOI: 10.1109/CSNT.2012.101. Online publication date: 11-May-2012
  • (2012) A Parallel Architecture for In-Line Data De-duplication. Proceedings of the 2012 Second International Conference on Advanced Computing & Communication Technologies, pp. 399-403. DOI: 10.1109/ACCT.2012.10. Online publication date: 7-Jan-2012
  • (2012) TBF: A High-Efficient Query Mechanism in De-duplication Backup System. Advances in Grid and Pervasive Computing, pp. 244-253. DOI: 10.1007/978-3-642-30767-6_21. Online publication date: 2012
  • (2011) DTR-filter. Proceedings of the 5th international conference on Convergence and hybrid information technology, pp. 90-97. DOI: 10.5555/2045005.2045017. Online publication date: 22-Sep-2011
  • (2011) Real-time approximate Range Motif discovery & data redundancy removal algorithm. Proceedings of the 14th International Conference on Extending Database Technology, pp. 485-496. DOI: 10.1145/1951365.1951422. Online publication date: 21-Mar-2011
  • (2011) DTR-Filter: An Efficient Transmission Scheme for Real-Time Monitoring in Wireless Bulky Sensor Networks. Convergence and Hybrid Information Technology, pp. 90-97. DOI: 10.1007/978-3-642-24082-9_11. Online publication date: 2011
