Addressing Hadoop's Small File Problem With an Appendable Archive File Format

Authors:

Thomas Renner,

Johannes Müller,

Lauritz Thamsen, and

Odej Kao

CF'17: Proceedings of the Computing Frontiers Conference

May 2017

Pages 367 - 372

https://doi.org/10.1145/3075564.3078888

Published: 15 May 2017

Abstract

Hadoop has been widely used for data analytics tasks in various domains. At the same time, data volumes are expected to grow even further in the coming years. Hadoop recently introduced Archival Storage, an automated tiered-storage technique that increases capacity for long-term storage. However, the scalability of the Hadoop Distributed File System (HDFS) is limited by the total number of files it can store, and the number of files is likely to grow quickly when HDFS is used for archival purposes.

This paper presents an approach for improving the scalability of HDFS when it is used as archival storage. We present a tool that extends Hadoop Archive into an appendable file format. New files are appended efficiently to one of the existing archive data files without rewriting the whole archive. To this end, a first-fit algorithm fills up the fixed-size data blocks of the archive data files, which are often not fully utilized. Index files are updated using a red-black tree, which provides guaranteed fast lookup and insert performance. We show that the tool performs well for different archive sizes and numbers of files to add. By distributing new files efficiently, we also reduce the number of data blocks needed for archiving and thus the memory footprint on the NameNode.
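The first-fit placement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `PartFile` class, the function names, and the rule that a new file is placed only if it fits entirely into the free space of a data file's last block are all assumptions made for illustration. The block size is passed in as a plain parameter.

```python
import math
from dataclasses import dataclass


@dataclass
class PartFile:
    """Hypothetical stand-in for one archive data file."""
    name: str
    size: int  # bytes already stored in this data file


def free_in_last_block(part: PartFile, block_size: int) -> int:
    """Unused bytes in the last, partially filled block (0 if it is full)."""
    remainder = part.size % block_size
    return block_size - remainder if remainder else 0


def first_fit_append(parts: list, new_file_sizes: list, block_size: int) -> list:
    """Place each new file into the first data file whose last block has room.

    Assumption: if no existing data file fits, a new data file is created.
    Returns the name of the chosen data file for every new file, in order.
    """
    placements = []
    for size in new_file_sizes:
        # First fit: take the first candidate in list order, not the best one.
        target = next(
            (p for p in parts if free_in_last_block(p, block_size) >= size),
            None,
        )
        if target is None:
            target = PartFile(name=f"part-{len(parts)}", size=0)
            parts.append(target)
        target.size += size
        placements.append(target.name)
    return placements


def total_blocks(parts: list, block_size: int) -> int:
    """Number of fixed-size blocks occupied by all data files."""
    return sum(math.ceil(p.size / block_size) for p in parts)
```

With a toy block size of 100 bytes, data files of 130 and 250 bytes, and new files of 60, 40 and 80 bytes, this sketch places the files into `part-0`, `part-1` and a new `part-2`, for 6 blocks in total. Writing all three new files into one fresh data file would instead occupy 7 blocks.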

Published In

CF'17: Proceedings of the Computing Frontiers Conference

May 2017

450 pages

ISBN:9781450344876

DOI:10.1145/3075564

General Chair: Roberto Giorgi (University of Siena, IT)

Program Chairs: Michela Becchi (North Carolina State University), Francesca Palumbo (University of Sassari, IT)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

BMBF

Conference

CF '17

Sponsor:

SIGMICRO

CF '17: Computing Frontiers Conference

May 15 - 17, 2017

Siena, Italy

Acceptance Rates

CF'17 Paper Acceptance Rate 43 of 87 submissions, 49%;

Overall Acceptance Rate 273 of 785 submissions, 35%
