DOI: 10.1145/3075564.3078888
research-article

Addressing Hadoop's Small File Problem With an Appendable Archive File Format

Published: 15 May 2017

    Abstract

    Hadoop is widely used for data analytics tasks in various domains, and data volumes are expected to grow further in the coming years. Hadoop recently introduced the concept of Archival Storage, an automated tiered storage technique that increases storage capacity for long-term storage. However, the scalability of the Hadoop Distributed File System (HDFS) is limited by the total number of files it can store, and the number of files is likely to grow quickly when HDFS is used for archival purposes.
    This paper presents an approach for improving the scalability of HDFS when it is used for archival storage. We present a tool that extends Hadoop Archive to an appendable file format. New files are appended efficiently to one of the existing archive data files without rewriting the whole archive. To this end, a first-fit algorithm fills up the fixed-size data blocks of the archive data files, which are often not fully utilized. Index files are updated using a red-black tree, which provides guaranteed fast lookup and insert performance. We show that the tool performs well for different archive sizes and different numbers of files to add. By distributing new files efficiently, we also reduce the number of data blocks needed for archiving and thus the memory footprint on the NameNode.
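
    The first-fit placement mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name first_fit, the representation of each block as a byte counter, and the rule that a file must fit into a single block are all assumptions made for illustration. Each new file goes into the first block with enough free space, and a new block is opened only when none fits.

```python
def first_fit(file_sizes, block_size, blocks=None):
    """Place files into fixed-size blocks using first fit.

    Hypothetical model (not taken from the paper): `blocks` holds the bytes
    already used in each existing block. Returns the updated usage list and,
    for every file, a (block index, offset) pair.
    """
    blocks = list(blocks) if blocks else []
    placements = []
    for size in file_sizes:
        # Simplifying assumption: a file never spans more than one block.
        if size > block_size:
            raise ValueError("file larger than one block is not handled here")
        for i, used in enumerate(blocks):
            if used + size <= block_size:
                # First block with enough free space wins.
                placements.append((i, used))
                blocks[i] = used + size
                break
        else:
            # No existing block fits, so open a new one.
            placements.append((len(blocks), 0))
            blocks.append(size)
    return blocks, placements


# Two partially filled blocks of capacity 100 receive four new files.
usage, where = first_fit([50, 30, 60, 20], 100, [70, 40])
print(usage)  # [100, 90, 80]
print(where)  # [(1, 40), (0, 70), (2, 0), (2, 60)]
```

    In this toy run the four new files need only one additional block, because two of them fill the free space left in the existing blocks. Fewer blocks means fewer block objects for the NameNode to track, which is the effect the abstract describes.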

    Cited By

    • (2021) An Efficient Approach to Enhance the Scalability of the HDFS: Extended Hadoop Archive (EHAR). 2021 Emerging Trends in Industry 4.0 (ETI 4.0), 1-6. DOI: 10.1109/ETI4.051663.2021.9619367. Online publication date: 19-May-2021
    • (2018) SHAstor: A Scalable HDFS-Based Storage Framework for Small-Write Efficiency in Pervasive Computing. 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 1140-1145. DOI: 10.1109/SmartWorld.2018.00198. Online publication date: Oct-2018

    Published In

    CF'17: Proceedings of the Computing Frontiers Conference
    May 2017
    450 pages
    ISBN: 9781450344876
    DOI: 10.1145/3075564
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Archival Storage
    2. File Systems
    3. HDFS
    4. Hadoop Distributed File System
    5. Metadata Management

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • BMBF

    Conference

    CF '17: Computing Frontiers Conference
    May 15 - 17, 2017
    Siena, Italy

    Acceptance Rates

    CF'17 Paper Acceptance Rate 43 of 87 submissions, 49%;
    Overall Acceptance Rate 273 of 785 submissions, 35%


