hatS: A Heterogeneity-Aware Tiered Storage for Hadoop

Published: 26 May 2014
DOI: 10.1109/CCGrid.2014.51

Abstract

    Hadoop has become the de facto large-scale data processing framework for modern analytics applications. A major obstacle to sustaining high performance and scalability in Hadoop is managing data growth while meeting ever higher I/O demands. To this end, a promising trend in storage systems is to utilize hybrid and heterogeneous devices, such as Solid State Disks (SSDs), ramdisks, and Network Attached Storage (NAS), which can help achieve very high I/O rates at acceptable cost. However, the Hadoop Distributed File System (HDFS) is unable to exploit such heterogeneous storage: it assumes that the underlying devices are homogeneous storage blocks and disregards their individual I/O characteristics, which leads to performance degradation. In this paper, we present hatS, a Heterogeneity-Aware Tiered Storage, which is a novel redesign of HDFS into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into the Hadoop ecosystem. hatS also introduces data placement and retrieval policies that improve the utilization of the storage devices based on their characteristics, such as I/O throughput and capacity.
    We evaluate hatS using an actual implementation on a medium-sized cluster consisting of HDDs and two types of SSDs (SATA SSD and PCIe SSD). Experiments show that hatS achieves 32.6% higher read bandwidth, on average, than HDFS for the test Hadoop jobs (such as Grep and TestDFSIO) by directing 64% of the I/O accesses to the SSD tiers. We also evaluate our approach with trace-driven simulations using synthetic Facebook workloads, and show that compared to the standard setup, hatS improves the average I/O rate by 36%, which results in a 26% improvement in job completion time.
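    The abstract does not spell out the internals of the placement policy, but its core idea, weighting each tier by I/O throughput and remaining capacity when choosing where to place a block replica, can be sketched as follows. This is a minimal, hypothetical Java illustration; the StorageTier and placeReplica names and the throughput-times-free-space score are our assumptions, not the paper's actual HDFS modification.

    import java.util.Comparator;
    import java.util.List;

    public class TierPlacementSketch {

        // One storage tier, described by the two characteristics the paper's
        // policies weigh: I/O throughput and capacity. Hypothetical sketch only.
        static class StorageTier {
            final String name;        // e.g. "pcie-ssd", "sata-ssd", "hdd"
            final double readMBps;    // measured read throughput of the tier
            final long capacityBytes; // total capacity of the tier
            long usedBytes = 0;       // bytes already placed on the tier

            StorageTier(String name, double readMBps, long capacityBytes) {
                this.name = name;
                this.readMBps = readMBps;
                this.capacityBytes = capacityBytes;
            }

            // Score a tier by throughput, discounted as it fills up, so fast
            // tiers are preferred until capacity pressure pushes data down.
            double score() {
                double freeFraction = 1.0 - (double) usedBytes / capacityBytes;
                return readMBps * freeFraction;
            }
        }

        // Place one block replica on the highest-scoring tier that has room.
        static StorageTier placeReplica(List<StorageTier> tiers, long blockSize) {
            StorageTier best = tiers.stream()
                    .filter(t -> t.capacityBytes - t.usedBytes >= blockSize)
                    .max(Comparator.comparingDouble(StorageTier::score))
                    .orElseThrow(() -> new IllegalStateException("no tier has space"));
            best.usedBytes += blockSize;
            return best;
        }

        public static void main(String[] args) {
            List<StorageTier> tiers = List.of(
                    new StorageTier("pcie-ssd", 1500.0, 256L << 30),
                    new StorageTier("sata-ssd", 500.0, 512L << 30),
                    new StorageTier("hdd", 120.0, 4096L << 30));
            long block = 128L << 20; // a 128 MB HDFS block
            System.out.println(placeReplica(tiers, block).name); // "pcie-ssd"
        }
    }

    A retrieval policy could rank the replicas of a requested block the same way and serve reads from the fastest tier holding one, which is consistent with the paper's report that most I/O accesses end up directed to the SSD tiers.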



    Published In

    CCGRID '14: Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing
    May 2014
    987 pages
    ISBN: 9781479927838

    Publisher

    IEEE Press


    Author Tags

    1. data placement and retrieval policy
    2. hadoop distributed file system (HDFS)
    3. tiered storage

    Qualifiers

    • Research-article

    Conference

    CCGrid '14


    Cited By

    • (2023) Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems. ACM Transactions on Database Systems 48(4), 1-40. DOI: 10.1145/3625389. Online publication date: 13-Nov-2023.
    • (2021) Trident. Proceedings of the VLDB Endowment 14(9), 1570-1582. DOI: 10.14778/3461535.3461545. Online publication date: 22-Oct-2021.
    • (2020) Mosaic. Proceedings of the VLDB Endowment 13(12), 2662-2675. DOI: 10.14778/3407790.3407852. Online publication date: 1-Jul-2020.
    • (2019) Automating distributed tiered storage management in cluster computing. Proceedings of the VLDB Endowment 13(1), 43-56. DOI: 10.14778/3357377.3357381. Online publication date: 1-Sep-2019.
    • (2019) Approximate Code. Proceedings of the 48th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3337821.3337869. Online publication date: 5-Aug-2019.
    • (2018) bespoKV. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.5555/3291656.3291659. Online publication date: 11-Nov-2018.
    • (2018) bespoKV. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.1109/SC.2018.00005. Online publication date: 11-Nov-2018.
    • (2017) Addressing Hadoop's Small File Problem With an Appendable Archive File Format. Proceedings of the Computing Frontiers Conference, 367-372. DOI: 10.1145/3075564.3078888. Online publication date: 15-May-2017.
    • (2016) High Performance Design for HDFS with Byte-Addressability of NVM and RDMA. Proceedings of the 2016 International Conference on Supercomputing, 1-14. DOI: 10.1145/2925426.2926290. Online publication date: 1-Jun-2016.
    • (2016) MOS. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 177-188. DOI: 10.1145/2907294.2907304. Online publication date: 31-May-2016.
