hatS: A Heterogeneity-Aware Tiered Storage for Hadoop

Published: 26 May 2014
DOI: 10.1109/CCGrid.2014.51

Abstract

    Hadoop has become the de facto large-scale data processing framework for modern analytics applications. A major obstacle to sustaining high performance and scalability in Hadoop is managing data growth while meeting ever higher I/O demands. To this end, a promising trend in storage systems is to utilize hybrid and heterogeneous devices, such as Solid State Disks (SSDs), ramdisks, and Network Attached Storage (NAS), which can help achieve very high I/O rates at acceptable cost. However, the Hadoop Distributed File System (HDFS) is unable to exploit such heterogeneous storage: it assumes that the underlying devices are homogeneous storage blocks and disregards their individual I/O characteristics, which leads to performance degradation. In this paper, we present hatS, a Heterogeneity-Aware Tiered Storage, which is a novel redesign of HDFS into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into the Hadoop ecosystem. hatS also introduces data placement and retrieval policies that improve the utilization of the storage devices based on their characteristics, such as I/O throughput and capacity.
    We evaluate hatS using an actual implementation on a medium-sized cluster consisting of HDDs and two types of SSDs (SATA SSD and PCIe SSD). Experiments show that hatS achieves 32.6% higher read bandwidth, on average, than HDFS for the test Hadoop jobs (such as Grep and TestDFSIO) by directing 64% of the I/O accesses to the SSD tiers. We also evaluate our approach with trace-driven simulations using synthetic Facebook workloads, and show that compared to the standard setup, hatS improves the average I/O rate by 36%, which results in a 26% improvement in job completion time.
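    The abstract does not spell out the internals of the placement policy, but its core idea, weighting each tier by I/O throughput and remaining capacity when choosing where to place a block replica, can be sketched as follows. This is a minimal, hypothetical Java illustration; the StorageTier and placeReplica names and the throughput-times-free-space score are our assumptions, not the paper's actual HDFS modification.

    import java.util.Comparator;
    import java.util.List;

    public class TierPlacementSketch {

        // One storage tier, described by the two characteristics the paper's
        // policies weigh: I/O throughput and capacity. Hypothetical sketch only.
        static class StorageTier {
            final String name;        // e.g. "pcie-ssd", "sata-ssd", "hdd"
            final double readMBps;    // measured read throughput of the tier
            final long capacityBytes; // total capacity of the tier
            long usedBytes = 0;       // bytes already placed on the tier

            StorageTier(String name, double readMBps, long capacityBytes) {
                this.name = name;
                this.readMBps = readMBps;
                this.capacityBytes = capacityBytes;
            }

            // Score a tier by throughput, discounted as it fills up, so fast
            // tiers are preferred until capacity pressure pushes data down.
            double score() {
                double freeFraction = 1.0 - (double) usedBytes / capacityBytes;
                return readMBps * freeFraction;
            }
        }

        // Place one block replica on the highest-scoring tier that has room.
        static StorageTier placeReplica(List<StorageTier> tiers, long blockSize) {
            StorageTier best = tiers.stream()
                    .filter(t -> t.capacityBytes - t.usedBytes >= blockSize)
                    .max(Comparator.comparingDouble(StorageTier::score))
                    .orElseThrow(() -> new IllegalStateException("no tier has space"));
            best.usedBytes += blockSize;
            return best;
        }

        public static void main(String[] args) {
            List<StorageTier> tiers = List.of(
                    new StorageTier("pcie-ssd", 1500.0, 256L << 30),
                    new StorageTier("sata-ssd", 500.0, 512L << 30),
                    new StorageTier("hdd", 120.0, 4096L << 30));
            long block = 128L << 20; // a 128 MB HDFS block
            System.out.println(placeReplica(tiers, block).name); // "pcie-ssd"
        }
    }

    A retrieval policy could rank the replicas of a requested block the same way and serve reads from the fastest tier holding one, which is consistent with the paper's report that most I/O accesses end up directed to the SSD tiers.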



    Published In

    CCGRID '14: Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing
    May 2014
    987 pages
    ISBN: 9781479927838

    Publisher

    IEEE Press


    Author Tags

    1. data placement and retrieval policy
    2. hadoop distributed file system (HDFS)
    3. tiered storage

    Qualifiers

    • Research-article

    Conference

    CCGrid '14


    Cited By

    • (2023) Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems. ACM Transactions on Database Systems 48(4), 1-40. DOI: 10.1145/3625389. Online publication date: 13-Nov-2023.
    • (2021) Trident. Proceedings of the VLDB Endowment 14(9), 1570-1582. DOI: 10.14778/3461535.3461545. Online publication date: 22-Oct-2021.
    • (2020) Mosaic. Proceedings of the VLDB Endowment 13(12), 2662-2675. DOI: 10.14778/3407790.3407852. Online publication date: 1-Jul-2020.
    • (2019) Automating distributed tiered storage management in cluster computing. Proceedings of the VLDB Endowment 13(1), 43-56. DOI: 10.14778/3357377.3357381. Online publication date: 1-Sep-2019.
    • (2019) Approximate Code. Proceedings of the 48th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3337821.3337869. Online publication date: 5-Aug-2019.
    • (2018) bespoKV. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.5555/3291656.3291659. Online publication date: 11-Nov-2018.
    • (2018) bespoKV. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-16. DOI: 10.1109/SC.2018.00005. Online publication date: 11-Nov-2018.
    • (2017) Addressing Hadoop's Small File Problem With an Appendable Archive File Format. Proceedings of the Computing Frontiers Conference, 367-372. DOI: 10.1145/3075564.3078888. Online publication date: 15-May-2017.
    • (2016) High Performance Design for HDFS with Byte-Addressability of NVM and RDMA. Proceedings of the 2016 International Conference on Supercomputing, 1-14. DOI: 10.1145/2925426.2926290. Online publication date: 1-Jun-2016.
    • (2016) MOS. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 177-188. DOI: 10.1145/2907294.2907304. Online publication date: 31-May-2016.
