DOI: 10.1145/2480362.2480434

Modeling I/O interference for data intensive distributed applications

Published: 18 March 2013

Abstract

Data intensive applications such as MapReduce can suffer large performance degradation from I/O interference when multiple processes access the same I/O resources simultaneously, particularly disks. Understanding this effect is necessary to improve resource allocation and utilization for these applications. In this paper, we propose a model for predicting the impact of I/O interference on MapReduce application performance. Our model takes basic parameters of the workload and hardware environment, together with knowledge of the application's I/O behavior, and predicts how I/O interference affects the application's scalability. We compare the model's predictions for several workloads (TeraSort, WordCount, PFP Growth and PageRank) against the actual behavior of those workloads in a real cluster environment, and confirm that our model can provide highly accurate predictions.
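The abstract does not state the model itself, so the following is only a toy sketch of the kind of effect it describes, not the authors' method. It assumes a single node whose disk bandwidth degrades as more tasks read concurrently. The degradation law, the `alpha` coefficient, the function names, and all numbers are hypothetical and chosen purely for illustration.

```python
import math


def effective_bandwidth(seq_mb_s: float, streams: int, alpha: float) -> float:
    """Aggregate disk bandwidth (MB/s) shared by `streams` concurrent readers.

    ASSUMPTION: an invented degradation law, not taken from the paper.
    alpha = 0 means no interference; a larger alpha means each extra
    concurrent stream costs more aggregate bandwidth (e.g. extra seeks).
    """
    if streams < 1:
        raise ValueError("streams must be >= 1")
    return seq_mb_s / (1.0 + alpha * (streams - 1))


def job_time(tasks: int, slots: int, data_mb: float, cpu_s: float,
             seq_mb_s: float, alpha: float) -> float:
    """Estimated seconds to run `tasks` identical tasks on one node.

    Each task reads `data_mb` from the shared disk and then computes for
    `cpu_s` seconds; I/O and compute are assumed not to overlap.
    """
    per_stream = effective_bandwidth(seq_mb_s, slots, alpha) / slots
    task_time = cpu_s + data_mb / per_stream
    waves = math.ceil(tasks / slots)  # tasks run in waves of `slots` at a time
    return waves * task_time


if __name__ == "__main__":
    for slots in (1, 2, 4, 8):
        t = job_time(tasks=8, slots=slots, data_mb=100, cpu_s=10,
                     seq_mb_s=100, alpha=0.5)
        print(f"slots={slots}: {t:.1f} s")
```

With these made-up inputs the estimated job time drops from roughly 88 s at one slot to roughly 40 s at four slots, then rises again to roughly 46 s at eight slots, because the per-task I/O time grows faster than the number of waves shrinks. Setting `alpha=0` removes the interference term, and adding slots then never increases the estimate.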



Published In

SAC '13: Proceedings of the 28th Annual ACM Symposium on Applied Computing
March 2013
2124 pages
ISBN: 9781450316569
DOI: 10.1145/2480362
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. I/O behavior
  2. MapReduce
  3. cloud computing
  4. data intensive
  5. I/O interference

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Internal Affairs and Communications

Conference

SAC '13
March 18 - 22, 2013
Coimbra, Portugal

Acceptance Rates

SAC '13 Paper Acceptance Rate 255 of 1,063 submissions, 24%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%



Cited By

  • (2019) Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems? IEEE Transactions on Computers 68:5 (631-645). DOI: 10.1109/TC.2018.2881709. Online publication date: 1-May-2019
  • (2018) Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (803-812). DOI: 10.1109/IPDPSW.2018.00127. Online publication date: May-2018
  • (2016) Data Intensive Cloud Computing. Big Data (639-654). DOI: 10.4018/978-1-4666-9840-6.ch029. Online publication date: 2016
  • (2016) P-DOT. International Journal of Parallel, Emergent and Distributed Systems 31:3 (233-253). DOI: 10.1080/17445760.2015.1016515. Online publication date: 1-May-2016
  • (2016) Distributed scheduling with probabilistic and fuzzy classifications of processes. Future Generation Computer Systems 62:C (1-16). DOI: 10.1016/j.future.2016.03.001. Online publication date: 1-Sep-2016
  • (2015) Data Intensive Cloud Computing. Advanced Research on Cloud Computing Design and Applications (305-320). DOI: 10.4018/978-1-4666-8676-2.ch019. Online publication date: 2015
  • (2015) Design and implement of pre-loading SSD cache data using split file on Hadoop MapReduce. Proceedings of the 2015 Conference on research in adaptive and convergent systems (457-460). DOI: 10.1145/2811411.2811499. Online publication date: 9-Oct-2015
  • (2015) Probabilistic Estimation of Resource Affinities of Processes in Computing Systems. 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing (1492-1497). DOI: 10.1109/CIT/IUCC/DASC/PICOM.2015.223. Online publication date: Oct-2015
  • (2015) Big data analytics: a literature review. Journal of Management Analytics 2:3 (175-201). DOI: 10.1080/23270012.2015.1082449. Online publication date: 13-Oct-2015
  • (2014) SMARTH. Proceedings of the 2014 Brazilian Conference on Intelligent Systems (30-39). DOI: 10.1109/ICPP.2014.12. Online publication date: 18-Oct-2014
