DOI: 10.1145/3297663.3310302

Characterization of a Big Data Storage Workload in the Cloud

Published: 04 April 2019

Abstract

The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, heavy-tailed and bursty behaviour, a relative rarity of modifications, and a proliferation of big-data-specific formats.
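
As a rough illustration of the kind of statistics the abstract refers to (interarrival times of reads, burstiness, and weekly load imbalance), the following minimal sketch computes them from a list of read timestamps. It is not the authors' method: the function names, the sample timestamps, and the use of the coefficient of variation as a burstiness indicator are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean, pstdev


def interarrival_times(timestamps):
    """Return the gaps (in seconds) between consecutive events, after sorting."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean.

    Exponentially distributed gaps give a value near 1; values well
    above 1 are a common, if rough, hint of bursty arrivals.
    """
    return pstdev(values) / mean(values)


def reads_per_weekday(timestamps):
    """Count events per weekday (0 = Monday), reading Unix timestamps as UTC."""
    return Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).weekday() for t in timestamps
    )


# Hypothetical read timestamps (seconds): two short bursts and a straggler.
reads = [0, 1, 2, 3, 100, 101, 102, 500]
gaps = interarrival_times(reads)
cv = coefficient_of_variation(gaps)
weekday_counts = reads_per_weekday([0, 86400])  # 1 and 2 January 1970
```

On real traces, the same per-weekday counter could be computed per hour of the day to look for daily imbalances, and the gaps could be plotted on a log-log scale to inspect the tail of the distribution.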

Published In

ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
April 2019
348 pages
ISBN: 9781450362399
DOI: 10.1145/3297663
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. apache spark
  2. big data
  3. characterization
  4. cloud storage
  5. file formats
  6. interarrival time
  7. long-term trend
  8. popularity

Qualifiers

  • Research-article

Conference

ICPE '19

Acceptance Rates

ICPE '19 Paper Acceptance Rate: 13 of 71 submissions, 18%
Overall Acceptance Rate: 252 of 851 submissions, 30%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 36
  • Downloads (last 6 weeks): 8

Reflects downloads up to 10 Oct 2024

Cited By

  • (2023) Characterization of I/O Behaviors in Cloud Storage Workloads. IEEE Transactions on Computers 72:10 (2726-2739). DOI: 10.1109/TC.2023.3263726. Online publication date: Oct-2023
  • (2022) Revisiting Temporal Storage I/O Behaviors of Smartphone Applications: Analysis and Synthesis. 2022 IEEE International Symposium on Workload Characterization (IISWC) (215-227). DOI: 10.1109/IISWC55918.2022.00027. Online publication date: Nov-2022
  • (2021) Cluster resource scheduling in cloud computing: literature review and research challenges. The Journal of Supercomputing 78:5 (6898-6943). DOI: 10.1007/s11227-021-04138-z. Online publication date: 29-Oct-2021
  • (2020) Characterization and modeling of an edge computing mixed reality workload. Journal of Cloud Computing: Advances, Systems and Applications 9:1. DOI: 10.1186/s13677-020-00190-x. Online publication date: 22-Dec-2020
  • (2020) Large-Scale Analysis of the Docker Images and Performance Implications to Container Storage Systems. IEEE Transactions on Parallel and Distributed Systems (1-1). DOI: 10.1109/TPDS.2020.3034517. Online publication date: 2020
  • (2020) The Workflow Trace Archive: Open-Access Data From Public and Private Computing Infrastructures. IEEE Transactions on Parallel and Distributed Systems 31:9 (2170-2184). DOI: 10.1109/TPDS.2020.2984821. Online publication date: 1-Sep-2020
  • (2019) Efficient Estimation of Read Density when Caching for Big Data Processing. IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (502-507). DOI: 10.1109/INFCOMW.2019.8845043. Online publication date: Apr-2019
