research-article

Characterization of a Big Data Storage Workload in the Cloud

Authors:

Sacheendra Talluri,

Alicja Łuszczak,

Cristina L. Abad,

Alexandru Iosup

ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Pages 33 - 44

https://doi.org/10.1145/3297663.3310302

Published: 04 April 2019

Abstract

The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of the proliferation of big-data-specific formats.



Index Terms

Characterization of a Big Data Storage Workload in the Cloud
1. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
2. Information systems
  1. Data management systems
  2. Information storage systems
    1. Storage architectures
      1. Cloud based storage


Published In


ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

April 2019

348 pages

ISBN:9781450362399

DOI:10.1145/3297663

General Chairs: Varsha Apte (IIT Bombay, India), Antinisca Di Marco (University of L'Aquila, Italy)

Program Chairs: Marin Litoiu (York University, Canada), José Merseguer (Universidad de Zaragoza, Spain)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

ICPE '19

ICPE '19: Tenth ACM/SPEC International Conference on Performance Engineering

April 7 - 11, 2019

Mumbai, India

Acceptance Rates

ICPE '19 Paper Acceptance Rate 13 of 71 submissions, 18%;

Overall Acceptance Rate 252 of 851 submissions, 30%


Cited By

Zou Q, Zhu Y, Chen J, Deng Y, Qin X (2023). Characterization of I/O Behaviors in Cloud Storage Workloads. IEEE Transactions on Computers 72:10 (2726-2739). Online publication date: Oct-2023.
https://doi.org/10.1109/TC.2023.3263726
Zou Q, Mao B (2022). Revisiting Temporal Storage I/O Behaviors of Smartphone Applications: Analysis and Synthesis. 2022 IEEE International Symposium on Workload Characterization (IISWC) (215-227). Online publication date: Nov-2022.
https://doi.org/10.1109/IISWC55918.2022.00027
Khallouli W, Huang J (2021). Cluster resource scheduling in cloud computing: literature review and research challenges. The Journal of Supercomputing 78:5 (6898-6943). Online publication date: 29-Oct-2021.
https://doi.org/10.1007/s11227-021-04138-z
Toczé K, Lindqvist J, Nadjm-Tehrani S (2020). Characterization and modeling of an edge computing mixed reality workload. Journal of Cloud Computing: Advances, Systems and Applications 9:1. Online publication date: 22-Dec-2020.
https://dl.acm.org/doi/10.1186/s13677-020-00190-x
Zhao N, Tarasov V, Albahar H, Anwar A, Rupprecht L, Skourtis D, Paul A, Chen K, Butt A (2020). Large-Scale Analysis of the Docker Images and Performance Implications to Container Storage Systems. IEEE Transactions on Parallel and Distributed Systems (1-1). Online publication date: 2020.
https://doi.org/10.1109/TPDS.2020.3034517
Versluis L, Matha R, Talluri S, Hegeman T, Prodan R, Deelman E, Iosup A (2020). The Workflow Trace Archive: Open-Access Data From Public and Private Computing Infrastructures. IEEE Transactions on Parallel and Distributed Systems 31:9 (2170-2184). Online publication date: 1-Sep-2020.
https://doi.org/10.1109/TPDS.2020.2984821
Talluri S, Iosup A (2019). Efficient Estimation of Read Density when Caching for Big Data Processing. IEEE INFOCOM 2019 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (502-507). Online publication date: Apr-2019.
https://doi.org/10.1109/INFCOMW.2019.8845043
