Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Authors:

Randy Katz

Proceedings of the VLDB Endowment, Volume 5, Issue 12

Pages 1802 - 1813

https://doi.org/10.14778/2367502.2367519

Published: 01 August 2012

Abstract

Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.

Published In

Proceedings of the VLDB Endowment, Volume 5, Issue 12, August 2012, 340 pages

ISSN: 2150-8097

Publisher: VLDB Endowment
