research-article

Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Published: 01 August 2012
  • Abstract

    Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.


    Published In

    Proceedings of the VLDB Endowment, Volume 5, Issue 12
    August 2012
    340 pages

    Publisher

    VLDB Endowment
