Semantics-Aware Prediction for Analytic Queries in MapReduce Environment

Authors:

Xiaoning Ding

ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing

Article No.: 27, Pages 1 - 9

https://doi.org/10.1145/3229710.3229713

Published: 13 August 2018

Abstract

MapReduce has emerged as a powerful data processing engine that supports large-scale complex analytics applications, most of which are written in declarative query languages such as HiveQL and Pig Latin. Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy that strives for a balanced distribution of tasks and effective utilization of resources. However, these purely task-based scheduling algorithms lack query-level semantics, which results in resource thrashing among queries and an overall degradation of performance. We therefore introduce a semantics-aware query prediction framework to address these problems systematically. Our framework includes three major techniques: cross-layer semantics percolation, selectivity estimation, and multivariate time prediction for analytic queries. Multivariate query prediction allows us not only to gauge the dynamic size of analytics datasets, but also to accurately predict the resource usage (e.g., numbers of map and reduce tasks) of individual MapReduce jobs and whole queries. In addition, the accurate prediction and queuing of queries can potentially be exploited by Hadoop scheduling to optimize overall query performance. Based on the query prediction, our case study scheduler demonstrates significant performance improvement compared to traditional Hadoop schedulers.
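As an illustration of the dispatch rule described in the abstract (a job becomes eligible once all of its dependencies have finished), the following minimal Python sketch groups the jobs of a query plan into successive "waves" of ready jobs. It is not the paper's framework or Hadoop's scheduler: the function name `dispatch_waves`, the dictionary plan format, and the job names are all hypothetical, and grouping into waves is a simplification of dispatching each job the moment it becomes ready.

```python
def dispatch_waves(dag):
    """Group jobs into waves of jobs whose dependencies have all finished.

    `dag` maps each job name to the list of jobs it depends on
    (a hypothetical plan format chosen for this sketch).
    """
    pending = {job: set(deps) for job, deps in dag.items()}
    done = set()
    waves = []
    while pending:
        # A job is ready when every one of its dependencies is already done.
        ready = sorted(job for job, deps in pending.items() if deps <= done)
        if not ready:
            # Nothing can run: the plan has a cycle or names an unknown job.
            raise ValueError("cycle or unknown dependency in plan")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del pending[job]
    return waves


# Hypothetical query plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_orders": [],
    "scan_items": [],
    "join": ["scan_orders", "scan_items"],
    "aggregate": ["join"],
}
waves = dispatch_waves(plan)
```

In this example the two scan jobs have no dependencies and are eligible immediately, the join becomes eligible only after both scans finish, and the aggregation follows the join. A scheduler that sees only individual jobs has no view of which query each wave belongs to, which is the gap in query-level semantics that the abstract points to.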

Cited By

Liu Z, Nath A, Ding X, Fu H, Muhib Khan M, Yu W. (2019). Multivariate modeling and two-level scheduling of analytic queries. Parallel Computing 85:C (66-78). DOI: 10.1016/j.parco.2019.01.006. Online publication date: 1-Jul-2019.
https://dl.acm.org/doi/10.1016/j.parco.2019.01.006

Index Terms

Semantics-Aware Prediction for Analytic Queries in MapReduce Environment
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing

August 2018

409 pages

ISBN:9781450365239

DOI:10.1145/3229710

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP '18 Comp: 47th International Conference on Parallel Processing Companion

August 13 - 16, 2018

Eugene, OR, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
