DOI:10.1145/3229710.3229713
research-article

Semantics-Aware Prediction for Analytic Queries in MapReduce Environment

Published: 13 August 2018

Abstract

MapReduce has emerged as a powerful data processing engine that supports large-scale complex analytics applications, most of which are written in declarative query languages such as HiveQL and Pig Latin. Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy to strive for balanced distribution of tasks and effective utilization of resources. However, purely task-based scheduling algorithms lack query-level semantics, which results in resource thrashing among queries and an overall degradation of performance. Therefore, we introduce a semantics-aware query prediction framework to address these problems systematically. Our framework includes three major techniques: cross-layer semantics percolation, selectivity estimation, and multivariate time prediction for analytic queries. Multivariate query prediction allows us not only to gauge the dynamic size of analytics datasets, but also to accurately predict the resource usage (e.g., numbers of map and reduce tasks) of individual MapReduce jobs and whole queries. In addition, the accurate prediction and queuing of queries can potentially be exploited by Hadoop scheduling to optimize overall query performance. Based on the query prediction, our case study scheduler demonstrates a significant performance improvement over traditional Hadoop schedulers.
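
The dependency-driven dispatch of a query's job DAG can be illustrated with a minimal sketch. This is not the paper's framework or Hadoop's scheduler: the function `dispatch_waves`, the job names, and the grouping of jobs into "waves" are all assumptions made for illustration. A real engine dispatches each job individually as soon as its parents finish; the waves here only show which jobs become eligible together.

```python
def dispatch_waves(deps):
    """Group jobs into waves of jobs whose dependencies have all finished.

    deps maps each (hypothetical) job name to the set of jobs it depends on.
    """
    remaining = {job: set(parents) for job, parents in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A job is ready once every job it depends on has completed.
        ready = sorted(job for job, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("no job is ready: the plan is not a valid DAG")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


# Hypothetical plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}
waves = dispatch_waves(plan)
print(waves)
```

Because the dispatch decision in this sketch depends only on job-level dependencies, nothing in it knows which query a job belongs to, which is the gap in query-level semantics that the abstract points to.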

Cited By

  • (2019) Multivariate modeling and two-level scheduling of analytic queries. Parallel Computing 85:C (66-78). DOI:10.1016/j.parco.2019.01.006. Online publication date: 1-Jul-2019

Published In

ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing
August 2018
409 pages
ISBN:9781450365239
DOI:10.1145/3229710
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Analytics Query
  2. MapReduce
  3. Scheduling
  4. Semantics-Aware

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '18 Comp

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
