Article

Re-optimizing data-parallel computing

Authors: Sameer Agarwal, Srikanth Kandula, Jingren Zhou

NSDI'12: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

Page 21

Published: 25 April 2012

Abstract

Performant execution of data-parallel jobs needs good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generating these plans. Yet these properties are difficult to estimate, for three reasons: the frameworks are highly distributed, users are free to specify arbitrary code as operations on the data, and jobs in modern clusters have evolved beyond single map and reduce phases into logical graphs of operations. Using fixed a priori estimates of these properties to choose execution plans, as modern systems do, leads to poor performance in several instances. We present RoPE, a first step towards re-optimizing data-parallel jobs. RoPE collects certain code and data properties by piggybacking on job execution. It adapts execution plans by feeding these properties to a query optimizer. We show how this improves future invocations of the same (and similar) jobs and characterize the scenarios of benefit. Experiments on Bing's production clusters show up to 2× improvement in response time for production jobs at the 75th percentile while using 1.5× fewer resources.
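
The idea of collecting properties during execution and feeding them back into plan selection can be illustrated with a minimal sketch. This is not RoPE's implementation: the abstract does not say which statistics are collected or how the optimizer uses them. The names `OperatorStats`, `filter_with_stats`, and `choose_join`, the fixed guess of 0.5, and the broadcast threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class OperatorStats:
    """Counters gathered as a side effect of running one operator (hypothetical)."""
    rows_in: int = 0
    rows_out: int = 0

    @property
    def selectivity(self) -> float:
        # Fraction of input rows that survived; 1.0 if nothing was observed yet.
        return self.rows_out / self.rows_in if self.rows_in else 1.0


def filter_with_stats(rows: Iterable[int],
                      predicate: Callable[[int], bool],
                      stats: OperatorStats) -> Iterator[int]:
    """Run an arbitrary user predicate and count rows while doing so."""
    for row in rows:
        stats.rows_in += 1
        if predicate(row):
            stats.rows_out += 1
            yield row


def choose_join(input_rows: int, selectivity: float,
                broadcast_limit: int = 1_000) -> str:
    """Toy plan choice: broadcast small inputs, repartition large ones."""
    estimated_rows = round(input_rows * selectivity)
    return "broadcast" if estimated_rows <= broadcast_limit else "repartition"


def demo() -> tuple:
    data = range(10_000)
    stats = OperatorStats()
    # First run: the plan is chosen from a fixed guess (assumed here to be 0.5).
    first_plan = choose_join(len(data), selectivity=0.5)
    kept = list(filter_with_stats(data, lambda x: x % 100 == 0, stats))
    # Later run of the same job: the plan is chosen from the observed selectivity.
    next_plan = choose_join(len(data), stats.selectivity)
    return first_plan, next_plan, len(kept)
```

In this toy setting the fixed guess overestimates the filter's output and picks a repartition join, while the observed selectivity lets the next invocation pick the cheaper broadcast join.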


Re-optimizing data-parallel computing
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

NSDI'12: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

April 2012

30 pages

Program Chairs: Steven Gribble (University of Washington), Dina Katabi (Massachusetts Institute of Technology)

Sponsors

VMware
NSF: National Science Foundation
Google Inc.
Infosys
Microsoft Research

Publisher

USENIX Association

United States

Cited By

Yan C, Lin Y, He Y (2023). Predicate Pushdown for Data Science Pipelines. Proceedings of the ACM on Management of Data 1:2 (1-28). Online publication date: 20-Jun-2023.
https://dl.acm.org/doi/10.1145/3589281

Negi P, Interlandi M, Marcus R, Alizadeh M, Kraska T, Friedman M, Jindal A, Li G, Li Z, Idreos S, Srivastava D (2021). Steering Query Optimizers: A Practical Take on Big Data Workloads. Proceedings of the 2021 International Conference on Management of Data (2557-2569). Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3457568

Bag M, Jindal A, Patel H, Phanishayee A, Stutsman R (2020). Towards plan-aware resource allocation in serverless query processing. Proceedings of the 12th USENIX Conference on Hot Topics in Cloud Computing (9-9). Online publication date: 13-Jul-2020.
https://dl.acm.org/doi/10.5555/3485849.3485858

Jindal A (2020). Applied Research Lessons from CloudViews Project. ACM SIGMOD Record 49:3 (37-42). Online publication date: 17-Dec-2020.
https://dl.acm.org/doi/10.1145/3444831.3444839

Bilal M, Canini M, Rodrigues R, Fonseca R, Delimitrou C, Ooi B (2020). Finding the right cloud configuration for analytics clusters. Proceedings of the 11th ACM Symposium on Cloud Computing (208-222). Online publication date: 12-Oct-2020.
https://dl.acm.org/doi/10.1145/3419111.3421305

Siddiqui T, Jindal A, Qiao S, Patel H, Le W, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (99-113). Online publication date: 11-Jun-2020.
https://dl.acm.org/doi/10.1145/3318464.3380584

Mahgoub A, Wood P, Medoff A, Mitra S, Meyer F, Chaterji S, Bagchi S, Dan T, Dahlia M (2019). SOPHIA. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (223-239). Online publication date: 10-Jul-2019.
https://dl.acm.org/doi/10.5555/3358807.3358827

Kandula S, Lee K, Chaudhuri S, Friedman M (2019). Experiences with approximating queries in Microsoft's production big-data clusters. Proceedings of the VLDB Endowment 12:12 (2131-2142). Online publication date: 1-Aug-2019.
https://dl.acm.org/doi/10.14778/3352063.3352130

Jindal A, Patel H, Roy A, Qiao S, Yin Z, Sen R, Krishnan S (2019). Peregrine. Proceedings of the ACM Symposium on Cloud Computing (416-427). Online publication date: 20-Nov-2019.
https://dl.acm.org/doi/10.1145/3357223.3362726

Mao H, Schwarzkopf M, Venkatakrishnan S, Meng Z, Alizadeh M, Wu J, Hall W (2019). Learning scheduling algorithms for data processing clusters. Proceedings of the ACM Special Interest Group on Data Communication (270-288). Online publication date: 19-Aug-2019.
https://dl.acm.org/doi/10.1145/3341302.3342080
