10.5555/2228298.2228327
Re-optimizing data-parallel computing

Published: 25 April 2012

Abstract

Performant execution of data-parallel jobs requires good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generating these plans. Yet these properties are difficult to estimate, for three reasons: data-parallel frameworks are highly distributed, users are free to specify arbitrary code as operations on the data, and jobs in modern clusters have evolved beyond single map and reduce phases into logical graphs of operations. Using fixed a priori estimates of these properties to choose execution plans, as modern systems do, leads to poor performance in several instances. We present RoPE, a first step towards re-optimizing data-parallel jobs. RoPE collects certain code and data properties by piggybacking on job execution, and adapts execution plans by feeding these properties to a query optimizer. We show how this improves future invocations of the same (and similar) jobs and characterize the scenarios of benefit. Experiments on Bing's production clusters show up to a 2× improvement in response time for production jobs at the 75th percentile while using 1.5× fewer resources.

Published In

NSDI'12: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
April 2012
30 pages

Sponsors

  • VMware
  • NSF: National Science Foundation
  • Google Inc.
  • Infosys
  • Microsoft Research

Publisher

USENIX Association

United States

Qualifiers

  • Article

Cited By

  • (2023) Predicate Pushdown for Data Science Pipelines. Proceedings of the ACM on Management of Data, 10.1145/3589281, 1:2 (1-28). Online publication date: 20-Jun-2023
  • (2021) Steering Query Optimizers: A Practical Take on Big Data Workloads. Proceedings of the 2021 International Conference on Management of Data, 10.1145/3448016.3457568 (2557-2569). Online publication date: 9-Jun-2021
  • (2020) Towards plan-aware resource allocation in serverless query processing. Proceedings of the 12th USENIX Conference on Hot Topics in Cloud Computing, 10.5555/3485849.3485858 (9-9). Online publication date: 13-Jul-2020
  • (2020) Applied Research Lessons from CloudViews Project. ACM SIGMOD Record, 10.1145/3444831.3444839, 49:3 (37-42). Online publication date: 17-Dec-2020
  • (2020) Finding the right cloud configuration for analytics clusters. Proceedings of the 11th ACM Symposium on Cloud Computing, 10.1145/3419111.3421305 (208-222). Online publication date: 12-Oct-2020
  • (2020) Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 10.1145/3318464.3380584 (99-113). Online publication date: 11-Jun-2020
  • (2019) SOPHIA. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, 10.5555/3358807.3358827 (223-239). Online publication date: 10-Jul-2019
  • (2019) Experiences with approximating queries in Microsoft's production big-data clusters. Proceedings of the VLDB Endowment, 10.14778/3352063.3352130, 12:12 (2131-2142). Online publication date: 1-Aug-2019
  • (2019) Peregrine. Proceedings of the ACM Symposium on Cloud Computing, 10.1145/3357223.3362726 (416-427). Online publication date: 20-Nov-2019
  • (2019) Learning scheduling algorithms for data processing clusters. Proceedings of the ACM Special Interest Group on Data Communication, 10.1145/3341302.3342080 (270-288). Online publication date: 19-Aug-2019
