Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Abstract cost models for distributed data-intensive computations

Published: 01 September 2019 Publication History

Abstract

We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. To find a job partitioning that minimizes running time, a cost model, which we more accurately refer to as makespan model, is needed. In attempting to find the simplest possible, but sufficiently accurate, such model, we explore piecewise linear functions of input, output, and computational complexity. They are abstract in the sense that they capture fundamental algorithm properties, but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. In the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to directly integrate the model into the makespan optimization process, reducing search-space dimensionality and thus complexity by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.

References

[1]
Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)
[2]
Akdere, M., Cetintemel, U., Riondato, M., Upfal, E., Zdonik, S.: Learning-based query performance modeling and prediction. In: ICDE, pp. 390–401 (2012)
[3]
Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K.: A view of the parallel computing landscape. Commun. ACM 52(10), 56–67 (2009)
[4]
Ballard, G., Buluc, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O., Toledo, S.: Communication optimal parallel multiplication of sparse random matrices. In: SPAA, pp. 222–231 (2013)
[5]
Duggan, J., Cetintemel, U., Papaemmanouil, O., Upfal, E.: Performance prediction for concurrent database workloads. In: SIGMOD, pp. 337–348 (2011)
[6]
Duggan, J., Papaemmanouil, O., Çetintemel, U., Upfal, E.: Contender: a resource modeling approach for concurrent query performance prediction. In: EDBT, pp. 109–120 (2014)
[7]
Elmroth, E., Gustavson, F., Jonsson, I., K$\mathring{\text{a}}$gstr$\ddot{\text{ o }}$m, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46(1), 3–45 (2004)
[8]
Ganapathi, A., Kuno, H.A., Dayal, U., Wiener, J.L., Fox, A., Jordan, M.I., Patterson, D.A.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In: ICDE, pp. 592–603 (2009)
[9]
Goodrich, M.T., Sitchinava, N., Zhang, Q.: Sorting, searching, and simulation in the mapreduce framework. In: ISAAC, pp. 374–383 (2011)
[10]
Gounaris, A., Kougka, G., Tous, R., Montes, C.T., Torres, J.: Dynamic configuration of partitioning in spark applications. IEEE Trans. Parallel Distrib. Syst. 28(7), 1891–1904 (2017)
[11]
Gunther, N.J., Puglia, P., Tomasette, K.: Hadoop superlinear scalability. Commun. ACM 58(4), 46–55 (2015)
[12]
Hahn, C., Warren, S., Eastman, R.: Extended edited synoptic cloud reports from ships and land stations over the globe, 1952–2009 (ndp-026c) (2012)
[13]
Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. VLDB 4(11), 1111–1122 (2011)
[14]
Huang, B., Babu, S., Yang, J.: Cumulon: optimizing statistical data analysis in the cloud. In: Proceedings of SIGMOD, pp. 1–12 (2013)
[15]
Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)
[16]
Kaoudi, Z., Quiane-Ruiz, JA., Thirumuruganathan, S., Chawla, S., Agrawal, D.: A cost-based optimizer for gradient descent optimization. In: SIGMOD, pp. 977–992 (2017)
[17]
Karloff, H., Suri, S., Vassilvitskii, S.: A model of computation for mapreduce. In: SODA, pp. 938–948 (2010)
[18]
Li, J., Naughton, J.F., Nehme, R.V.: Resource bricolage and resource selection for parallel database systems. VLDB J. 26(1), 31–54 (2017)
[19]
Li, R., Riedewald, M., Deng, X.: Submodularity of distributed join computation. In: (Upcoming) SIGMOD (2018)
[20]
Lichman, M.: UCI machine learning repository (2013)
[21]
Morton, K., Balazinska, M., Grossman, D.: Paratimer: a progress indicator for mapreduce dags. In: SIGMOD, pp. 507–518 (2010)
[22]
Munson, A.M., Webb, K., Sheldon, D., Fink, D., Hochachka, W.M., Iliff, M., Riedewald, M., Sorokina, D., Sullivan, B., Wood, C., Kelling, S.: The Ebird Reference Dataset, Version 2014. Cornell Lab of Ornithology and National Audubon Society, Ithaca, NY (2014)
[23]
Quinlan, J.R., et al.: Learning with continuous classes. Aust. Jt. Conf. Artif. Intell. 92, 343–348 (1992)
[24]
Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)
[25]
Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., Wang, C.: Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs. In: VLDB, pp. 1319–1330 (2014)
[26]
Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5d matrix multiplication and LU factorization algorithms. In: Euro-Par 2011 Parallel Processing, pp. 90–109 (2011)
[27]
Valsalam, V., Skjellum, A.: A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurr. Comput. 14(10), 805–839 (2002)
[28]
van de Geijn, R.A., Watts, J.: Summa: Scalable Universal Matrix Multiplication Algorithm. University of Texas at Austin, Tech. rep. (1995)
[29]
Venkataraman, S., Yang, Z., Franklin, M., Recht, B., Stoica, I.: Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI, pp. 363–378 (2016)
[30]
Vieth, E.: Fitting piecewise linear regression functions to biological responses. J. Appl. Physiol. 67(1), 390–396 (1989)
[31]
Wang, G., Chan, CY.: Multi-query optimization in mapreduce framework. In: VLDB, pp. 145–156 (2013)
[32]
White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C, Joglekar, A.: An integrated experimental environment for distributed systems and networks. In: OSDI, pp. 255–270 (2002)
[33]
Wu, W., Chi, Y., Hacígümüş, H., Naughton, J.F.: Towards predicting query execution time for concurrent and dynamic database workloads. VLDB 6(10), 925–936 (2013)
[34]
Zhang, X., Chen, L., Wang, M.: Efficient multi-way theta-join processing using mapreduce. In: VLDB, pp. 1184–1195 (2012)

Cited By

View all
  • (2024)A Spark Optimizer for Adaptive, Fine-Grained Parameter TuningProceedings of the VLDB Endowment10.14778/3681954.368202117:11(3565-3579)Online publication date: 1-Jul-2024
  • (2024)Saving Money for Analytical Workloads in the CloudProceedings of the VLDB Endowment10.14778/3681954.368201817:11(3524-3537)Online publication date: 1-Jul-2024
  • (2020)Near-Optimal Distributed Band-Joins through Recursive PartitioningProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389750(2375-2390)Online publication date: 11-Jun-2020

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Distributed and Parallel Databases
Distributed and Parallel Databases  Volume 37, Issue 3
Sep 2019
144 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 September 2019

Author Tags

  1. Distributed analytics
  2. Makespan minimization
  3. Cost model
  4. Data partitioning

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Spark Optimizer for Adaptive, Fine-Grained Parameter TuningProceedings of the VLDB Endowment10.14778/3681954.368202117:11(3565-3579)Online publication date: 1-Jul-2024
  • (2024)Saving Money for Analytical Workloads in the CloudProceedings of the VLDB Endowment10.14778/3681954.368201817:11(3524-3537)Online publication date: 1-Jul-2024
  • (2020)Near-Optimal Distributed Band-Joins through Recursive PartitioningProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389750(2375-2390)Online publication date: 11-Jun-2020

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media