article

PREDIcT: towards predicting the runtime of large scale iterative analytics

Authors:

Adrian Daniel Popescu,

Anastasia AilamakiAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 6, Issue 14

Pages 1678 - 1689

https://doi.org/10.14778/2556549.2556553

Published: 01 September 2013 Publication History

Abstract

Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult.

This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs for capturing the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100× runtime variability among consecutive iterations.

References

[1]

F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. In EDBT, pages 1-8, 2011.

[2]

M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Interaction-aware Prediction of Business Intelligence Workload Completion Times. In ICDE, pages 413-416, 2010.

[3]

D. Arthur and S. Vassilvitskii. How Slow is the k-means Method? In SCG, pages 144-153, 2006.

[4]

F. Bancilhon and R. Ramakrishnan. An Amateur's Introduction to Recursive Query Processing Strategies. In SIGMOD, pages 16-52, 1986.

[5]

Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.

[6]

Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. PVLDB, 3:285-296, 2010.

[7]

M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.

[8]

S. Chaudhuri, R. Kaushik, and R. Ramamurthy. When Can We Trust Progress Estimators for SQL Queries? In SIGMOD, pages 575-586, 2005.

[9]

S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How Much is Enough? In SIGMOD, pages 436-447, 1998.

[10]

J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2):1481-1492, 2009.

[11]

A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Found. Trends databases, 1(1):1-140, 2007.

[12]

J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance Prediction for Concurrent Database Workloads. In SIGMOD, pages 337-348, 2011.

[13]

S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning Fast Iterative Data Flows. PVLDB, 5(11):1268-1279, 2012.

[14]

A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE, pages 592-603, 2009.

[15]

M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, pages 2498-2506, 2010.

[16]

S. Har-Peled and B. Sadri. How Fast is the k-means Method? In SODA, pages 185-202, 2005.

[17]

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer, 2008.

[18]

H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111-1122, 2011.

[19]

H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261-272, 2011.

[20]

U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In ICDM, pages 229-238, 2009.

[21]

Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing. In EuroSys, pages 169-182, 2013.

[22]

A. N. Langville and C. D. Meyer. Deeper Inside PageRank. Internet Mathematics, 1(3):335-380, 2003.

[23]

J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In KDD, pages 631-636, 2006.

[24]

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical Properties of Community Structure in Large Social and Information Networks. In WWW, pages 695-704, 2008.

[25]

J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. PVLDB, 5(11):1555-1566, 2012.

[26]

Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 5(8):716-727, 2012.

[27]

G. Luo, J. F. Naughton, and P. S. Yu. Multiquery SQL Progress Indicators. In EDBT, pages 921-941, 2006.

[28]

G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, pages 135-146, 2010.

[29]

K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In SIGMOD, pages 507-518, 2010.

[30]

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.

[31]

A. D. Popescu, A. Balmin, V. Ercegovac, and A. Ailamaki. Towards Predicting the Runtime of Iterative Analytics with PREDIcT. Technical Report 187356, EPFL, 2013.

[32]

A. D. Popescu, V. Ercegovac, A. Balmin, M. Branco, and A. Ailamaki. Same Queries, Different Data: Can We Predict Runtime Performance? In ICDE Workshops, pages 275-280, 2012.

[33]

M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2's LEarning Optimizer. In VLDB, pages 19-28, 2001.

[34]

L. G. Valiant. A Bridging Model for Parallel Computation. CACM, 33(8):103-111, 1990.

[35]

S. van Dongen. A Cluster Algorithm for Graphs. Technical Report INS-R0010, CWI, 2000.

[36]

A. Verma, L. Cherkasova, and R. H. Campbell. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In ICAC, pages 235-244, 2011.

[37]

G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous Large-Scale Graph Processing Made Easy. In CIDR, 2013.

[38]

J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin. FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. In Middleware, pages 1-20, 2010.

[39]

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010.

Cited By

Al-Sayeh HJibril MSattler K(2024)Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data ApplicationsProceedings of the VLDB Endowment10.14778/3681954.368199017:11(3151-3164)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3681990
Yue XDing QZhu JDing Y(2024)TimeLink: enabling dynamic runtime prediction for Flink iterative jobsThe Journal of Supercomputing10.1007/s11227-024-06085-x80:11(16546-16573)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s11227-024-06085-x
Yokelson DCharest MLi YScully-Allison CLiem RVeroneze Solorzano AAfzal AKousha P(2023)HPC Application Performance Prediction with Machine Learning on New ArchitecturesProceedings of the 2023 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn Strategy10.1145/3588993.3597262(1-8)Online publication date: 28-Jul-2023
https://dl.acm.org/doi/10.1145/3588993.3597262
Show More Cited By

Recommendations

SARNA-Predict: Accuracy Improvement of RNA Secondary Structure Prediction Using Permutation-Based Simulated Annealing

Ribonucleic acid (RNA), a single-stranded linear molecule, is essential to all biological systems. Different regions of the same RNA strand will fold together via base pair interactions to make intricate secondary and tertiary structures that guide ...
Using Historical Data to Predict Application Runtimes on Backfilling Parallel Systems
PDP '10: Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing

We present in this paper a novel method to predict application runtimes on backfilling parallel systems. The method is based on mining historical data to obtain important parameters. These parameters are then applied to predict the runtime of future ...
Semismooth Karush-Kuhn-Tucker Equations and Convergence Analysis of Newton and Quasi-Newton Methods for Solving these Equations

There are several forms of systems of nonsmooth equations which are equivalent to the Karush-Kuhn-Tucker KKT system of a nonlinearly constrained optimization problem NLP. If the NLP is twice continuously differentiable and the Hessian functions of its ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 6, Issue 14

September 2013

384 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 September 2013

Published in PVLDB Volume 6, Issue 14

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

21
Total Citations
View Citations
218
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)0

Reflects downloads up to 30 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Al-Sayeh HJibril MSattler K(2024)Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data ApplicationsProceedings of the VLDB Endowment10.14778/3681954.368199017:11(3151-3164)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3681990
Yue XDing QZhu JDing Y(2024)TimeLink: enabling dynamic runtime prediction for Flink iterative jobsThe Journal of Supercomputing10.1007/s11227-024-06085-x80:11(16546-16573)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s11227-024-06085-x
Yokelson DCharest MLi YScully-Allison CLiem RVeroneze Solorzano AAfzal AKousha P(2023)HPC Application Performance Prediction with Machine Learning on New ArchitecturesProceedings of the 2023 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn Strategy10.1145/3588993.3597262(1-8)Online publication date: 28-Jul-2023
https://dl.acm.org/doi/10.1145/3588993.3597262
Narasayya VChaudhuri SIves ZBonifati AEl Abbadi A(2022)Multi-Tenant Cloud Data Services: State-of-the-Art, Challenges and OpportunitiesProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3522566(2465-2473)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3522566
Al-Sayeh HMemishi BJibril MParadies MSattler KIves ZBonifati AEl Abbadi A(2022)Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data ApplicationsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517892(1840-1854)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517892
Trummer IWang JWei ZMaram DMoseley SJo SAntonakakis JRayabhari A(2021)SkinnerDB: Regret-bounded Query Evaluation via Reinforcement LearningACM Transactions on Database Systems10.1145/346438946:3(1-45)Online publication date: 28-Sep-2021
https://dl.acm.org/doi/10.1145/3464389
Ventura FKaoudi ZQuiané-Ruiz JMarkl VLi GLi ZIdreos SSrivastava D(2021)Expand your Training Limits! Generating Training Data for ML-based Data ManagementProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457286(1865-1878)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457286
Yue XShi LZhao YJi HWang G(2021)Online Runtime Prediction Method for Distributed Iterative JobsWeb Information Systems and Applications10.1007/978-3-030-87571-8_14(156-168)Online publication date: 24-Sep-2021
https://dl.acm.org/doi/10.1007/978-3-030-87571-8_14
Jeong MPark SHan H(2020)Caching Cost Model for In-memory Data Analytics FrameworkThe 9th International Conference on Smart Media and Applications10.1145/3426020.3426070(187-189)Online publication date: 17-Sep-2020
https://dl.acm.org/doi/10.1145/3426020.3426070
Al-Sayeh HHagedorn SSattler K(2020)A gray-box modeling methodology for runtime prediction of Apache Spark jobsDistributed and Parallel Databases10.1007/s10619-020-07286-y38:4(819-839)Online publication date: 10-Mar-2020
https://dl.acm.org/doi/10.1007/s10619-020-07286-y
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents