Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article

PREDIcT: towards predicting the runtime of large scale iterative analytics

Published: 01 September 2013 Publication History

Abstract

Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult.
This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs for capturing the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100× runtime variability among consecutive iterations.

References

[1]
F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-Reduce Extensions and Recursive Queries. In EDBT, pages 1-8, 2011.
[2]
M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Interaction-aware Prediction of Business Intelligence Workload Completion Times. In ICDE, pages 413-416, 2010.
[3]
D. Arthur and S. Vassilvitskii. How Slow is the k-means Method? In SCG, pages 144-153, 2006.
[4]
F. Bancilhon and R. Ramakrishnan. An Amateur's Introduction to Recursive Query Processing Strategies. In SIGMOD, pages 16-52, 1986.
[5]
Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling Datalog for Machine Learning on Big Data. CoRR, abs/1203.0160, 2012.
[6]
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. PVLDB, 3:285-296, 2010.
[7]
M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.
[8]
S. Chaudhuri, R. Kaushik, and R. Ramamurthy. When Can We Trust Progress Estimators for SQL Queries? In SIGMOD, pages 575-586, 2005.
[9]
S. Chaudhuri, R. Motwani, and V. Narasayya. Random Sampling for Histogram Construction: How Much is Enough? In SIGMOD, pages 436-447, 1998.
[10]
J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2):1481-1492, 2009.
[11]
A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Found. Trends databases, 1(1):1-140, 2007.
[12]
J. Duggan, U. Cetintemel, O. Papaemmanouil, and E. Upfal. Performance Prediction for Concurrent Database Workloads. In SIGMOD, pages 337-348, 2011.
[13]
S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning Fast Iterative Data Flows. PVLDB, 5(11):1268-1279, 2012.
[14]
A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE, pages 592-603, 2009.
[15]
M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, pages 2498-2506, 2010.
[16]
S. Har-Peled and B. Sadri. How Fast is the k-means Method? In SODA, pages 185-202, 2005.
[17]
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer, 2008.
[18]
H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111-1122, 2011.
[19]
H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261-272, 2011.
[20]
U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In ICDM, pages 229-238, 2009.
[21]
Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing. In EuroSys, pages 169-182, 2013.
[22]
A. N. Langville and C. D. Meyer. Deeper Inside PageRank. Internet Mathematics, 1(3):335-380, 2003.
[23]
J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In KDD, pages 631-636, 2006.
[24]
J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical Properties of Community Structure in Large Social and Information Networks. In WWW, pages 695-704, 2008.
[25]
J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. PVLDB, 5(11):1555-1566, 2012.
[26]
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 5(8):716-727, 2012.
[27]
G. Luo, J. F. Naughton, and P. S. Yu. Multiquery SQL Progress Indicators. In EDBT, pages 921-941, 2006.
[28]
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, pages 135-146, 2010.
[29]
K. Morton, M. Balazinska, and D. Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In SIGMOD, pages 507-518, 2010.
[30]
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.
[31]
A. D. Popescu, A. Balmin, V. Ercegovac, and A. Ailamaki. Towards Predicting the Runtime of Iterative Analytics with PREDIcT. Technical Report 187356, EPFL, 2013.
[32]
A. D. Popescu, V. Ercegovac, A. Balmin, M. Branco, and A. Ailamaki. Same Queries, Different Data: Can We Predict Runtime Performance? In ICDE Workshops, pages 275-280, 2012.
[33]
M. Stillger, G. M. Lohman, V. Markl, and M. Kandil. LEO - DB2's LEarning Optimizer. In VLDB, pages 19-28, 2001.
[34]
L. G. Valiant. A Bridging Model for Parallel Computation. CACM, 33(8):103-111, 1990.
[35]
S. van Dongen. A Cluster Algorithm for Graphs. Technical Report INS-R0010, CWI, 2000.
[36]
A. Verma, L. Cherkasova, and R. H. Campbell. ARIA: Automatic Resource Inference and Allocation for MapReduce Environments. In ICAC, pages 235-244, 2011.
[37]
G. Wang, W. Xie, A. J. Demers, and J. Gehrke. Asynchronous Large-Scale Graph Processing Made Easy. In CIDR, 2013.
[38]
J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin. FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. In Middleware, pages 1-20, 2010.
[39]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010.

Cited By

View all
  • (2024)Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data ApplicationsProceedings of the VLDB Endowment10.14778/3681954.368199017:11(3151-3164)Online publication date: 1-Jul-2024
  • (2024)TimeLink: enabling dynamic runtime prediction for Flink iterative jobsThe Journal of Supercomputing10.1007/s11227-024-06085-x80:11(16546-16573)Online publication date: 1-Jul-2024
  • (2023)HPC Application Performance Prediction with Machine Learning on New ArchitecturesProceedings of the 2023 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn Strategy10.1145/3588993.3597262(1-8)Online publication date: 28-Jul-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 6, Issue 14
September 2013
384 pages

Publisher

VLDB Endowment

Publication History

Published: 01 September 2013
Published in PVLDB Volume 6, Issue 14

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)4
  • Downloads (Last 6 weeks)0
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data ApplicationsProceedings of the VLDB Endowment10.14778/3681954.368199017:11(3151-3164)Online publication date: 1-Jul-2024
  • (2024)TimeLink: enabling dynamic runtime prediction for Flink iterative jobsThe Journal of Supercomputing10.1007/s11227-024-06085-x80:11(16546-16573)Online publication date: 1-Jul-2024
  • (2023)HPC Application Performance Prediction with Machine Learning on New ArchitecturesProceedings of the 2023 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn Strategy10.1145/3588993.3597262(1-8)Online publication date: 28-Jul-2023
  • (2022)Multi-Tenant Cloud Data Services: State-of-the-Art, Challenges and OpportunitiesProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3522566(2465-2473)Online publication date: 10-Jun-2022
  • (2022)Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data ApplicationsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517892(1840-1854)Online publication date: 10-Jun-2022
  • (2021)SkinnerDB: Regret-bounded Query Evaluation via Reinforcement LearningACM Transactions on Database Systems10.1145/346438946:3(1-45)Online publication date: 28-Sep-2021
  • (2021)Expand your Training Limits! Generating Training Data for ML-based Data ManagementProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457286(1865-1878)Online publication date: 9-Jun-2021
  • (2021)Online Runtime Prediction Method for Distributed Iterative JobsWeb Information Systems and Applications10.1007/978-3-030-87571-8_14(156-168)Online publication date: 24-Sep-2021
  • (2020)Caching Cost Model for In-memory Data Analytics FrameworkThe 9th International Conference on Smart Media and Applications10.1145/3426020.3426070(187-189)Online publication date: 17-Sep-2020
  • (2020)A gray-box modeling methodology for runtime prediction of Apache Spark jobsDistributed and Parallel Databases10.1007/s10619-020-07286-y38:4(819-839)Online publication date: 10-Mar-2020
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media