Speeding up AutoTuning of the Memory Management Options in Data Analytics

Kunjir, Mayuresh

doi:10.1007/s10619-019-07281-y

Published: 03 January 2020

Distributed and Parallel Databases, volume 38, pages 841–863 (2020)

Abstract

Many approaches to building autonomous (or self-driving) data processing systems today rely on the “black-box” algorithm of Bayesian Optimization (BO), both because it applies widely and because it offers theoretical guarantees on the quality of the results. The black-box approach, however, can be time- and labor-intensive, or it can get stuck in a local minimum. We study the important problem of auto-tuning memory allocation for applications running on modern distributed data processing systems. We develop a simple “white-box” model that can quickly separate good configurations from bad ones. To combine the benefits of the two approaches to tuning, we build a framework called Guided Bayesian Optimization (GBO) that uses the white-box model as a guide during the Bayesian Optimization exploration process. An evaluation carried out on Apache Spark using industry-standard benchmark applications shows that GBO consistently provides performance speedups across the application workload, with savings close to 2x.
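The abstract does not spell out how the white-box model guides the search, so the following is only a minimal sketch of one possible reading: a cheap analytical model screens out candidate configurations it considers bad before a surrogate picks the next one to run. Everything here is an assumption made for illustration, including the two-parameter configuration (`heap_gb`, `cache_fraction`), the penalty formula, the synthetic `run_application` stand-in for a real benchmark run, and the inverse-distance surrogate, which replaces the Gaussian-process surrogate a real Bayesian Optimization loop would typically use.

```python
import random

# Hypothetical configuration: (heap_gb, cache_fraction). Illustrative names only.

def white_box_penalty(heap_gb, cache_fraction, data_gb=4.0):
    """Cheap analytical guess (assumed formula): penalize configurations whose
    cache pool cannot hold the working set or that starve execution memory."""
    cache_gb = heap_gb * cache_fraction
    exec_gb = heap_gb - cache_gb
    spill = max(0.0, data_gb - cache_gb)   # data that would not fit in the cache
    pressure = max(0.0, 1.0 - exec_gb)     # less than 1 GB left for execution
    return spill + 5.0 * pressure

def run_application(heap_gb, cache_fraction):
    """Synthetic stand-in for an expensive benchmark run; returns seconds."""
    return 60.0 + 10.0 * white_box_penalty(heap_gb, cache_fraction) + 2.0 * heap_gb

def surrogate_estimate(config, history):
    """Inverse-distance-weighted mean of observed runtimes.
    This is a simplification, not a Gaussian process."""
    num = den = 0.0
    for past, runtime in history:
        d2 = (config[0] - past[0]) ** 2 + (config[1] - past[1]) ** 2
        if d2 == 0.0:
            return runtime
        num += runtime / d2
        den += 1.0 / d2
    return num / den

def guided_search(iterations=10, candidates=40, keep=10, seed=0):
    """Return the best (config, runtime) pair found by the guided loop."""
    rng = random.Random(seed)

    def sample():
        return (rng.uniform(1.0, 8.0), rng.uniform(0.1, 0.9))

    first = sample()
    history = [(first, run_application(*first))]
    for _ in range(iterations):
        pool = [sample() for _ in range(candidates)]
        # Guide step: keep only the candidates the white-box model ranks best.
        pool.sort(key=lambda c: white_box_penalty(*c))
        shortlist = pool[:keep]
        # Surrogate step: run the shortlisted candidate with the lowest estimate.
        nxt = min(shortlist, key=lambda c: surrogate_estimate(c, history))
        history.append((nxt, run_application(*nxt)))
    return min(history, key=lambda h: h[1])

if __name__ == "__main__":
    best_config, best_runtime = guided_search()
    print(best_config, best_runtime)
```

In this sketch the guide costs nothing to evaluate, so it can prune a large candidate pool before any expensive run is spent; how much pruning to apply (`keep` out of `candidates`) is a free design choice here, not a value taken from the article.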

Acknowledgements

This work was supported by the National Science Foundation (Grant No. CNS-1423128). The author would like to thank Dr. Shivnath Babu for his expert feedback on the system design.

Author information

Authors and Affiliations

Duke University, Durham, NC, USA
Mayuresh Kunjir

Corresponding author

Correspondence to Mayuresh Kunjir.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kunjir, M. Speeding up AutoTuning of the Memory Management Options in Data Analytics. Distrib Parallel Databases 38, 841–863 (2020). https://doi.org/10.1007/s10619-019-07281-y

Published: 03 January 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10619-019-07281-y
