Abstract
Many efforts to build autonomous (or self-driving) data processing systems today rely on the “black-box” algorithm of Bayesian Optimization (BO), both for its wide applicability and for the theoretical guarantees it provides on the quality of results. The black-box approach, however, can be time- and labor-intensive, or can get stuck in a local minimum. We study the important problem of auto-tuning memory allocation for applications running on modern distributed data processing systems. We develop a simple “white-box” model that can quickly separate good configurations from bad ones. To combine the benefits of the two approaches to tuning, we build a framework called Guided Bayesian Optimization (GBO) that uses the white-box model as a guide during the BO exploration process. An evaluation on Apache Spark using industry-standard benchmark applications shows that GBO consistently provides performance speedups across the application workload, with savings of close to 2x.
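The idea of guiding a black-box search with a cheap white-box model can be illustrated with a small, self-contained sketch. It is not the GBO implementation: the objective `run_time`, the check `white_box_ok`, and every constant are hypothetical, the abstract does not say how the white-box model enters the search (pre-filtering candidates is only one plausible reading), and a simple kernel smoother stands in for the Gaussian process surrogate that Bayesian Optimization would normally use.

```python
import math
import random


def run_time(mem_fraction):
    """Hypothetical black-box objective: run time for a memory setting in [0, 1]."""
    return (mem_fraction - 0.6) ** 2 * 100 + 10


def white_box_ok(mem_fraction):
    """Hypothetical white-box model: cheaply rejects settings expected to be bad."""
    return 0.3 <= mem_fraction <= 0.9


def surrogate(x, observed, length=0.1):
    """Kernel-weighted mean of observed run times, plus a crude uncertainty
    (distance to the nearest observation). A stand-in for a Gaussian process."""
    weights = [math.exp(-((x - xi) / length) ** 2) for xi, _ in observed]
    total = sum(weights)
    if total < 1e-12:
        # Far from all observations: fall back to the plain average.
        mean = sum(y for _, y in observed) / len(observed)
    else:
        mean = sum(w * y for w, (_, y) in zip(weights, observed)) / total
    uncertainty = min(abs(x - xi) for xi, _ in observed)
    return mean, uncertainty


def guided_search(budget=10, seed=0, explore=50.0):
    """Sequential model-based search restricted to white-box-approved candidates."""
    rng = random.Random(seed)
    candidates = [i / 100 for i in range(101)]
    # Guidance step (an assumption): discard candidates the white-box model rejects.
    guided = [c for c in candidates if white_box_ok(c)]
    first = rng.choice(guided)
    observed = [(first, run_time(first))]
    for _ in range(budget - 1):
        tried = {x for x, _ in observed}
        pool = [c for c in guided if c not in tried]

        def lower_confidence_bound(c):
            mean, uncertainty = surrogate(c, observed)
            return mean - explore * uncertainty

        # Pick the candidate that looks best once uncertainty is credited.
        nxt = min(pool, key=lower_confidence_bound)
        observed.append((nxt, run_time(nxt)))
    return min(observed, key=lambda pair: pair[1])


if __name__ == "__main__":
    best_x, best_y = guided_search()
    print(f"best memory fraction: {best_x:.2f}, run time: {best_y:.2f}")
```

Because every evaluated configuration has passed the white-box check, none of the limited evaluation budget is spent on settings the cheap model already flags as bad; in this toy setup that caps the worst evaluated run time at about 19, against 46 for the unfiltered range.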
Acknowledgements
This work was supported by the National Science Foundation (Grant No. CNS-1423128). The author would like to thank Dr. Shivnath Babu for his expert feedback on the system design.
Cite this article
Kunjir, M. Speeding up AutoTuning of the Memory Management Options in Data Analytics. Distrib Parallel Databases 38, 841–863 (2020). https://doi.org/10.1007/s10619-019-07281-y
DOI: https://doi.org/10.1007/s10619-019-07281-y