Abstract
Data analysts predict that the GPU as a service (GPUaaS) market will grow to support 3D modeling, animated video processing, gaming, and deep learning model training. The main cloud providers already include in their catalogs VMs equipped with GPUs of different types and in different numbers. Since these VMs differ significantly in performance and cost, selecting the most appropriate one for a given job is essential to minimize the training cost. Motivated by these considerations, this paper proposes performance models to predict the training time of neural networks (NNs) deployed on GPUs. The proposed approach is based on machine learning and exploits two main sets of features, capturing both NN properties and hardware characteristics. These data enable the learning of multiple linear regression models that, coupled with an established feature selection technique, become accurate prediction tools, with errors below 12% on average. An extensive experimental campaign, performed on both public and in-house private cloud deployments, considers popular deep NNs used for image classification and speech transcription. The results show that prediction errors remain small even when extrapolating outside the range spanned by the input data, with important implications for the models’ applicability.
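The modeling pipeline described in the abstract — multiple linear regression over NN and hardware features, coupled with a feature selection step — can be sketched as follows. This is a minimal illustration on synthetic data: the feature names, coefficients, and the choice of sequential forward selection are assumptions for demonstration, not the paper's actual dataset or selector.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Illustrative features only: NN properties (e.g., batch size, number of
# layers) combined with hardware characteristics (e.g., GPU count, memory).
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 5))
# Synthetic target: training time per iteration, linear in the features
# (two features are made nearly irrelevant so selection can discard them).
y = X @ np.array([0.5, 1.2, 0.001, 2.0, 0.002]) + rng.normal(0, 1, 200)

# Multiple linear regression coupled with a feature selection step;
# sequential forward selection stands in for the paper's technique.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3)
X_sel = selector.fit_transform(X, y)
model = LinearRegression().fit(X_sel, y)

pred = model.predict(X_sel)
mape = np.mean(np.abs(pred - y) / np.abs(y)) * 100
print(f"MAPE: {mape:.2f}%")
```

On this synthetic data the selected three features recover the relevant predictors and the mean absolute percentage error stays small, mirroring the kind of accuracy figure (average error below 12%) the paper reports.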
Data availability
The datasets generated and analysed during the current study and the source code supporting the experiments are available in the Zenodo repository at https://doi.org/10.5281/zenodo.5327342.
Notes
Experiment data and the source code of our scripts for model training and testing are available at https://doi.org/10.5281/zenodo.5327342.
An implementation of ResNet with a variable number of inner modules is available at https://github.com/KellerJordan/ResNet-PyTorch-CIFAR10.
Acknowledgements
The results of this work have been partially funded by ATMOSPHERE (grant agreement no. 777154), a Research and Innovation Action funded by the European Commission under the Cooperation Programme, Horizon 2020 and the Ministério de Ciência, Tecnologia e Inovação, RNP/Brazil. Danilo Ardagna’s work is also supported by the AI-SPRINT (grant agreement no. 101016577) H2020 project. GPU cloud experiments have been supported by Microsoft under the Top Compsci University Azure Adoption program.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Number of samples in training data used in performance prediction models
This appendix presents the number of samples used to train each of the models presented in Sect. 5. Table 10 reports the data for the hold-out scenario, while the remaining tables detail the size of the training datasets for the extrapolation experiments. For example, Table 11 shows that the training sets of the extrapolation experiment on the batch AlexNet implementation in PyTorch contain 72 and 204 samples when considering one and two P600 GPUs, respectively (Tables 12, 13, 14).
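The difference between the hold-out and extrapolation training sets referenced above can be sketched as follows. The GPU counts and sample data are illustrative assumptions: the point is that a hold-out split samples test data from the same configuration range as the training data, whereas an extrapolation split trains only on small configurations and tests on larger ones.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative dataset: each sample records the number of P600 GPUs used
# and the measured training time per iteration.
gpus = rng.choice([1, 2, 4, 8], size=300)
time_per_iter = 10.0 / gpus + rng.normal(0, 0.1, 300)

# Hold-out: random split, test samples span the same configurations.
idx = rng.permutation(300)
holdout_train, holdout_test = idx[:240], idx[240:]

# Extrapolation: train only on small configurations (1 and 2 GPUs),
# test on configurations outside the training range (4 and 8 GPUs).
extrap_train = np.where(gpus <= 2)[0]
extrap_test = np.where(gpus > 2)[0]
print(len(extrap_train), len(extrap_test))
```

Under this scheme the extrapolation training-set size depends on how many measurements fall in the small configurations, which is why the appendix tables report a different sample count for each experiment.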
Appendix 2: Root mean square error (RMSE) of performance prediction models
This appendix presents the RMSE on the estimation of the training time per iteration for the models presented in Sect. 5 (Tables 15, 16, 17, 18, 19).
Appendix 3: Mean absolute error (MAE) of performance prediction models
This appendix presents the MAE on the estimation of the training time per iteration for the models presented in Sect. 5 (Tables 20, 21, 22, 23, 24).
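For reference, the two error metrics tabulated in Appendices 2 and 3 can be computed as follows; the measured and predicted times are illustrative values, not figures from the paper.

```python
import numpy as np

# Measured and predicted training times per iteration (illustrative values).
y_true = np.array([120.0, 95.0, 210.0, 180.0])
y_pred = np.array([112.0, 101.0, 205.0, 190.0])

# Root mean square error: penalizes large deviations more heavily.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # → 7.5
# Mean absolute error: average magnitude of the prediction error.
mae = np.mean(np.abs(y_true - y_pred))  # → 7.25
print(rmse, mae)
```

Because RMSE squares the residuals before averaging, it always satisfies RMSE ≥ MAE, with the gap growing when errors are unevenly distributed across samples.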
About this article
Cite this article
Lattuada, M., Gianniti, E., Ardagna, D. et al. Performance prediction of deep learning applications training in GPU as a service systems. Cluster Comput 25, 1279–1302 (2022). https://doi.org/10.1007/s10586-021-03428-8