Abstract
Data analysts predict that the GPU as a service (GPUaaS) market will grow to support 3D modeling, animated video processing, gaming, and deep learning model training. The main cloud providers already include in their catalogs VMs equipped with GPUs of different types and in different numbers. Since these VMs differ significantly in performance and cost, selecting the most appropriate one for a given job is essential to minimize the training cost. Motivated by these considerations, this paper proposes performance models to predict the training time of neural networks (NNs) deployed on GPUs. The proposed approach is based on machine learning and exploits two main sets of features, capturing both NN properties and hardware characteristics. These data enable the learning of multiple linear regression models that, coupled with an established feature selection technique, become accurate prediction tools, with errors below 12% on average. An extensive experimental campaign, performed on both public and in-house private cloud deployments, considers popular deep NNs used for image classification and speech transcription. The results show that prediction errors remain small even when extrapolating outside the range spanned by the input data, with important implications for the models’ applicability.
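The modeling pipeline described in the abstract — multiple linear regression over NN and hardware features, coupled with a feature selection step — can be sketched as follows. This is a minimal illustration on synthetic data: the feature names, coefficients, and the choice of sequential forward selection are assumptions for demonstration, not the paper's actual dataset or selector.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Illustrative features only: NN properties (e.g., batch size, number of
# layers) combined with hardware characteristics (e.g., GPU count, memory).
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 5))
# Synthetic target: training time per iteration, linear in the features
# (two features are made nearly irrelevant so selection can discard them).
y = X @ np.array([0.5, 1.2, 0.001, 2.0, 0.002]) + rng.normal(0, 1, 200)

# Multiple linear regression coupled with a feature selection step;
# sequential forward selection stands in for the paper's technique.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3)
X_sel = selector.fit_transform(X, y)
model = LinearRegression().fit(X_sel, y)

pred = model.predict(X_sel)
mape = np.mean(np.abs(pred - y) / np.abs(y)) * 100
print(f"MAPE: {mape:.2f}%")
```

On this synthetic data the selected three features recover the relevant predictors and the mean absolute percentage error stays small, mirroring the kind of accuracy figure (average error below 12%) the paper reports.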
Data availability
The datasets generated and analysed during the current study and the source code supporting the experiments are available in the Zenodo repository at https://doi.org/10.5281/zenodo.5327342.
Notes
Experiment data and the source code of our scripts for model training and testing are available at https://doi.org/10.5281/zenodo.5327342.
An implementation of ResNet with a variable number of inner modules is available at https://github.com/KellerJordan/ResNet-PyTorch-CIFAR10.
Acknowledgements
The results of this work have been partially funded by ATMOSPHERE (grant agreement no. 777154), a Research and Innovation Action funded by the European Commission under the Cooperation Programme, Horizon 2020 and the Ministério de Ciência, Tecnologia e Inovação, RNP/Brazil. Danilo Ardagna’s work is also supported by the AI-SPRINT (grant agreement no. 101016577) H2020 project. GPU cloud experiments have been supported by Microsoft under the Top Compsci University Azure Adoption program.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Number of samples in training data used in performance prediction models
This appendix presents the number of samples used to train each of the models presented in Sect. 5. Table 10 reports the data for the hold-out scenario, while the remaining tables detail the size of the training datasets for the extrapolation experiments. For example, Table 11 shows that the training sets of the extrapolation experiment on the batch AlexNet implementation in PyTorch contain 72 and 204 samples when considering one and two P600 GPUs, respectively (Tables 12, 13, 14).
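The difference between the hold-out and extrapolation training sets referenced above can be sketched as follows. The GPU counts and sample data are illustrative assumptions: the point is that a hold-out split samples test data from the same configuration range as the training data, whereas an extrapolation split trains only on small configurations and tests on larger ones.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative dataset: each sample records the number of P600 GPUs used
# and the measured training time per iteration.
gpus = rng.choice([1, 2, 4, 8], size=300)
time_per_iter = 10.0 / gpus + rng.normal(0, 0.1, 300)

# Hold-out: random split, test samples span the same configurations.
idx = rng.permutation(300)
holdout_train, holdout_test = idx[:240], idx[240:]

# Extrapolation: train only on small configurations (1 and 2 GPUs),
# test on configurations outside the training range (4 and 8 GPUs).
extrap_train = np.where(gpus <= 2)[0]
extrap_test = np.where(gpus > 2)[0]
print(len(extrap_train), len(extrap_test))
```

Under this scheme the extrapolation training-set size depends on how many measurements fall in the small configurations, which is why the appendix tables report a different sample count for each experiment.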
Appendix 2: Root mean square error (RMSE) of performance prediction models
This appendix presents the RMSE on the estimation of the training time per iteration for the models presented in Sect. 5 (Tables 15, 16, 17, 18, 19).
Appendix 3: Mean absolute error (MAE) of performance prediction models
This appendix presents the MAE on the estimation of the training time per iteration for the models presented in Sect. 5 (Tables 20, 21, 22, 23, 24).
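For reference, the two error metrics tabulated in Appendices 2 and 3 can be computed as follows; the measured and predicted times are illustrative values, not figures from the paper.

```python
import numpy as np

# Measured and predicted training times per iteration (illustrative values).
y_true = np.array([120.0, 95.0, 210.0, 180.0])
y_pred = np.array([112.0, 101.0, 205.0, 190.0])

# Root mean square error: penalizes large deviations more heavily.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # → 7.5
# Mean absolute error: average magnitude of the prediction error.
mae = np.mean(np.abs(y_true - y_pred))  # → 7.25
print(rmse, mae)
```

Because RMSE squares the residuals before averaging, it always satisfies RMSE ≥ MAE, with the gap growing when errors are unevenly distributed across samples.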
About this article
Cite this article
Lattuada, M., Gianniti, E., Ardagna, D. et al. Performance prediction of deep learning applications training in GPU as a service systems. Cluster Comput 25, 1279–1302 (2022). https://doi.org/10.1007/s10586-021-03428-8