DOI: 10.1145/3673038.3673070
Research article · Open access

Viper: A High-Performance I/O Framework for Transparently Updating, Storing, and Transferring Deep Neural Network Models

Published: 12 August 2024

Abstract

Scientific workflows increasingly need to train a DNN model in real time during an experiment (e.g., using ground truth from a simulation) while simultaneously using it for inference. Instead of sharing the same model instance, the training process (producer) and the inference server (consumer) often use different model replicas that are kept synchronized. Beyond efficient I/O techniques to keep the producer's and consumer's replicas synchronized, there is an important trade-off: frequent model updates improve inference quality but may slow down training, while infrequent updates may yield less accurate inference results. To address these challenges, we introduce Viper, a new I/O framework designed to determine a near-optimal checkpoint schedule and to accelerate the delivery of the latest model updates. Viper builds an inference performance predictor to identify a checkpoint schedule that balances training slowdown against inference quality improvement. It also provides a memory-first model transfer engine that accelerates model delivery through direct memory-to-memory communication. Our experiments show that Viper reduces model update latency by ≈9× using the GPU-to-GPU data transfer engine and ≈3× using DRAM-to-DRAM host data transfer. The checkpoint schedule obtained from Viper's predictor also yields higher cumulative inference accuracy than epoch-based baselines.
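To make the trade-off concrete, the sketch below illustrates, in Python with PyTorch, the kind of decision loop the abstract describes: a producer trains, a predictor estimates whether publishing an update is worth the interruption, and an update is pushed device-to-device when it is. This is not Viper's actual API; all names (should_checkpoint, publish_gpu_to_gpu, the predictor interface) are hypothetical, and the threshold rule is a placeholder for Viper's learned inference performance predictor.

# Illustrative sketch only; Viper's real interfaces are not shown on this page.
# All names below (should_checkpoint, publish_gpu_to_gpu, predictor) are hypothetical.
import torch

def should_checkpoint(predicted_accuracy_gain, predicted_training_slowdown, weight=1.0):
    # Hypothetical rule: checkpoint only when the expected inference-quality
    # gain outweighs the expected training slowdown.
    return predicted_accuracy_gain > weight * predicted_training_slowdown

@torch.no_grad()
def publish_gpu_to_gpu(producer_model, consumer_model):
    # Copy parameters directly between GPU-resident replicas, avoiding a
    # round trip through host memory or the file system.
    for src, dst in zip(producer_model.parameters(), consumer_model.parameters()):
        dst.copy_(src, non_blocking=True)  # device-to-device copy

def train_and_serve(producer, consumer, loader, optimizer, predictor):
    # Hypothetical producer loop: after each step, ask the performance
    # predictor whether publishing a model update is worth the interruption.
    for step, (x, y) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(producer(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        gain, slowdown = predictor(step, loss.item())  # assumed interface
        if should_checkpoint(gain, slowdown):
            publish_gpu_to_gpu(producer, consumer)

The same publish step could instead target a pinned DRAM buffer for host-to-host delivery; the device-to-device copy here merely stands in for the memory-first transfer path the abstract describes.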


Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024, 1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. AI Workflows
  2. Adaptive AI Model Checkpointing
  3. Coupled Training and Inferences
  4. Inferences During Partial Training

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

