1 Result for: Author: Dang, Hoang-Vu
- research-article, January 2025
Resiliency at scale: managing Google's TPUv4 machine learning supercomputer
- Yazhou Zu,
- Alireza Ghaffarkhah,
- Hoang-Vu Dang,
- Brian Towles,
- Steven Hand,
- Safeen Huda,
- Adekunle Bello,
- Alexander Kolbasov,
- Arash Rezaei,
- Dayou Du,
- Steve Lacy,
- Hang Wang,
- Aaron Wisner,
- Chris Lewis,
- Henri Bahini
NSDI'24: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, Article No.: 42, Pages 761–774
TPUv4 (Tensor Processing Unit) is Google's 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software ...