DOI: 10.5555/2685048.2685094

Project Adam: Building an Efficient and Scalable Deep Learning Training System

Published: 06 October 2014

Abstract

Large deep neural network models have recently demonstrated state-of-the-art accuracy on hard visual recognition tasks. Unfortunately, such models are extremely time-consuming to train and require large amounts of compute cycles. We describe the design and implementation of a distributed system called Adam, built from commodity server machines, that trains such models with world-class performance, scaling, and task accuracy on visual recognition tasks. Adam achieves high efficiency and scalability through whole-system co-design that optimizes and balances workload computation and communication. We exploit asynchrony throughout the system to improve performance and show that it additionally improves the accuracy of trained models. Adam is significantly more efficient and scalable than was previously thought possible: on the ImageNet 22,000-category image classification task, it used 30x fewer machines to train a large 2-billion-connection model to 2x higher accuracy in comparable time than the system that previously held the record for this benchmark. We also show that task accuracy improves with larger models. Our results provide compelling evidence that a distributed systems-driven approach to deep learning using current training algorithms is worth pursuing.

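The abstract's central systems claim is that asynchrony can be exploited throughout training without hurting model accuracy, in the lock-free, parameter-server style of update popularized by Hogwild! and DistBelief. As a rough illustration of that update pattern (a minimal single-machine toy, not Adam's actual implementation; all names are illustrative), the Python sketch below trains a small logistic-regression model with worker threads that read possibly stale shared weights and write gradient updates back without locks:

    # Minimal sketch of asynchronous, lock-free SGD (Hogwild-style).
    # Illustrates the update pattern only, NOT Adam's implementation: in the
    # real system the "workers" are machines and the shared state lives on
    # parameter-server shards, not in a numpy array.
    import threading

    import numpy as np

    N, D, WORKERS = 4096, 20, 4
    data_rng = np.random.default_rng(0)

    # Toy linearly separable binary-classification data, sharded across workers.
    true_w = data_rng.normal(size=D)
    X = data_rng.normal(size=(N, D))
    y = (X @ true_w > 0).astype(np.float64)
    shards = np.array_split(np.arange(N), WORKERS)

    # Shared "parameter server" state: all workers read and write it.
    w = np.zeros(D)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def worker(wid, shard, lr=0.05, epochs=30):
        rng = np.random.default_rng(wid)  # per-worker RNG; generators are not thread-safe
        for _ in range(epochs):
            for i in rng.permutation(shard):
                w_snapshot = w.copy()  # read a possibly stale copy of the weights
                grad = (sigmoid(X[i] @ w_snapshot) - y[i]) * X[i]  # logistic-loss gradient
                w[:] = w - lr * grad  # racy, lock-free update of the shared weights

    threads = [threading.Thread(target=worker, args=(k, shards[k]))
               for k in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    acc = np.mean((sigmoid(X @ w) > 0.5) == y)
    print(f"training accuracy after async updates: {acc:.3f}")

Python's GIL keeps this toy from running truly in parallel, but it preserves the property the paper exploits: stochastic gradient descent tolerates stale, racy weight updates, so the system can drop synchronization and scale. In Adam itself the workers are separate commodity machines pushing updates to shared model state rather than threads sharing an array.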

Published In

OSDI'14: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation
October 2014, 676 pages
ISBN: 9781931971164

Sponsor

USENIX Association

Publisher

USENIX Association, United States
