
Crossbow: scaling deep learning with small batch sizes on multi-GPU servers

Published: 01 July 2019

Abstract

    Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific.
We describe Crossbow, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size---however small---while scaling to multiple GPUs. Crossbow uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model. Crossbow achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Our experiments show that Crossbow improves the training time of deep learning models on an 8-GPU server by 1.3--4X compared to TensorFlow.
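As context for the first paragraph of the abstract, here is a minimal sketch of the gradient-averaging scheme it describes. NumPy stands in for the multi-GPU machinery, and the callback grad_fn is an assumed name rather than any framework's API.

```python
# A minimal NumPy sketch (not Crossbow or TensorFlow code) of parallel
# synchronous SGD: one batch is partitioned across the GPUs, each "GPU"
# computes a partial gradient on its shard, and the averaged gradient
# produces a single update to the global model.
import numpy as np

def sync_sgd_step(weights, batch, grad_fn, num_gpus=8, lr=0.01):
    shards = np.array_split(batch, num_gpus)        # partition the batch across GPUs
    partials = [grad_fn(weights, shard) for shard in shards]
    weights -= lr * np.mean(partials, axis=0)       # average partial gradients, update once
    return weights
```

With a fixed overall batch size, each shard shrinks as num_gpus grows and the GPUs are under-utilised; enlarging the batch to compensate is what hurts statistical efficiency, as the abstract notes.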
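The second paragraph summarises SMA without giving its update rule. The sketch below shows one plausible synchronous model-averaging iteration under the assumptions stated in the comments; the correction rate alpha and the averaging of corrections are illustrative choices, not Crossbow's exact formulation.

```python
# A hedged NumPy sketch of one SMA-style iteration: each replica takes an
# independent SGD step on its own small batch, is then nudged towards a
# globally-consistent average (central) model, and the central model is
# moved along the replicas' average trajectory. `alpha` and `grad_fn` are
# illustrative assumptions, not Crossbow's actual interface.
import numpy as np

def sma_step(replicas, central, batches, grad_fn, lr=0.01, alpha=0.1):
    correction = np.zeros_like(central)
    for w, batch in zip(replicas, batches):
        w -= lr * grad_fn(w, batch)       # independent exploration per replica
        diff = w - central                # divergence from the average model
        w -= alpha * diff                 # synchronously pull the replica back
        correction += alpha * diff        # ...and push the central model forward
    central += correction / len(replicas)
    return replicas, central
```

Because each replica trains on its own small batch, several replicas can share one GPU to keep it busy; Crossbow tunes that replica count at runtime (the sketch leaves replica placement out).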



    Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 11 (July 2019), 543 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 July 2019
    Published in PVLDB Volume 12, Issue 11

    Qualifiers

    • Research-article

Article Metrics

• Downloads (Last 12 months): 35
• Downloads (Last 6 weeks): 0
    Reflects downloads up to 26 Jul 2024

Cited By
• (2024) An Analysis of Collocation on GPUs for Deep Learning Training. Proceedings of the 4th Workshop on Machine Learning and Systems, pages 81-90. DOI: 10.1145/3642970.3655827. Online publication date: 22-Apr-2024.
• (2024) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format. Proceedings of the ACM on Management of Data, 2(1):1-31. DOI: 10.1145/3639307. Online publication date: 26-Mar-2024.
• (2024) Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. Proceedings of the Nineteenth European Conference on Computer Systems, pages 1075-1092. DOI: 10.1145/3627703.3629578. Online publication date: 22-Apr-2024.
• (2023) TorchOpt. The Journal of Machine Learning Research, 24(1):17651-17664. DOI: 10.5555/3648699.3649066. Online publication date: 1-Jan-2023.
• (2023) GEAR. Proceedings of the 40th International Conference on Machine Learning, pages 36380-36390. DOI: 10.5555/3618408.3619921. Online publication date: 23-Jul-2023.
• (2023) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proceedings of the VLDB Endowment, 17(2):211-224. DOI: 10.14778/3626292.3626303. Online publication date: 1-Oct-2023.
• (2023) Data Management and Visualization for Benchmarking Deep Learning Training Systems. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning, pages 1-5. DOI: 10.1145/3595360.3595851. Online publication date: 18-Jun-2023.
• (2023) Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3581784.3607060. Online publication date: 12-Nov-2023.
• (2023) Profiling and Monitoring Deep Learning Training Tasks. Proceedings of the 3rd Workshop on Machine Learning and Systems, pages 18-25. DOI: 10.1145/3578356.3592589. Online publication date: 8-May-2023.
• (2023) Elastic Averaging for Efficient Pipelined DNN Training. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 380-391. DOI: 10.1145/3572848.3577484. Online publication date: 25-Feb-2023.
