
Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis

Published: 30 August 2019

    Abstract

    Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
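
    To make the data-parallel scheme surveyed here concrete, the sketch below (not taken from the article; the model, data, and hyperparameters are invented for illustration) simulates synchronous data-parallel SGD in plain NumPy: each worker computes a gradient on its shard of the global minibatch, the gradients are averaged, which is the role an allreduce collective plays on a real cluster, and every replica applies the same update.

    # Illustrative sketch of synchronous data-parallel SGD (simulated workers).
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic linear-regression problem: y = X @ w_true + noise.
    n_samples, n_features = 1024, 16
    X = rng.normal(size=(n_samples, n_features))
    w_true = rng.normal(size=n_features)
    y = X @ w_true + 0.01 * rng.normal(size=n_samples)

    def shard_gradient(w, X_shard, y_shard):
        """Mean-squared-error gradient computed on one worker's shard."""
        residual = X_shard @ w - y_shard
        return 2.0 * X_shard.T @ residual / len(y_shard)

    num_workers = 4           # simulated ranks
    batch_size = 256          # global minibatch, split evenly across workers
    lr = 0.05
    w = np.zeros(n_features)  # model parameters, replicated on every worker

    for step in range(200):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        shards = np.array_split(idx, num_workers)

        # Each worker computes a local gradient on its shard of the minibatch.
        local_grads = [shard_gradient(w, X[s], y[s]) for s in shards]

        # "Allreduce": average the local gradients so all workers agree.
        global_grad = np.mean(local_grads, axis=0)

        # Identical update on every replica keeps the models consistent.
        w -= lr * global_grad

    print("parameter error:", np.linalg.norm(w - w_true))

    On an actual cluster the averaging step would be an allreduce over worker processes (e.g., via MPI or NCCL), and the asynchronous variants discussed in the survey let workers apply updates without waiting for stragglers, at the cost of gradient staleness.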



    Published In

    ACM Computing Surveys, Volume 52, Issue 4
    July 2020, 769 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/3359984
    Editor: Sartaj Sahni

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 August 2019
    Accepted: 01 March 2019
    Received: 01 August 2018
    Published in CSUR Volume 52, Issue 4

    Author Tags

    1. Deep learning
    2. distributed computing
    3. parallel algorithms

    Qualifiers

    • Survey
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 1,276
    • Downloads (Last 6 weeks): 389

    Cited By

    • (2024) A snapshot of parallelism in distributed deep learning training. Revista Colombiana de Computación 25:1 (60-73). DOI: 10.29375/25392115.5054. Online publication date: 30-Jun-2024
    • (2024) Federated Meta Reinforcement Learning for Personalized Tasks. Tsinghua Science and Technology 29:3 (911-926). DOI: 10.26599/TST.2023.9010066. Online publication date: Jun-2024
    • (2024) HPC-Coder: Modeling Parallel Programs using Large Language Models. ISC High Performance 2024 Research Paper Proceedings (39th International Conference) (1-12). DOI: 10.23919/ISC.2024.10528929. Online publication date: May-2024
    • (2024) An Introduction to the Compute Express Link (CXL) Interconnect. ACM Computing Surveys 56:11 (1-37). DOI: 10.1145/3669900. Online publication date: 8-Jul-2024
    • (2024) System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing (121-130). DOI: 10.1145/3662158.3662806. Online publication date: 17-Jun-2024
    • (2024) Byzantine Machine Learning: A Primer. ACM Computing Surveys 56:7 (1-39). DOI: 10.1145/3616537. Online publication date: 9-Apr-2024
    • (2024) Enhancing Training of Physics-Informed Neural Networks Using Domain Decomposition–Based Preconditioning Strategies. SIAM Journal on Scientific Computing (S46-S67). DOI: 10.1137/23M1583375. Online publication date: 11-Apr-2024
    • (2024) AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost. IEEE Transactions on Parallel and Distributed Systems 35:8 (1331-1344). DOI: 10.1109/TPDS.2024.3397800. Online publication date: Aug-2024
    • (2024) Graft: Efficient Inference Serving for Hybrid Deep Learning With SLO Guarantees via DNN Re-Alignment. IEEE Transactions on Parallel and Distributed Systems 35:2 (280-296). DOI: 10.1109/TPDS.2023.3340518. Online publication date: 1-Feb-2024
    • (2024) Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (1-20). DOI: 10.1109/TPAMI.2023.3303431. Online publication date: 2024
    • Show More Cited By
