
Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis

Published: 30 August 2019

    Abstract

    Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
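
    To make the data-parallel scheme surveyed here concrete, the sketch below (not taken from the article; the model, data, and hyperparameters are invented for illustration) simulates synchronous data-parallel SGD in plain NumPy: each worker computes a gradient on its shard of the global minibatch, the gradients are averaged, which is the role an allreduce collective plays on a real cluster, and every replica applies the same update.

    # Illustrative sketch of synchronous data-parallel SGD (simulated workers).
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic linear-regression problem: y = X @ w_true + noise.
    n_samples, n_features = 1024, 16
    X = rng.normal(size=(n_samples, n_features))
    w_true = rng.normal(size=n_features)
    y = X @ w_true + 0.01 * rng.normal(size=n_samples)

    def shard_gradient(w, X_shard, y_shard):
        """Mean-squared-error gradient computed on one worker's shard."""
        residual = X_shard @ w - y_shard
        return 2.0 * X_shard.T @ residual / len(y_shard)

    num_workers = 4           # simulated ranks
    batch_size = 256          # global minibatch, split evenly across workers
    lr = 0.05
    w = np.zeros(n_features)  # model parameters, replicated on every worker

    for step in range(200):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        shards = np.array_split(idx, num_workers)

        # Each worker computes a local gradient on its shard of the minibatch.
        local_grads = [shard_gradient(w, X[s], y[s]) for s in shards]

        # "Allreduce": average the local gradients so all workers agree.
        global_grad = np.mean(local_grads, axis=0)

        # Identical update on every replica keeps the models consistent.
        w -= lr * global_grad

    print("parameter error:", np.linalg.norm(w - w_true))

    On an actual cluster the averaging step would be an allreduce over worker processes (e.g., via MPI or NCCL), and the asynchronous variants discussed in the survey let workers apply updates without waiting for stragglers, at the cost of gradient staleness.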



    Published In

    ACM Computing Surveys, Volume 52, Issue 4
    July 2020, 769 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/3359984
    Editor: Sartaj Sahni

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 August 2019
    Accepted: 01 March 2019
    Received: 01 August 2018
    Published in CSUR Volume 52, Issue 4

    Author Tags

    1. Deep learning
    2. distributed computing
    3. parallel algorithms

    Qualifiers

    • Survey
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 1,276
    • Downloads (Last 6 weeks): 389

    Cited By

    • (2024) A snapshot of parallelism in distributed deep learning training. Revista Colombiana de Computación 25:1 (60-73). DOI: 10.29375/25392115.5054. Online publication date: 30-Jun-2024
    • (2024) Federated Meta Reinforcement Learning for Personalized Tasks. Tsinghua Science and Technology 29:3 (911-926). DOI: 10.26599/TST.2023.9010066. Online publication date: Jun-2024
    • (2024) HPC-Coder: Modeling Parallel Programs using Large Language Models. ISC High Performance 2024 Research Paper Proceedings (39th International Conference) (1-12). DOI: 10.23919/ISC.2024.10528929. Online publication date: May-2024
    • (2024) An Introduction to the Compute Express Link (CXL) Interconnect. ACM Computing Surveys 56:11 (1-37). DOI: 10.1145/3669900. Online publication date: 8-Jul-2024
    • (2024) System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing (121-130). DOI: 10.1145/3662158.3662806. Online publication date: 17-Jun-2024
    • (2024) Byzantine Machine Learning: A Primer. ACM Computing Surveys 56:7 (1-39). DOI: 10.1145/3616537. Online publication date: 9-Apr-2024
    • (2024) Enhancing Training of Physics-Informed Neural Networks Using Domain Decomposition–Based Preconditioning Strategies. SIAM Journal on Scientific Computing (S46-S67). DOI: 10.1137/23M1583375. Online publication date: 11-Apr-2024
    • (2024) AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost. IEEE Transactions on Parallel and Distributed Systems 35:8 (1331-1344). DOI: 10.1109/TPDS.2024.3397800. Online publication date: Aug-2024
    • (2024) Graft: Efficient Inference Serving for Hybrid Deep Learning With SLO Guarantees via DNN Re-Alignment. IEEE Transactions on Parallel and Distributed Systems 35:2 (280-296). DOI: 10.1109/TPDS.2023.3340518. Online publication date: 1-Feb-2024
    • (2024) Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (1-20). DOI: 10.1109/TPAMI.2023.3303431. Online publication date: 2024
    • Show More Cited By
