DOI: 10.1109/MICRO.2018.00023
Research article

A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks

Published: 20 October 2018

Abstract

Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months) without leveraging distributed systems. Even distributed training takes inordinate time, of which a large fraction is spent in communicating weights and gradients over the network. State-of-the-art distributed training algorithms use a hierarchy of worker-aggregator nodes. The aggregators repeatedly receive gradient updates from their allocated group of workers and send back the updated weights. This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs). To maximize the benefits of in-network acceleration, the proposed solution, named INCEPTIONN (In-Network Computing to Exchange and Process Training Information Of Neural Networks), uniquely combines hardware and algorithmic innovations by exploiting the following three observations. (1) Gradients are significantly more tolerant to precision loss than weights and as such lend themselves better to aggressive compression without the need for complex mechanisms to avert any loss. (2) Existing training algorithms communicate gradients in only one leg of the communication, which reduces the opportunities for in-network acceleration of compression. (3) With compression, the aggregators can become a bottleneck, as they need to compress/decompress multiple streams from their allocated worker group.
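
As a rough, self-contained illustration of observation (1), and not the paper's actual scheme, the short sketch below zeroes low-order mantissa bits of synthetic float32 gradients and reports the resulting relative error; the function name, the truncation approach, and the gradient distribution are all assumptions made for this example.

# Illustrative sketch only (not the paper's compression algorithm): zeroing the
# low-order mantissa bits of float32 gradients is a crude lossy "compression"
# that shows why gradients tolerate precision loss.
import numpy as np

def truncate_mantissa(x: np.ndarray, kept_bits: int) -> np.ndarray:
    # Keep the sign bit, the 8 exponent bits, and the top `kept_bits` mantissa
    # bits; zero the remaining (23 - kept_bits) mantissa bits of each value.
    assert x.dtype == np.float32 and 0 <= kept_bits <= 23
    drop = 23 - kept_bits
    mask = np.uint32(((1 << 32) - 1) ^ ((1 << drop) - 1))
    return (x.view(np.uint32) & mask).view(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Gradients are typically small, roughly zero-centered values (assumed here).
    grads = rng.normal(scale=1e-3, size=1_000_000).astype(np.float32)
    for kept in (16, 8, 4):
        approx = truncate_mantissa(grads, kept)
        rel_err = np.abs(approx - grads).mean() / np.abs(grads).mean()
        print(f"kept mantissa bits = {kept:2d}  mean relative error = {rel_err:.2e}")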
To this end, we first propose a lightweight and hardware-friendly lossy-compression algorithm for floating-point gradients, which exploits their unique value characteristics. This compression not only reduces gradient communication significantly with practically no loss of accuracy, but is also simple enough for direct implementation as a hardware block in the NIC. To maximize the opportunities for compression and avoid the bottleneck at aggregators, we also propose an aggregator-free training algorithm that exchanges gradients in both legs of communication within the group, while the workers collectively perform the aggregation in a distributed manner. Without changing the mathematics of training, this algorithm leverages the associative property of the aggregation operator and enables our in-network accelerators to (1) apply compression to all communications, and (2) prevent the aggregator nodes from becoming bottlenecks. Our experiments demonstrate that INCEPTIONN reduces the communication time by 70.9–80.7% and offers a 2.2–3.1× speedup over the conventional training system, while achieving the same level of accuracy.
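
The aggregator-free algorithm rests on the associativity of the aggregation (sum) operator: the total gradient can be accumulated piecewise by the workers themselves, so no single node has to receive and reduce every stream. The single-process sketch below models this with the familiar reduce-scatter plus all-gather pattern; it is only an illustration of the principle under that assumption, not the paper's group protocol, its compression, or its NIC offload, and every name in it is hypothetical.

# Minimal, single-process model of aggregator-free gradient aggregation.
from typing import List
import numpy as np

def aggregate_without_aggregator(worker_grads: List[np.ndarray]) -> List[np.ndarray]:
    n = len(worker_grads)
    # Each worker's gradient is split into n chunks; worker c will own chunk c.
    chunks = [np.array_split(g, n) for g in worker_grads]

    # "Reduce-scatter": worker c sums chunk c across all workers. Because
    # addition is associative, the order in which contributions are added
    # does not change the result.
    reduced = [sum(chunks[w][c] for w in range(n)) for c in range(n)]

    # "All-gather": every worker reassembles the fully aggregated gradient.
    aggregated = np.concatenate(reduced)
    return [aggregated.copy() for _ in range(n)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=10).astype(np.float32) for _ in range(4)]
    out = aggregate_without_aggregator(grads)
    # Matches what a central aggregator would have computed.
    assert np.allclose(out[0], np.sum(grads, axis=0))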




Published In

MICRO-51: Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture
October 2018
1015 pages
ISBN: 9781538662403


Publisher

IEEE Press



Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)


Cited By

  • (2024) LASER. Proceedings of the 41st International Conference on Machine Learning, pp. 34383-34416. DOI: 10.5555/3692070.3693469. Online publication date: 21-Jul-2024.
  • (2024) Towards domain-specific network transport for distributed DNN training. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 1421-1443. DOI: 10.5555/3691825.3691904. Online publication date: 16-Apr-2024.
  • (2022) Revisiting the Classics: Online RL in the Programmable Dataplane. NOMS 2022 IEEE/IFIP Network Operations and Management Symposium, pp. 1-10. DOI: 10.1109/NOMS54207.2022.9789930. Online publication date: 25-Apr-2022.
  • (2021) AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 183-198. DOI: 10.1145/3466752.3480129. Online publication date: 18-Oct-2021.
  • (2021) PMNet. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 804-817. DOI: 10.1109/ISCA52012.2021.00068. Online publication date: 14-Jun-2021.
  • (2021) Enabling compute-communication overlap in distributed deep learning training platforms. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 540-553. DOI: 10.1109/ISCA52012.2021.00049. Online publication date: 14-Jun-2021.
  • (2020) Think fast. Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, pp. 145-158. DOI: 10.1109/ISCA45697.2020.00023. Online publication date: 30-May-2020.
  • (2019) TensorDIMM. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 740-753. DOI: 10.1145/3352460.3358284. Online publication date: 12-Oct-2019.
  • (2019) Dynamic Multi-Resolution Data Storage. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 196-210. DOI: 10.1145/3352460.3358282. Online publication date: 12-Oct-2019.
  • (2019) Sparse Tensor Core. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 359-371. DOI: 10.1145/3352460.3358269. Online publication date: 12-Oct-2019.
