Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

Published: 18 June 2016

Abstract

Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy.
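
As a rough, illustrative aside (not taken from the paper), the plain-Python loop nest below spells out one convolutional layer using the common dimension names N (batch), M (filters/output channels), C (input channels), H x W (output feature map), and R x S (filter size); the comments mark how often each weight, input activation, and partial sum is touched, which is exactly the reuse a dataflow tries to keep local. All sizes are made-up toy values.

```python
# Illustrative sketch only: the naive loop nest of a single CONV layer,
# written out to show where data reuse (and hence data movement) arises.
import numpy as np

N, M, C, H, W, R, S = 2, 4, 3, 8, 8, 3, 3             # toy layer shape
ifmap = np.random.rand(N, C, H + R - 1, W + S - 1)    # input activations
filt  = np.random.rand(M, C, R, S)                    # filter weights
ofmap = np.zeros((N, M, H, W))                        # outputs / partial sums

for n in range(N):                 # each image in the batch
    for m in range(M):             # each filter -> one output channel
        for c in range(C):         # each input channel
            for y in range(H):
                for x in range(W):
                    for i in range(R):
                        for j in range(S):
                            # Each weight filt[m, c, i, j] is reused N*H*W times,
                            # each input pixel is reused up to M*R*S times, and
                            # each output accumulates C*R*S partial sums --
                            # reuse that a good dataflow keeps near the PEs.
                            ofmap[n, m, y, x] += (
                                ifmap[n, c, y + i, x + j] * filt[m, c, i, j]
                            )
```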
In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.
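
For intuition about the comparison methodology, the sketch below models a dataflow's energy in the spirit described above: data accesses at each storage level, weighted by a normalized per-access energy cost, plus the MAC operations themselves. The level names, cost ratios, and access counts are hypothetical placeholders, not the calibrated values of the paper's analysis framework.

```python
# Hypothetical normalized energy costs per access, relative to one MAC.
# These ratios are illustrative only; the paper derives its own calibrated numbers.
NORMALIZED_COST = {
    "DRAM": 200.0,
    "global_buffer": 6.0,
    "inter_PE": 2.0,
    "PE_register": 1.0,
}
MAC_COST = 1.0

def dataflow_energy(access_counts, num_macs):
    """Normalized energy: MACs plus data accesses weighted by per-level cost."""
    energy = num_macs * MAC_COST
    for level, count in access_counts.items():
        energy += count * NORMALIZED_COST[level]
    return energy

# Toy comparison of two made-up access profiles for the same layer: one keeps
# most reuse in PE-local registers, the other spills it to the global buffer.
macs = 10_000_000
local_reuse  = {"DRAM": 50_000, "global_buffer": 400_000,
                "inter_PE": 2_000_000, "PE_register": 30_000_000}
buffer_heavy = {"DRAM": 50_000, "global_buffer": 15_000_000,
                "inter_PE": 0, "PE_register": 10_000_000}

print(dataflow_energy(local_reuse, macs))   # lower: reuse captured near the PEs
print(dataflow_energy(buffer_heavy, macs))  # higher: same traffic hits costlier storage
```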

      Published In

      ACM SIGARCH Computer Architecture News, Volume 44, Issue 3 (ISCA'16), June 2016, 730 pages
      ISSN: 0163-5964
      DOI: 10.1145/3007787
      • ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture, June 2016, 756 pages, ISBN: 9781467389471

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 June 2016
      Published in SIGARCH Volume 44, Issue 3


      Cited By

      • Leveraging Bit-Serial Architectures for Hardware-Oriented Deep Learning Accelerators with Column-Buffering Dataflow. Electronics, 13(7):1217, 2024. DOI: 10.3390/electronics13071217
      • Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering. Electronics, 13(5):975, 2024. DOI: 10.3390/electronics13050975
      • Design of a Generic Dynamically Reconfigurable Convolutional Neural Network Accelerator with Optimal Balance. Electronics, 13(4):761, 2024. DOI: 10.3390/electronics13040761
      • Stable Low-Rank CP Decomposition for Compression of Convolutional Neural Networks Based on Sensitivity. Applied Sciences, 14(4):1491, 2024. DOI: 10.3390/app14041491
      • Neural network methods for radiation detectors and imaging. Frontiers in Physics, 12, 2024. DOI: 10.3389/fphy.2024.1334298
      • ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6, 2024. DOI: 10.23919/DATE58400.2024.10546565
      • Scratchpad Memory Management for Deep Learning Accelerators. Proceedings of the 53rd International Conference on Parallel Processing, pp. 629-639, 2024. DOI: 10.1145/3673038.3673115
      • Aries: A DNN Inference Scheduling Framework for Multi-core Accelerators. Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things, pp. 186-191, 2024. DOI: 10.1145/3670105.3670136
      • gem5-NVDLA: A Simulation Framework for Compiling, Scheduling and Architecture Evaluation on AI System-on-Chips. ACM Transactions on Design Automation of Electronic Systems, 2024. DOI: 10.1145/3661997
      • A Review on the emerging technology of TinyML. ACM Computing Surveys, 2024. DOI: 10.1145/3661820
