DOI: 10.1145/3352460.3358252

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach

Published: 12 October 2019

Abstract

The data partitioning and scheduling strategies that DNN accelerators use to leverage reuse and perform staging are collectively known as dataflow, and they directly impact an accelerator's performance and energy efficiency. An accelerator's microarchitecture dictates the dataflow(s) that can be employed to execute the layers of a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding of the choices and consequences of dataflow, and of tools and methodologies to help architects explore the co-optimization design space.
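To see why the dataflow choice matters for utilization, consider a hypothetical example (the numbers below are illustrative, not from the paper): spatially mapping a small layer dimension across a large processing-element array strands most of the array, while mapping a larger dimension keeps it busy. A minimal sketch:

```python
# Hypothetical illustration (numbers not from the paper): average PE
# utilization when one layer dimension is mapped spatially across a
# fixed-size processing-element (PE) array.
import math

NUM_PES = 256

def utilization(extent: int, pes: int = NUM_PES) -> float:
    """Fraction of PEs doing useful work over the full spatial sweep."""
    waves = math.ceil(extent / pes)  # how many passes the array needs
    return extent / (waves * pes)

# Mapping 64 output channels across 256 PEs leaves three quarters of
# the array idle; mapping 224 output rows does far better on the same
# hardware.
print(f"map K=64 : {utilization(64):.0%}")    # 25%
print(f"map Y=224: {utilization(224):.0%}")   # 88%
```

The same layer thus yields 25% versus roughly 88% average utilization depending solely on which dimension is parallelized.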
In this work, we first introduce a set of data-centric directives that concisely specify the DNN dataflow space in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of data reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), which estimates the cost-benefit tradeoffs of a dataflow, including execution time and energy efficiency, for a given DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration experiment that searches across 480M designs to identify 2.5M valid designs, at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
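To make the directive idea concrete, the sketch below models a dataflow as an ordered list of MAESTRO-style TemporalMap/SpatialMap entries and surfaces one form of reuse (sliding-window overlap). The layer extents, the particular dataflow, and the arithmetic are illustrative assumptions; this is not the MAESTRO cost model itself, which also accounts for other forms of reuse and the hardware needed to exploit them.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# a dataflow as an ordered list of data-centric directives, plus a
# first-order check for sliding-window (overlap) reuse.
import math
from dataclasses import dataclass

@dataclass
class Directive:
    kind: str    # "TemporalMap" or "SpatialMap"
    size: int    # elements of the dimension mapped per step
    offset: int  # elements advanced between consecutive steps
    dim: str     # loop dimension: K, C, Y, X, R, or S

# Extents of a hypothetical small convolution layer.
extents = {"K": 64, "C": 32, "Y": 28, "X": 28, "R": 3, "S": 3}

# One candidate dataflow, outermost directive first.
dataflow = [
    Directive("SpatialMap", 1, 1, "K"),   # output channels across PEs
    Directive("TemporalMap", 8, 8, "C"),  # input channels, tiles of 8
    Directive("TemporalMap", 3, 1, "Y"),  # window slides by 1: overlap
    Directive("TemporalMap", 3, 1, "X"),  # window slides by 1: overlap
    Directive("TemporalMap", 3, 3, "R"),
    Directive("TemporalMap", 3, 3, "S"),
]

def num_steps(d: Directive) -> int:
    """Steps needed for one directive to sweep its dimension."""
    return max(1, math.ceil((extents[d.dim] - d.size) / d.offset) + 1)

# Wherever offset < size, consecutive steps share (size - offset)
# elements along that dimension: data that need not be re-fetched.
for d in dataflow:
    overlap = d.size - d.offset
    if d.kind == "TemporalMap" and overlap > 0:
        print(f"{d.dim}: reuse {overlap} of {d.size} elements across "
              f"{num_steps(d)} steps")
```

Running it reports that consecutive steps along Y and X can each reuse 2 of 3 input elements, the kind of halo reuse a sliding convolution window exposes; the full analysis in the paper classifies such reuse per tensor and translates it into buffer and interconnect requirements.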

        Published In

        MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
        October 2019
        1104 pages
ISBN: 9781450369381
DOI: 10.1145/3352460
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Cost modeling
        2. Dataflow
        3. Neural networks

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate 484 of 2,242 submissions, 22%
