DOI: 10.1145/3352460.3358252

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach

Published: 12 October 2019

Abstract

The data partitioning and scheduling strategies that DNN accelerators use to leverage reuse and perform staging are collectively known as dataflow, and they directly impact an accelerator's performance and energy efficiency. An accelerator's microarchitecture dictates the dataflow(s) that can be employed to execute the layers of a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding of the choices and consequences of dataflow, and of tools and methodologies to help architects explore the co-optimization design space.
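To see why the dataflow choice matters for utilization, consider a hypothetical example (the numbers below are illustrative, not from the paper): spatially mapping a small layer dimension across a large processing-element array strands most of the array, while mapping a larger dimension keeps it busy. A minimal sketch:

```python
# Hypothetical illustration (numbers not from the paper): average PE
# utilization when one layer dimension is mapped spatially across a
# fixed-size processing-element (PE) array.
import math

NUM_PES = 256

def utilization(extent: int, pes: int = NUM_PES) -> float:
    """Fraction of PEs doing useful work over the full spatial sweep."""
    waves = math.ceil(extent / pes)  # how many passes the array needs
    return extent / (waves * pes)

# Mapping 64 output channels across 256 PEs leaves three quarters of
# the array idle; mapping 224 output rows does far better on the same
# hardware.
print(f"map K=64 : {utilization(64):.0%}")    # 25%
print(f"map Y=224: {utilization(224):.0%}")   # 88%
```

The same layer thus yields 25% versus roughly 88% average utilization depending solely on which dimension is parallelized.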
In this work, we first introduce a set of data-centric directives that concisely specify the DNN dataflow space in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of data reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), which estimates the cost-benefit tradeoffs of a dataflow, including execution time and energy efficiency, for a given DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration experiment that searches across 480M designs to identify 2.5M valid designs, at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
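To make the directive idea concrete, the sketch below models a dataflow as an ordered list of MAESTRO-style TemporalMap/SpatialMap entries and surfaces one form of reuse (sliding-window overlap). The layer extents, the particular dataflow, and the arithmetic are illustrative assumptions; this is not the MAESTRO cost model itself, which also accounts for other forms of reuse and the hardware needed to exploit them.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# a dataflow as an ordered list of data-centric directives, plus a
# first-order check for sliding-window (overlap) reuse.
import math
from dataclasses import dataclass

@dataclass
class Directive:
    kind: str    # "TemporalMap" or "SpatialMap"
    size: int    # elements of the dimension mapped per step
    offset: int  # elements advanced between consecutive steps
    dim: str     # loop dimension: K, C, Y, X, R, or S

# Extents of a hypothetical small convolution layer.
extents = {"K": 64, "C": 32, "Y": 28, "X": 28, "R": 3, "S": 3}

# One candidate dataflow, outermost directive first.
dataflow = [
    Directive("SpatialMap", 1, 1, "K"),   # output channels across PEs
    Directive("TemporalMap", 8, 8, "C"),  # input channels, tiles of 8
    Directive("TemporalMap", 3, 1, "Y"),  # window slides by 1: overlap
    Directive("TemporalMap", 3, 1, "X"),  # window slides by 1: overlap
    Directive("TemporalMap", 3, 3, "R"),
    Directive("TemporalMap", 3, 3, "S"),
]

def num_steps(d: Directive) -> int:
    """Steps needed for one directive to sweep its dimension."""
    return max(1, math.ceil((extents[d.dim] - d.size) / d.offset) + 1)

# Wherever offset < size, consecutive steps share (size - offset)
# elements along that dimension: data that need not be re-fetched.
for d in dataflow:
    overlap = d.size - d.offset
    if d.kind == "TemporalMap" and overlap > 0:
        print(f"{d.dim}: reuse {overlap} of {d.size} elements across "
              f"{num_steps(d)} steps")
```

Running it reports that consecutive steps along Y and X can each reuse 2 of 3 input elements, the kind of halo reuse a sliding convolution window exposes; the full analysis in the paper classifies such reuse per tensor and translates it into buffer and interconnect requirements.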

        Published In

        MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
        October 2019
        1104 pages
ISBN: 9781450369381
DOI: 10.1145/3352460
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Cost modeling
        2. Dataflow
        3. Neural networks

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate 484 of 2,242 submissions, 22%
