Research Article | Public Access

dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators

Published: 08 October 2019
    Abstract

    Dataflow accelerators feature simplicity, programmability, and energy efficiency, and are envisioned as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs have been proposed, discovering the most efficient way to execute the perfectly nested loop of an application on the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner, a framework to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of the dMazeRunner framework lies in: i) a holistic representation of the loop nests that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP, as compared to the optimal solution.
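    To make the notion of an "execution method" concrete, the sketch below shows a perfectly nested convolution loop (the kind of kernel the paper targets) and one of the many ways it could be executed: the output-channel loop mapped spatially across processing elements (PEs) and the output-column loop tiled, with the remaining loops running temporally. This is a minimal illustration only; the layer sizes, the 16-PE array, and the tile factor TW are arbitrary assumptions, and the C loops are not dMazeRunner's actual representation or cost model. dMazeRunner's search space consists of all such choices of loop splitting, ordering, and spatial/temporal assignment.

```c
/* Illustrative sketch: a perfectly nested convolution loop and one possible
 * "execution method" for it. All sizes below are hypothetical. */
#include <stdio.h>
#include <string.h>

#define N 1   /* batch           */
#define M 16  /* output channels */
#define C 3   /* input channels  */
#define H 8   /* output rows     */
#define W 8   /* output columns  */
#define R 3   /* filter rows     */
#define S 3   /* filter columns  */

static float in[N][C][H + R - 1][W + S - 1];
static float wt[M][C][R][S];
static float out[N][M][H][W];

/* The original perfectly nested loop: a single statement in the innermost body. */
static void conv_nested(void) {
    for (int n = 0; n < N; n++)
      for (int m = 0; m < M; m++)
        for (int c = 0; c < C; c++)
          for (int h = 0; h < H; h++)
            for (int w = 0; w < W; w++)
              for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                  out[n][m][h][w] += wt[m][c][r][s] * in[n][c][h + r][w + s];
}

/* One execution method (out of the many a mapper would enumerate):
 *  - the output-channel loop m runs spatially, one channel per PE (in hardware
 *    these iterations execute concurrently; here they are sequential, but the
 *    indexing shows which data each PE touches);
 *  - the output-column loop w is split by TW so a TW-wide slice can be staged
 *    in a PE-local buffer;
 *  - all remaining loops run temporally in the order shown.
 * PES and TW are assumptions for this sketch. */
#define PES 16
#define TW  4
static void conv_one_mapping(void) {
    for (int n = 0; n < N; n++)
      for (int wo = 0; wo < W / TW; wo++)             /* temporal: column tile  */
        for (int m = 0; m < PES; m++)                 /* spatial: one m per PE  */
          for (int c = 0; c < C; c++)
            for (int h = 0; h < H; h++)
              for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                  for (int wi = 0; wi < TW; wi++) {   /* temporal: inside tile  */
                    int w = wo * TW + wi;
                    out[n][m][h][w] += wt[m][c][r][s] * in[n][c][h + r][w + s];
                  }
}

int main(void) {
    /* Fill inputs with simple values so both versions produce a nonzero result. */
    for (int i = 0; i < (int)(sizeof in / sizeof(float)); i++)
        ((float *)in)[i] = (float)(i % 7);
    for (int i = 0; i < (int)(sizeof wt / sizeof(float)); i++)
        ((float *)wt)[i] = (float)(i % 5) * 0.1f;

    conv_nested();
    float ref = out[0][3][4][5];
    memset(out, 0, sizeof out);
    conv_one_mapping();
    /* Both orderings compute the same result; they differ only in how the
       iterations would be assigned to PEs and buffers on an accelerator. */
    printf("reference %.3f vs. mapped %.3f\n", ref, (double)0 + out[0][3][4][5]);
    return 0;
}
```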

    Published In

    ACM Transactions on Embedded Computing Systems, Volume 18, Issue 5s
    Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019
    October 2019
    1423 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3365919
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 October 2019
    Accepted: 01 July 2019
    Revised: 01 June 2019
    Received: 01 April 2019
    Published in TECS Volume 18, Issue 5s

    Author Tags

    1. Coarse-grained reconfigurable array
    2. analytical model
    3. dataflow
    4. deep neural networks
    5. design space exploration
    6. energy-efficiency
    7. loop optimization
    8. mapping
    9. systolic arrays

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (last 12 months): 626
    • Downloads (last 6 weeks): 47
    Reflects downloads up to 27 Jul 2024

    Cited By

    • (2024) Estimating Power, Performance, and Area for On-Sensor Deployment of AR/VR Workloads Using an Analytical Framework. ACM Transactions on Design Automation of Electronic Systems. DOI: 10.1145/3670404. Online publication date: 7-Jun-2024.
    • (2024) DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators. Proceedings of the 21st ACM International Conference on Computing Frontiers, 126-137. DOI: 10.1145/3649153.3649196. Online publication date: 7-May-2024.
    • (2024) TinyNS: Platform-aware Neurosymbolic Auto Tiny Machine Learning. ACM Transactions on Embedded Computing Systems 23(3), 1-48. DOI: 10.1145/3603171. Online publication date: 11-May-2024.
    • (2024) An ASIC Accelerator for QNN With Variable Precision and Tunable Energy Efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(7), 2057-2070. DOI: 10.1109/TCAD.2024.3357597. Online publication date: Jul-2024.
    • (2024) TensorMap: A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators. IEEE Transactions on Computers 73(8), 1899-1912. DOI: 10.1109/TC.2024.3398424. Online publication date: Aug-2024.
    • (2024) AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 143-157. DOI: 10.1109/CGO57630.2024.10444801. Online publication date: 2-Mar-2024.
    • (2024) Exact Scheduling to Minimize Off-Chip Data Movement for Deep Learning Accelerators. 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 908-914. DOI: 10.1109/ASP-DAC58780.2024.10473916. Online publication date: 22-Jan-2024.
    • (2024) HiEval: A scheduling performance estimation approach for spatial accelerators via hierarchical abstraction. Journal of Systems Architecture 148, 103079. DOI: 10.1016/j.sysarc.2024.103079. Online publication date: Mar-2024.
    • (2023) A unifying review of edge intelligent computing technique applications in the field of energy networks. Journal of Industrial and Management Optimization. DOI: 10.3934/jimo.2023027. Online publication date: 2023.
    • (2023) Explainable-DSE: An Agile and Explainable Exploration of Efficient HW/SW Codesigns of Deep Learning Accelerators Using Bottleneck Analysis. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, 87-107. DOI: 10.1145/3623278.3624772. Online publication date: 25-Mar-2023.
