Research Article | Public Access

dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators

Published: 08 October 2019
    Abstract

    Dataflow accelerators feature simplicity, programmability, and energy efficiency, and are envisioned as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs have been proposed, discovering the most efficient way to execute the perfectly nested loop of an application on the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner, a framework to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of the dMazeRunner framework lies in: i) a holistic representation of the loop nests that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP, as compared to the optimal solution.
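    To make the notion of an "execution method" concrete, the sketch below shows a perfectly nested convolution loop (the kind of kernel the paper targets) and one of the many ways it could be executed: the output-channel loop mapped spatially across processing elements (PEs) and the output-column loop tiled, with the remaining loops running temporally. This is a minimal illustration only; the layer sizes, the 16-PE array, and the tile factor TW are arbitrary assumptions, and the C loops are not dMazeRunner's actual representation or cost model. dMazeRunner's search space consists of all such choices of loop splitting, ordering, and spatial/temporal assignment.

```c
/* Illustrative sketch: a perfectly nested convolution loop and one possible
 * "execution method" for it. All sizes below are hypothetical. */
#include <stdio.h>
#include <string.h>

#define N 1   /* batch           */
#define M 16  /* output channels */
#define C 3   /* input channels  */
#define H 8   /* output rows     */
#define W 8   /* output columns  */
#define R 3   /* filter rows     */
#define S 3   /* filter columns  */

static float in[N][C][H + R - 1][W + S - 1];
static float wt[M][C][R][S];
static float out[N][M][H][W];

/* The original perfectly nested loop: a single statement in the innermost body. */
static void conv_nested(void) {
    for (int n = 0; n < N; n++)
      for (int m = 0; m < M; m++)
        for (int c = 0; c < C; c++)
          for (int h = 0; h < H; h++)
            for (int w = 0; w < W; w++)
              for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                  out[n][m][h][w] += wt[m][c][r][s] * in[n][c][h + r][w + s];
}

/* One execution method (out of the many a mapper would enumerate):
 *  - the output-channel loop m runs spatially, one channel per PE (in hardware
 *    these iterations execute concurrently; here they are sequential, but the
 *    indexing shows which data each PE touches);
 *  - the output-column loop w is split by TW so a TW-wide slice can be staged
 *    in a PE-local buffer;
 *  - all remaining loops run temporally in the order shown.
 * PES and TW are assumptions for this sketch. */
#define PES 16
#define TW  4
static void conv_one_mapping(void) {
    for (int n = 0; n < N; n++)
      for (int wo = 0; wo < W / TW; wo++)             /* temporal: column tile  */
        for (int m = 0; m < PES; m++)                 /* spatial: one m per PE  */
          for (int c = 0; c < C; c++)
            for (int h = 0; h < H; h++)
              for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                  for (int wi = 0; wi < TW; wi++) {   /* temporal: inside tile  */
                    int w = wo * TW + wi;
                    out[n][m][h][w] += wt[m][c][r][s] * in[n][c][h + r][w + s];
                  }
}

int main(void) {
    /* Fill inputs with simple values so both versions produce a nonzero result. */
    for (int i = 0; i < (int)(sizeof in / sizeof(float)); i++)
        ((float *)in)[i] = (float)(i % 7);
    for (int i = 0; i < (int)(sizeof wt / sizeof(float)); i++)
        ((float *)wt)[i] = (float)(i % 5) * 0.1f;

    conv_nested();
    float ref = out[0][3][4][5];
    memset(out, 0, sizeof out);
    conv_one_mapping();
    /* Both orderings compute the same result; they differ only in how the
       iterations would be assigned to PEs and buffers on an accelerator. */
    printf("reference %.3f vs. mapped %.3f\n", ref, (double)0 + out[0][3][4][5]);
    return 0;
}
```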

    Published In

    ACM Transactions on Embedded Computing Systems, Volume 18, Issue 5s
    Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019
    October 2019
    1423 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3365919
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 October 2019
    Accepted: 01 July 2019
    Revised: 01 June 2019
    Received: 01 April 2019
    Published in TECS Volume 18, Issue 5s

    Author Tags

    1. Coarse-grained reconfigurable array
    2. analytical model
    3. dataflow
    4. deep neural networks
    5. design space exploration
    6. energy-efficiency
    7. loop optimization
    8. mapping
    9. systolic arrays

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (last 12 months): 626
    • Downloads (last 6 weeks): 47
    Reflects downloads up to 27 Jul 2024

    Cited By

    • (2024) Estimating Power, Performance, and Area for On-Sensor Deployment of AR/VR Workloads Using an Analytical Framework. ACM Transactions on Design Automation of Electronic Systems. DOI: 10.1145/3670404. Online publication date: 7-Jun-2024.
    • (2024) DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators. Proceedings of the 21st ACM International Conference on Computing Frontiers, 126-137. DOI: 10.1145/3649153.3649196. Online publication date: 7-May-2024.
    • (2024) TinyNS: Platform-aware Neurosymbolic Auto Tiny Machine Learning. ACM Transactions on Embedded Computing Systems 23(3), 1-48. DOI: 10.1145/3603171. Online publication date: 11-May-2024.
    • (2024) An ASIC Accelerator for QNN With Variable Precision and Tunable Energy Efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(7), 2057-2070. DOI: 10.1109/TCAD.2024.3357597. Online publication date: Jul-2024.
    • (2024) TensorMap: A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators. IEEE Transactions on Computers 73(8), 1899-1912. DOI: 10.1109/TC.2024.3398424. Online publication date: Aug-2024.
    • (2024) AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 143-157. DOI: 10.1109/CGO57630.2024.10444801. Online publication date: 2-Mar-2024.
    • (2024) Exact Scheduling to Minimize Off-Chip Data Movement for Deep Learning Accelerators. 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 908-914. DOI: 10.1109/ASP-DAC58780.2024.10473916. Online publication date: 22-Jan-2024.
    • (2024) HiEval: A scheduling performance estimation approach for spatial accelerators via hierarchical abstraction. Journal of Systems Architecture 148, 103079. DOI: 10.1016/j.sysarc.2024.103079. Online publication date: Mar-2024.
    • (2023) A unifying review of edge intelligent computing technique applications in the field of energy networks. Journal of Industrial and Management Optimization. DOI: 10.3934/jimo.2023027. Online publication date: 2023.
    • (2023) Explainable-DSE: An Agile and Explainable Exploration of Efficient HW/SW Codesigns of Deep Learning Accelerators Using Bottleneck Analysis. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, 87-107. DOI: 10.1145/3623278.3624772. Online publication date: 25-Mar-2023.
