Heterogeneous Scheduling of Deep Neural Networks for Low-power Real-time Designs

Published: 16 December 2019

Abstract

Deep neural networks have become the readiest answer to a range of application challenges, including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection, all while outperforming prior leading solutions that relied heavily on hand-engineered techniques. However, deploying these networks is often computationally expensive and memory-intensive. These requirements make it challenging to deploy Deep Neural Networks (DNNs) in embedded, real-time, low-power applications, where classic architectures such as GPUs and CPUs still impose a significant power burden. Systems-on-Chip (SoCs) with Field-Programmable Gate Arrays (FPGAs) can improve performance and allow finer-grained control of resources than CPUs or GPUs, but it is difficult to find the optimal balance between hardware and software to improve DNN efficiency. The current research literature proposes few solutions for optimizing hardware and software deployments of DNNs in embedded low-power systems. To address the computational resource restrictions and low-power needs of deploying these networks, we describe and implement a domain-specific metric model for optimizing task deployment across differing hardware and software platforms. Next, we propose a DNN hardware accelerator called Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet) that includes multithreaded software workers. Finally, we propose a heterogeneous-aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate each task to a resource by solving a numerical cost for a series of domain objectives. To demonstrate the applicability of our contribution, we deploy nine modern deep network architectures, each containing a different number of parameters, within the context of two different neural network applications: image processing and biomedical seizure detection.
Using the metric modeling techniques integrated into the heterogeneous-aware scheduler and the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power by providing an optimized task-to-resource allocation. Our heterogeneous-aware scheduler decreases power consumption by 10% of the total system power, does not affect the accuracy of the networks, and still meets the real-time deadlines. We demonstrate energy efficiency at parity with or exceeding that of NVIDIA GPUs when evaluated against the Jetson TK1 embedded GPU SoC, with a 4× power savings in a power envelope of 2.0 W. Compared to existing FPGA-based accelerators, SCALENet’s accelerator and heterogeneous-aware scheduler achieve a 4× improvement in energy efficiency.
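
The abstract describes allocating a task to a resource by solving a numerical cost over several objectives while still meeting real-time deadlines. The following is a minimal sketch of that general idea only, not the paper's metric model: the `Resource` fields, the energy-plus-latency cost, the weights, and every number below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resource:
    """Hypothetical per-task estimates for one compute resource."""
    name: str
    latency_ms: float  # assumed predicted time to finish the task
    power_w: float     # assumed average power drawn while running it


def cost(res: Resource, deadline_ms: float,
         w_energy: float = 1.0, w_latency: float = 0.0) -> float:
    """Weighted cost of running the task on `res`.

    A resource that would miss the deadline gets infinite cost, so it
    can never be selected. Energy is W * ms, i.e. millijoules.
    """
    if res.latency_ms > deadline_ms:
        return float("inf")
    energy_mj = res.power_w * res.latency_ms
    return w_energy * energy_mj + w_latency * res.latency_ms


def allocate(resources: Sequence[Resource], deadline_ms: float,
             w_energy: float = 1.0, w_latency: float = 0.0) -> Optional[Resource]:
    """Return the lowest-cost resource that meets the deadline, or None."""
    best = None
    best_cost = float("inf")
    for res in resources:
        c = cost(res, deadline_ms, w_energy, w_latency)
        if c < best_cost:
            best, best_cost = res, c
    return best


# Made-up estimates for a software worker and a hardware accelerator.
candidates = [
    Resource("cpu_worker", latency_ms=30.0, power_w=0.5),  # 15 mJ
    Resource("fpga_accel", latency_ms=10.0, power_w=2.0),  # 20 mJ
]
```

With these made-up numbers, a 33 ms deadline selects `cpu_worker` because it uses less energy, a 20 ms deadline forces the task onto `fpga_accel`, and a 5 ms deadline cannot be met by either resource, so `allocate` returns `None`. Treating the deadline as a hard constraint and energy as the quantity to minimize is one simple design choice; other objectives could be folded in as additional weighted terms.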

Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 15, Issue 4
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
October 2019
226 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3365594
  • Editor: Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 16 December 2019
Accepted: 01 August 2019
Revised: 01 May 2019
Received: 01 August 2018
Published in JETC Volume 15, Issue 4

Author Tags

  1. FPGA
  2. Machine learning
  3. co-design
  4. hardware
  5. real-time
  6. scheduling
  7. software

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 59
  • Downloads (last 6 weeks): 4
Reflects downloads up to 01 Sep 2024

Cited By

  • (2023) Energy-Efficient Approximate Edge Inference Systems. ACM Transactions on Embedded Computing Systems 22:4 (1-50). DOI: 10.1145/3589766. Online publication date: 31-Mar-2023
  • (2023) PyTorch and CEDR: Enabling Deployment of Machine Learning Models on Heterogeneous Computing Systems. 2023 20th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) (1-8). DOI: 10.1109/AICCSA59173.2023.10479315. Online publication date: 4-Dec-2023
  • (2021) Real-Time Hybrid Flow Shop Scheduling Approach in Smart Manufacturing Environment. Complex System Modeling and Simulation 1:4 (335-350). DOI: 10.23919/CSMS.2021.0024. Online publication date: Dec-2021
  • (2021) Binary Precision Neural Network Manycore Accelerator. ACM Journal on Emerging Technologies in Computing Systems 17:2 (1-27). DOI: 10.1145/3423136. Online publication date: 5-Apr-2021
  • (2021) Resource and Performance Estimation for CNN Models using Machine Learning. 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (43-48). DOI: 10.1109/ISVLSI51109.2021.00019. Online publication date: Jul-2021
  • (2021) ALOHA: A Unified Platform-Aware Evaluation Method for CNNs Execution on Heterogeneous Systems at the Edge. IEEE Access 9 (133289-133308). DOI: 10.1109/ACCESS.2021.3115243. Online publication date: 2021
  • (2020) Parallel DNN Inference Framework Leveraging a Compact RISC-V ISA-based Multi-core System. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (627-635). DOI: 10.1145/3394486.3403105. Online publication date: 23-Aug-2020
  • (2020) EncoDeep. ACM Transactions on Embedded Computing Systems 19:6 (1-29). DOI: 10.1145/3391901. Online publication date: 29-Sep-2020
  • (2020) A Low-Power LSTM Processor for Multi-Channel Brain EEG Artifact Detection. 2020 21st International Symposium on Quality Electronic Design (ISQED) (105-110). DOI: 10.1109/ISQED48828.2020.9137056. Online publication date: Mar-2020
  • (2020) CSCMAC - Cyclic Sparsely Connected Neural Network Manycore Accelerator. 2020 21st International Symposium on Quality Electronic Design (ISQED) (311-316). DOI: 10.1109/ISQED48828.2020.9137013. Online publication date: Mar-2020
