research-article

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Authors:

Xiaoming Xiong

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 6

Article No.: 90, Pages 1 - 20

https://doi.org/10.1145/3530818

Published: 09 November 2023

Abstract

Deep convolutional neural networks (DNNs) are widely used in many applications, particularly in machine vision. Accelerating DNNs on embedded systems is challenging because real-world machine vision applications must reserve much of the external memory bandwidth for other tasks, such as video capture and display, leaving little bandwidth for DNN acceleration. To address this issue, we propose a high-throughput accelerator for bandwidth-limited systems, called the reconfigurable tiny neural network accelerator (ReTiNNA), and present a real-time object detection system for high-resolution video. We first present a dedicated computation engine that applies different data mapping methods to different filter types to improve data reuse and reduce hardware resource usage. We then propose an adaptive layer-wise tiling strategy that tiles the feature maps into strips, which dramatically reduces the control complexity of data transmission and improves its efficiency. Finally, we present a design space exploration (DSE) approach that explores the design space more accurately when bandwidth is insufficient, improving the performance of the low-bandwidth accelerator. With a bandwidth of only 2.23 GB/s and a hardware cost of only 90.261K LUTs and 448 DSPs, ReTiNNA achieves 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, which is better than other state-of-the-art designs implemented on FPGA devices. Furthermore, the real-time object detection system achieves a detection speed of 19 fps on high-resolution video.
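As a rough illustration of two ideas named in the abstract, the sketch below splits a feature map into horizontal strips and bounds throughput by external memory bandwidth. It is not the ReTiNNA implementation: the strip splitter assumes an unpadded ("valid") convolution, the bound is the generic roofline model rather than the paper's DSE approach, and the function names, the 200 GOPS peak, and the operational-intensity values are hypothetical. Only the 2.23 GB/s bandwidth comes from the abstract.

```python
from typing import List, Tuple


def strip_ranges(out_height: int, strip_rows: int, kernel: int,
                 stride: int = 1) -> List[Tuple[int, int, int, int]]:
    """Split the output rows of a convolution layer into horizontal strips.

    Returns (out_start, out_end, in_start, in_end) per strip, with exclusive
    end indices. Assumes an unpadded ("valid") convolution, so neighbouring
    strips overlap by (kernel - stride) input rows.
    """
    ranges = []
    for out_start in range(0, out_height, strip_rows):
        out_end = min(out_start + strip_rows, out_height)
        in_start = out_start * stride
        in_end = (out_end - 1) * stride + kernel
        ranges.append((out_start, out_end, in_start, in_end))
    return ranges


def attainable_gops(peak_gops: float, bandwidth_gbs: float,
                    ops_per_byte: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


if __name__ == "__main__":
    # Hypothetical layer: 6 output rows, 3x3 kernel, strips of 4 output rows.
    for strip in strip_ranges(out_height=6, strip_rows=4, kernel=3):
        print(strip)
    # Hypothetical 200 GOPS peak at the abstract's 2.23 GB/s bandwidth.
    for intensity in (10.0, 50.0, 100.0):
        print(intensity, attainable_gops(200.0, 2.23, intensity))
```

Under these assumptions, a layer only reaches the compute peak when each byte fetched from external memory is reused for enough operations, which is why data reuse and tiling choices matter more as bandwidth shrinks.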

Cited By

Li Y, Wang X, Zhang H, Pan B, Qiu K, Kang W, Wang J, Zhao W (2024). Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems 23:3 (1-24). Online publication date: 25-Apr-2024.
https://dl.acm.org/doi/10.1145/3650729
Hu X, Liu X, Liu Y, Zhang H, Huang X, Guan X, Liang L, Tsui C, Xiong X, Cheng K (2023). A Tiny Accelerator for Mixed-Bit Sparse CNN Based on Efficient Fetch Method of SIMO SPad. IEEE Transactions on Circuits and Systems II: Express Briefs 70:8 (3079-3083). Online publication date: Aug-2023.
https://doi.org/10.1109/TCSII.2023.3257298
Zendrikov D, Solinas S, Indiveri G (2023). Brain-inspired methods for achieving robust computation in heterogeneous mixed-signal neuromorphic processing systems. Neuromorphic Computing and Engineering 3:3 (034002). Online publication date: 25-Jul-2023.
https://doi.org/10.1088/2634-4386/ace64c

Index Terms

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System
1. Hardware
  1. Emerging technologies
    1. Biology-related information processing
      1. Neural systems
  2. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Published In

ACM Transactions on Embedded Computing Systems Volume 22, Issue 6

November 2023

428 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3632298

Editor:
Tulika Mitra
National University of Singapore, Singapore

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 09 November 2023

Online AM: 02 May 2022

Accepted: 06 April 2022

Revised: 08 March 2022

Received: 11 October 2021

Published in TECS Volume 22, Issue 6

Qualifiers

Research-article

Funding Sources

Key-Area Research and Development Program of Guangdong Province
