research-article

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Published: 09 November 2023
    Abstract

    Deep convolutional neural networks (DNNs) are widely used in many applications, particularly in machine vision. Accelerating DNNs on embedded systems is challenging because real-world machine vision applications must reserve much of the external memory bandwidth for other tasks, such as video capture and display, leaving little bandwidth for DNN acceleration. To address this issue, we propose a high-throughput accelerator for bandwidth-limited systems, called the reconfigurable tiny neural network accelerator (ReTiNNA), and present a real-time object detection system for high-resolution video. We first present a dedicated computation engine that applies different data mapping methods to different filter types to improve data reuse and reduce hardware resources. We then propose an adaptive layer-wise tiling strategy that tiles the feature maps into strips, which dramatically reduces the control complexity of data transmission and improves its efficiency. Finally, we present a design space exploration (DSE) approach that explores the design space more accurately when bandwidth is insufficient, improving the performance of the low-bandwidth accelerator. With a low bandwidth of 2.23 GB/s and low hardware consumption of 90.261K LUTs and 448 DSPs, ReTiNNA achieves 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, which is better than other state-of-the-art designs implemented on FPGA devices. Furthermore, the real-time object detection system achieves a detection speed of 19 fps on high-resolution video.
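
    The strip tiling mentioned in the abstract can be illustrated with a minimal sketch. This is not the ReTiNNA algorithm itself, whose details are not given on this page. It only shows the general idea of cutting a convolution layer's feature map into horizontal strips and working out which input rows each strip has to fetch. The names `Strip`, `tile_into_strips`, and `max_out_rows` are hypothetical, and the strip height is assumed to be chosen by the caller rather than by an adaptive, layer-wise policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Strip:
    """Row ranges for one horizontal strip (hypothetical helper type)."""
    out_start: int  # first output row produced by this strip (inclusive)
    out_end: int    # end of the output rows of this strip (exclusive)
    in_start: int   # first input row to fetch (inclusive, clipped to the map)
    in_end: int     # end of the input rows to fetch (exclusive, clipped)


def tile_into_strips(in_height: int, kernel: int, stride: int, pad: int,
                     max_out_rows: int) -> List[Strip]:
    """Split the output rows of a convolution layer into horizontal strips.

    Assumption: each strip spans the full width of the feature map, so only
    row ranges are tracked. Rows that fall into the zero padding are clipped
    because they do not have to be fetched from external memory.
    """
    out_height = (in_height + 2 * pad - kernel) // stride + 1
    strips = []
    for out_start in range(0, out_height, max_out_rows):
        out_end = min(out_start + max_out_rows, out_height)
        # Output row r reads input rows [r*stride - pad, r*stride - pad + kernel).
        in_start = max(out_start * stride - pad, 0)
        in_end = min((out_end - 1) * stride - pad + kernel, in_height)
        strips.append(Strip(out_start, out_end, in_start, in_end))
    return strips


if __name__ == "__main__":
    # A 3x3, stride-1, pad-1 convolution on a 224-row map, cut into 56-row strips.
    for strip in tile_into_strips(224, kernel=3, stride=1, pad=1, max_out_rows=56):
        print(strip)
```

    In this sketch, adjacent strips share `kernel - stride` input rows, which is the data that has to be re-fetched or buffered at each strip boundary. Taller strips lower that overhead but need more on-chip buffer, and a design space exploration would have to balance the two against the available bandwidth.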

    Cited By

    • (2024) Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems 23:3 (1-24). DOI: 10.1145/3650729. Online publication date: 25-Apr-2024
    • (2023) A Tiny Accelerator for Mixed-Bit Sparse CNN Based on Efficient Fetch Method of SIMO SPad. IEEE Transactions on Circuits and Systems II: Express Briefs 70:8 (3079-3083). DOI: 10.1109/TCSII.2023.3257298. Online publication date: Aug-2023
    • (2023) Brain-inspired methods for achieving robust computation in heterogeneous mixed-signal neuromorphic processing systems. Neuromorphic Computing and Engineering 3:3 (034002). DOI: 10.1088/2634-4386/ace64c. Online publication date: 25-Jul-2023

    Published In

    ACM Transactions on Embedded Computing Systems  Volume 22, Issue 6
    November 2023
    428 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3632298
    Editor: Tulika Mitra

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 November 2023
    Online AM: 02 May 2022
    Accepted: 06 April 2022
    Revised: 08 March 2022
    Received: 11 October 2021
    Published in TECS Volume 22, Issue 6

    Author Tags

    1. Convolutional neural networks
    2. reconfigurable
    3. accelerator
    4. real-time object detection system
    5. design space exploration

    Qualifiers

    • Research-article

    Funding Sources

    • Key-Area Research and Development Program of Guangdong Province

    Article Metrics

    • Downloads (last 12 months): 394
    • Downloads (last 6 weeks): 14
    Reflects downloads up to 26 Jul 2024
