research-article

High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design

Published: 15 January 2024

    Abstract

    Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
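    For context, the sequential dependency described above can be seen in a conventional greedy NMS routine. The sketch below is not the pipelined algorithm proposed in the paper; it is a minimal, class-agnostic baseline written for illustration. The names `iou` and `greedy_nms`, the corner-format boxes, and the 0.5 IoU threshold are all assumptions, not details taken from the article.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # assumed value, not from the paper
) -> List[int]:
    """Return indices of the boxes that survive suppression.

    The global sort below is the step that needs every prediction to be
    available first, which is why this style of NMS cannot start until
    the detector has produced all of its boxes.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    example_scores = [0.9, 0.8, 0.7]
    print(greedy_nms(example_boxes, example_scores))
```

    In this example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box is disjoint and kept. Practical SSD post-processing usually applies this per class and after score filtering; both are omitted here for brevity.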



        Published In

        ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 1
        March 2024
        446 pages
        ISSN: 1936-7406
        EISSN: 1936-7414
        DOI: 10.1145/3613534
        Editor: Deming Chen

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 15 January 2024
        Online AM: 04 December 2023
        Accepted: 16 November 2023
        Revised: 27 September 2023
        Received: 27 May 2023
        Published in TRETS Volume 17, Issue 1


        Author Tags

        1. FPGA accelerator
        2. object detection
        3. algorithm-hardware co-design
        4. neural networks

        Qualifiers

        • Research-article

        Funding Sources

        • NSF
        • Intel ISRA program on FPGA
        • Intel/VMware Crossroads 3D-FPGA Academic Research Center
        • Intel/NSERC Industrial Research Chair in Programmable Silicon
        • Vector Institute for Artificial Intelligence
