Towards High Accuracy Pedestrian Detection on Edge GPUs
Abstract
1. Introduction
- We propose the TA module to improve pedestrian feature extraction. It builds on the convolutional block attention module (CBAM) [11] and dynamically adjusts the size k of the channel neighborhood over which attention is computed, strengthening the extraction of pedestrian features. We then integrate TA into the backbone network to increase the network's attention to the visible regions of pedestrians (an illustrative attention sketch follows this list).
- Based on prior knowledge of pedestrian aspect ratios, dilated convolutions of different sizes are designed to replace the pooling layers in SPP. The resulting PFM widens the network, enhancing its ability to extract multi-scale pedestrian features and thereby improving detection accuracy (a dilated-convolution sketch follows the list).
- To preserve detection speed, the down-sampling path of PAN [12] is dropped during feature fusion, and a ghost module incorporating the TA mechanism is used to build a one-way multi-scale feature-fusion structure that keeps the model lightweight (a ghost-module sketch follows the list).
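The paper provides no code for TA. As a rough illustration of the dynamic k-neighborhood idea, here is a minimal channel-attention sketch in PyTorch, in the spirit of ECA-Net [26], where the interaction range k is derived from the channel count; the class name and hyperparameters are ours, and this is a sketch of the idea rather than the authors' TA module.

```python
import math
import torch
import torch.nn as nn

class DynamicKChannelAttention(nn.Module):
    """Channel attention whose interaction range k adapts to the channel
    count, in the spirit of ECA-Net [26]; an illustrative stand-in for TA."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # k grows logarithmically with the number of channels, forced odd.
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pool(x)                              # B x C x 1 x 1
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # 1-D conv across C
        y = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y.expand_as(x)                     # reweight channels
```

With 64 input channels this yields k = 3, and with 256 channels k = 5, so wider layers attend over larger channel neighborhoods.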
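Similarly, a minimal sketch of the PFM idea: SPP's max-pooling branches are swapped for parallel dilated 3x3 convolutions whose outputs are concatenated and fused back to the input width. The dilation rates below are illustrative assumptions; the paper sizes its branches from pedestrian aspect-ratio priors.

```python
import torch
import torch.nn as nn

class DilatedPFMSketch(nn.Module):
    """Parallel dilated 3x3 branches in place of SPP max pooling,
    concatenated and fused back to the input width with a 1x1 conv."""
    def __init__(self, channels: int, rates=(1, 2, 3, 5)):
        super().__init__()
        # padding == dilation keeps the spatial size for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False)
            for r in rates])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```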
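Finally, the ghost module follows GhostNet [23]: a small number of ordinary convolutions produce primary features, and cheap depthwise convolutions generate the remaining "ghost" maps. A plain sketch without the TA attention the paper adds on top (out_ch is assumed divisible by ratio):

```python
import torch
import torch.nn as nn

class GhostModuleSketch(nn.Module):
    """GhostNet-style block [23]: primary 1x1 conv features plus cheap
    depthwise 3x3 'ghost' features, concatenated along channels."""
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2):
        super().__init__()
        primary = out_ch // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        # Depthwise conv: each primary map spawns its ghost counterpart.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary, out_ch - primary, 3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(out_ch - primary), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```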
2. Related Work
2.1. Optimization of Network Structure
2.2. Model Lightweighting
3. Method
3.1. Overall Architecture
3.2. Precision Optimization
3.2.1. CSPDarkNet53 Network
3.2.2. TA Module
3.2.3. PFM Pedestrian Area Feature Extraction Module
3.3. Model Lightweighting
3.3.1. Ghost Module Incorporating TA
3.3.2. One-Way Multi-Scale Feature Fusion Structure Design
4. Experiments
- Dataset. We use the outdoor pedestrian detection benchmark WiderPerson [28] and the CityPersons dataset [29]. WiderPerson contains five types of annotations: pedestrians, riders, partially-visible persons, ignored regions, and crowds; we use the first three. The original dataset contains 13,382 images, of which 4382 unlabeled images form the official test set. We re-divide the remaining labeled images into training, validation, and test sets at a ratio of 7:1:2, giving 6300, 900, and 1800 images, respectively (a split sketch follows this list). CityPersons provides 2975 training, 500 validation, and 1525 test images. Because the test-set annotations are not public, we evaluate on the 500 validation images and re-split the 2975 training images into training and validation sets at a 9:1 ratio. Since this dataset is small, we only evaluate YOLOv4-TP (not the lightweight version) and compare it with YOLOv4.
- Parameter settings. The initial learning rate is 1 × 10⁻³, the batch size is 8, and training runs for 300 epochs. When validation accuracy does not improve for 10 epochs, the learning rate is halved (see the schedule sketch below).
- Evaluation metrics. Precision measures the probability that a predicted positive is truly a positive; recall measures the model's ability to find all positives. Since neither alone characterizes a detector well, the two are combined into the F1 score; the higher the F1 score, the better the model. The Average Precision (AP) of a single category is the area enclosed by its P–R curve and the axes, and evaluates pedestrian-detection performance. AP50 and AP75 denote AP at IoU thresholds of 0.5 and 0.75, respectively. The strict AP is obtained by computing AP at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averaging the results; we use this strict AP as the average-precision index (the metric sketch after this list makes the computation explicit).
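The 7:1:2 re-division described above is easy to reproduce. A minimal sketch, assuming the 9000 labeled WiderPerson image IDs are available as a list; the function name and seed are ours:

```python
import random

def split_dataset(image_ids, seed=0):
    """Shuffle and divide labeled image IDs 7:1:2 into train/val/test.
    For the 9000 labeled WiderPerson images this yields 6300/900/1800."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_dataset(range(9000))
assert (len(train_ids), len(val_ids), len(test_ids)) == (6300, 900, 1800)
```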
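The paper does not name a training framework; assuming PyTorch, the schedule above maps directly onto ReduceLROnPlateau. The stand-in model and accuracy value below are placeholders, not the authors' training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder network standing in for YOLOv4-TP
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # initial LR 1e-3
# Halve the LR when validation accuracy has not improved for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=10)

for epoch in range(300):   # 300 training epochs, batch size 8 in the paper
    val_acc = 0.0          # placeholder for the real validation pass
    scheduler.step(val_acc)  # decays the LR on a 10-epoch plateau
```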
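As a concrete reading of the metric definitions, here is a minimal sketch of the F1 score and the strict AP; ap_at_iou is a hypothetical callable that returns the P–R-curve AP at a given IoU threshold:

```python
import numpy as np

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def strict_ap(ap_at_iou) -> float:
    """Mean of the per-threshold APs at IoU 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.50, 1.00, 0.05)  # ten thresholds
    return float(np.mean([ap_at_iou(t) for t in thresholds]))
```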
4.1. Quantitative Evaluation
4.2. Qualitative Evaluation
4.3. Ablation Study
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, Q.; Su, Y.; Gao, Y. OAF-Net: An Occlusion-Aware Anchor-Free Network for Pedestrian Detection in a Crowd. IEEE Trans. Intell. Transp. Syst. 2022, 1–10.
2. Kumar, R.; Deb, A. A Sparse-Dense HOG Window Sampling Technique for Fast Pedestrian Detection in Aerial Images. In Proceedings of the International Conference on Electrical and Electronics Engineering, Virtual, 27–28 August 2022; Springer: Singapore, 2022; pp. 437–450.
3. Du, R.; Zhao, J.; Xie, J. Pedestrian Detection Based on Deep Learning Under the Background of University Epidemic Prevention. In Proceedings of the International Conference on Electrical and Electronics Engineering, Virtual, 27–28 August 2022; Springer: Cham, Switzerland, 2022; pp. 192–202.
4. Wang, Y.; Yang, H. Multi-Target Pedestrian Tracking Based on the YOLOv5 and DeepSORT. In Proceedings of the IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2022; pp. 508–514.
5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
6. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
7. Ganesh, P.; Chen, Y.; Yang, Y.; Chen, D.; Winslett, M. YOLO-ReT: Towards High Accuracy Real-Time Object Detection on Edge GPUs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 3267–3277.
8. Barba-Guaman, L.; Eugenio Naranjo, J.; Ortiz, A. Deep Learning Framework for Vehicle and Pedestrian Detection in Rural Roads on an Embedded GPU. Electronics 2020, 9, 589.
9. Bochkovskiy, A.; Wang, C.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
10. Jiang, Z.; Zhao, L.; Li, S. Real-Time Object Detection Method Based on Improved YOLOv4-tiny. arXiv 2020, arXiv:2011.04244.
11. Woo, S.; Park, J.; Lee, J. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
12. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
13. Lei, J.; Chen, Y.; Peng, B.; Huang, Q.; Ling, N.; Hou, C. Multi-Stream Region Proposal Network for Pedestrian Detection. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), San Diego, CA, USA, 23–27 July 2018; pp. 1–16.
14. Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-Level Semantic Feature Detection: A New Perspective for Pedestrian Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5187–5196.
15. Zhang, L.; Lin, L.; Liang, X.; He, K. Is Faster R-CNN Doing Well for Pedestrian Detection? In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 443–457.
16. Li, J.; Liang, X.; Shen, S.; Xu, T.; Feng, J.; Yan, S. Scale-Aware Fast R-CNN for Pedestrian Detection. IEEE Trans. Multimed. 2017, 20, 443–457.
17. Yan, Y.; Ni, B.; Song, Z.; Ma, C.; Yan, Y.; Yang, X. Person Re-Identification via Recurrent Feature Aggregation. IEEE Trans. Multimed. 2016, 23, 443–457.
18. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size. arXiv 2016, arXiv:1602.07360.
19. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
20. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
21. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
22. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 74–81.
23. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586.
24. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 13–19 June 2020; pp. 1571–1580.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
26. Wang, Q.; Wu, B.; Zhu, P. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
27. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122.
28. Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild. IEEE Trans. Multimed. 2020, 22, 380–393.
29. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A Diverse Dataset for Pedestrian Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221.
30. Fan, H.; Liu, S.; Ferianc, M.; Ng, H.C.; Que, Z.; Liu, S.; Luk, W. A Real-Time Object Detection Accelerator with Compressed SSDLite on FPGA. In Proceedings of the International Conference on Field-Programmable Technology (FPT), Naha, Japan, 10–14 December 2018; pp. 14–21.
31. Gong, H.; Li, H.; Xu, K.; Zhang, Y. Object Detection Based on Improved YOLOv3-tiny. In Proceedings of the Chinese Automation Congress (CAC), Hangzhou, China, 22–24 December 2019; pp. 3240–3245.
| Algorithm | Parameters/MB | Precision/% | Recall/% | AP/% | FPS |
|---|---|---|---|---|---|
| SSD-Lite | 32.5 | 53.1 | 46.4 | 48.7 | 22 |
| YOLOv3-tiny | 35.6 | 51.6 | 43.7 | 46.9 | 20 |
| YOLO-Slim | 28.7 | 57.8 | 51.2 | 53.4 | 25 |
| YOLOv4-tiny | 25.4 | 55.9 | 50.5 | 51.6 | 29 |
| YOLOv4-TP-tiny | 22.5 | 58.3 | 53.7 | 55.4 | 31 |
| Algorithm | Parameters/MB | Precision/% | Recall/% | AP/% | FPS |
|---|---|---|---|---|---|
| Faster R-CNN | 528.7 | 69.4 | 63.6 | 67.4 | 28 |
| SSD | 231.6 | 61.7 | 56.8 | 59.8 | 47 |
| YOLOv4 | 244.2 | 70.9 | 67.2 | 68.4 | 69 |
| YOLOv4-TP | 251.3 | 73.7 | 69.4 | 72.9 | 64 |
| Model | FPS | Parameters/MB | AP/% |
|---|---|---|---|
| YOLOv4 | 8 | 244.2 | 61.3 |
| YOLOv4-Tiny | 29 | 25.4 | 51.6 |
| YOLOv4-Mobilenetv2 | 21 | 46.8 | 48.3 |
| YOLOv4-Ghostnet | 25 | 42.7 | 50.7 |
| YOLOv4-TP | 7 | 251.3 | 64.9 |
| YOLOv4-TP-Tiny | 31 | 22.5 | 55.4 |