Fast and Accurate Detection of Dim and Small Targets for Smart Micro-Light Sight
Abstract
1. Introduction
- Insufficient features: Small-scale objects cover few pixels and therefore retain little information. Their feature representations are weak, and in typical application scenarios of object detection they additionally suffer from low resolution and blurred backgrounds.
- Information loss: Repeated convolution and pooling operations in convolutional neural networks inevitably discard semantic information about small-scale objects. At the same time, the feature maps of small-scale objects contain a larger proportion of irrelevant background, which further weakens their feature representation.
- Complex background: In urban streets, parks, and similar scenes, objects are surrounded by complex backgrounds such as buildings, trees, and vehicles, all of which can interfere with detection.
- Difficult detection and localization: Because of their small size, small-scale objects can occupy a wider range of positions, such as corners or regions overlapping other objects. In cluttered scenes it is also hard to distinguish them from noise and to locate their boundaries accurately, so detection demands higher localization precision.
- (1) We propose a lightweight task-aligned dynamic detection head (LTD_Head) to address the misalignment between the classification and localization tasks at prediction time, which arises from differences in their feature spatial distributions. Coordinating the two tasks yields more accurate and consistent predictions (a hedged formulation of the underlying alignment metric is given after this list).
- (2) We introduce an additional detection layer into the network architecture to extract detailed deep features of dim and small objects, mitigating information loss while improving the network's generalization across object scales.
- (3) We propose an adaptive channel convolution module (ACConv) that convolves only the channels assigned larger weights by an SE module, significantly reducing parameters, computation, and redundant feature processing (see the sketch after this list).
- (4) We introduce a large separable kernel attention (LSKA) module into the pyramid feature layer so the model can dynamically adjust the weights of its feature maps and focus on the regions most relevant to the current task (illustrated after this list).
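For contribution (1), task alignment in TOOD-style dynamic heads is typically driven by a joint classification-localization metric. As a hedged illustration of what such a head optimizes (LTD_Head's exact formulation may differ):

```latex
% Task-alignment metric of TOOD-style aligned heads; an assumed
% illustration, not a formula quoted from this paper.
t = s^{\alpha} \cdot u^{\beta}
```

where $s$ is the predicted classification score, $u$ is the IoU between the predicted and ground-truth boxes, and $\alpha$, $\beta$ balance the two tasks; anchors with large $t$ are treated as well-aligned positives, pushing classification and localization toward consistent predictions.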
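For contribution (3), the following minimal PyTorch sketch illustrates the ACConv idea: an SE-style gate scores every channel, and only the top-weighted fraction is convolved while the rest pass through unchanged. The class name, gate layout, and the `ratio` parameter are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ACConvSketch(nn.Module):
    """Hypothetical sketch of adaptive channel convolution (ACConv):
    an SE-style gate scores channels; only the top-weighted fraction
    is convolved, the remaining channels bypass the convolution."""

    def __init__(self, channels: int, kernel_size: int = 3, ratio: float = 0.5):
        super().__init__()
        self.k = max(1, int(channels * ratio))   # channels selected for convolution
        self.gate = nn.Sequential(               # SE-style channel weighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, max(1, channels // 4), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(max(1, channels // 4), channels, 1),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(self.k, self.k, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.gate(x).view(b, c)          # (B, C) channel weights
        idx = weights.topk(self.k, dim=1).indices  # k largest-weight channels
        out = x.clone()                            # unselected channels pass through
        for i in range(b):                         # per-sample gather/scatter, for clarity
            selected = x[i, idx[i]].unsqueeze(0)   # (1, k, H, W)
            out[i, idx[i]] = self.conv(selected).squeeze(0)
        return out


# Example: convolve only the 32 highest-weighted of 64 channels.
y = ACConvSketch(64)(torch.randn(2, 64, 32, 32))   # -> torch.Size([2, 64, 32, 32])
```

Because the convolution operates on `k < C` channels, both its parameter count and its per-pixel FLOPs shrink roughly with `ratio`, which is the saving the contribution describes.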
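For contribution (4), a hedged sketch of large separable kernel attention, following the published LSKA decomposition of one large depthwise kernel into cascaded 1-D depthwise convolutions (a short plain pair plus a longer dilated pair) and a pointwise convolution; the kernel size and dilation chosen here are illustrative defaults, not values from this paper.

```python
import torch
import torch.nn as nn


class LSKASketch(nn.Module):
    """Hedged sketch of large separable kernel attention (LSKA):
    a large depthwise kernel is factorised into cascaded 1-D
    horizontal/vertical depthwise convs, then a 1x1 conv; the
    result multiplicatively re-weights the input features."""

    def __init__(self, dim: int, k: int = 21, dilation: int = 3):
        super().__init__()
        d = dilation
        local = 2 * d - 1     # 5: short plain 1-D kernels
        remote = -(-k // d)   # 7: ceil(k / d), long dilated 1-D kernels
        self.h0 = nn.Conv2d(dim, dim, (1, local), padding=(0, local // 2), groups=dim)
        self.v0 = nn.Conv2d(dim, dim, (local, 1), padding=(local // 2, 0), groups=dim)
        self.h1 = nn.Conv2d(dim, dim, (1, remote), padding=(0, d * (remote // 2)),
                            dilation=(1, d), groups=dim)
        self.v1 = nn.Conv2d(dim, dim, (remote, 1), padding=(d * (remote // 2), 0),
                            dilation=(d, 1), groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)  # pointwise channel mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.v1(self.h1(self.v0(self.h0(x)))))
        return x * attn       # attention map modulates the input
```

The separable 1-D pairs reduce the cost of a k x k depthwise kernel to roughly O(k) per output position while preserving a large receptive field, which is what makes large-kernel attention affordable inside a pyramid feature layer.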
2. Related Works
2.1. Traditional Object Detection Methods
2.2. Deep Learning Object Detection Methods
2.3. Small Object Detection Methods
3. Methodology
3.1. Overall Architecture
3.2. Adaptive Channel Convolution (ACConv)
3.3. Lightweight Task-Aligned Dynamic Detection Head (LTD_Head)
3.4. Spatial Pyramid Pooling-Faster with Large Separable Kernel (SPPFLska)
4. Experimental Results and Analysis
4.1. Experimental Settings and Dataset
4.2. Evaluation Metrics
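The tables that follow report precision (P), recall (R), and mAP@0.5. Assuming the standard definitions (the paper's exact evaluation protocol may differ in minor details):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
\mathrm{mAP@0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i
```

where AP is computed per class from the precision-recall curve at an IoU threshold of 0.5 and averaged over the N classes. Params/M is the parameter count in millions, and GFLOPs measures the computation of one forward pass.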
4.3. Experimental Results and Analysis
4.3.1. Ablation Experiment
4.3.2. Comparison of Different Detection Algorithms
4.3.3. Generalization Experiment
4.4. Detection Results
4.4.1. Detection Result Comparison on the Datasets
4.4.2. Detection Results on the Micro-Light Sight
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Experimental environment:

Name | Parameter |
---|---|
Operating system | Linux |
Programming language | Python 3.8 |
CPU | Intel(R) Xeon(R) Silver 4210 CPU @ 2.20 GHz |
GPU | NVIDIA GeForce RTX 2080 |
PyTorch | 1.11.0 |
CUDA | 11.6 |
Training hyperparameters:

Hyperparameter | Value |
---|---|
Epochs | 300 |
Batch size | 8 |
Initial learning rate | 0.01 |
Optimizer | SGD |
Ablation experiments (✓ indicates the module is enabled; rows Exp1 to Exp4 each enable a single module, in column order):

Model | SPPFLska | ACConv | LTD_Head | Add One Head | mAP@0.5/% | P/% | R/% | Params/M |
---|---|---|---|---|---|---|---|---|
Baseline | | | | | 62.28 | 77.34 | 53.51 | 3.01 |
Exp1 | ✓ | | | | 63.29 | 82.13 | 50.76 | 3.21 |
Exp2 | | ✓ | | | 62.46 | 78.49 | 53.14 | 2.82 |
Exp3 | | | ✓ | | 63.11 | 77.86 | 52.18 | 1.94 |
Exp4 | | | | ✓ | 64.30 | 78.72 | 54.28 | 2.92 |
Exp5 | ✓ | ✓ | ✓ | ✓ | 66.60 | 79.5 | 56.71 | 2.03 |
Comparison of different detection algorithms:

Model | mAP@0.5/% | P/% | R/% | Params/M | GFLOPs |
---|---|---|---|---|---|
Faster R-CNN | 39.5 | 69.4 | 63.6 | 136.8 | 370.0 |
RepLoss | 39.1 | 50.3 | 46.9 | 112.2 | 183 |
ALFNet | 43.1 | 55.9 | 48.7 | 23.5 | 171 |
GoogLeNet | 42.63 | 63.6 | 47.3 | 5 | 2 |
SSD | 33.3 | 51.7 | 56.8 | 26.3 | 8.5 |
RetinaNet | 41.3 | 52.1 | 47.2 | 34.3 | 37.4 |
DETR | 43.4 | 56.0 | 48.2 | 41 | 86 |
YOLOv3 | 40.6 | 49.7 | 50.6 | 61.9 | 66.3 |
YOLOv4-tiny | 30.2 | 65.4 | 51.3 | 6.4 | 21.8 |
YOLOv5s | 53.4 | 67.7 | 50.3 | 7.2 | 16.6 |
YOLOv7-tiny | 54.6 | 71.2 | 48.9 | 6.0 | 13.2 |
YOLOv8n | 62.3 | 77.3 | 53.5 | 3.0 | 8.9 |
DS_YOLO (ours) | 66.6 | 79.5 | 56.7 | 2.0 | 6.4 |
Generalization experiments on the WiderPerson, DOTA, and TinyPerson datasets (Params/M and GFLOPs are model properties and identical across the three datasets):

Model | Params/M | GFLOPs | WiderPerson mAP@0.5/% | DOTA mAP@0.5/% | TinyPerson mAP@0.5/% |
---|---|---|---|---|---|
Faster R-CNN | 136.8 | 370.0 | 61.5 | 54.1 | 5.1 |
RepLoss | 112.2 | 183 | 61.2 | 65.2 | 3.6 |
ALFNet | 23.5 | 171 | 60.1 | 72.1 | 44 |
GoogLeNet | 5 | 2 | 59 | 49.5 | 6.2 |
SSD | 26.3 | 8.5 | 57.4 | 50.1 | 3.7 |
RetinaNet | 34.3 | 37.4 | 52.0 | 29.9 | 3.9 |
DETR | 41 | 86 | 49.8 | 53.5 | 6.3 |
YOLOv3 | 61.9 | 66.3 | 60.3 | 60.0 | 16.3 |
YOLOv4-tiny | 6.4 | 21.8 | 64.1 | 64.4 | 10.9 |
YOLOv5s | 7.2 | 16.6 | 68.2 | 67.9 | 11.3 |
YOLOv7-tiny | 6.0 | 13.2 | 70.6 | 70.1 | 13.2 |
YOLOv8n | 3.0 | 8.9 | 74.6 | 81.2 | 18.1 |
DS_YOLO (ours) | 2.0 | 6.4 | 75.57 | 82.8 | 18.9 |