WCNN3D: Wavelet Convolutional Neural Network-Based 3D Object Detection for Autonomous Driving
Abstract
1. Introduction
2. Related Work
2.1. 3D Object Detection Using Camera Images
2.2. 3D Object Detection Using LiDAR Point Clouds
2.3. Contributions
1. To the best of our knowledge, this is the first 3D object detection work to use a wavelet-based convolutional neural network that incorporates the high-frequency components into the spatial domain to enhance performance.
2. The model uses high-frequency and low-frequency filters that cover a large spatial extent, enlarging the receptive field and improving detection performance.
3. We removed the standard CNN pooling operation, which discards information, and instead subsample with the discrete wavelet transform (DWT); the biorthogonal property of wavelets allows subsampling without information loss. We then apply the inverse wavelet transform (IWT) to fully recover the lost detail information and enrich the feature maps fed to the detection head (a minimal sketch of this DWT/IWT step is shown after this list).
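As a concrete illustration of the third contribution, the sketch below shows one level of a Haar DWT used as a pooling-free downsampling step and the matching IWT used to restore resolution. This is only a minimal sketch assuming a PyTorch implementation and an example tensor shape; the actual WCNN3D network also uses Db4 wavelets, multiple decomposition levels, and convolutional blocks between the transforms.

```python
# Minimal sketch (not the authors' code): one-level Haar DWT as a lossless,
# pooling-free downsampling step, and its inverse (IWT) restoring resolution.
# Assumes PyTorch and feature maps with even height and width.
import torch

def dwt_haar(x: torch.Tensor):
    """Split an (N, C, H, W) map into LL, LH, HL, HH sub-bands of size H/2 x W/2."""
    a = x[..., 0::2, 0::2]  # even rows, even cols
    b = x[..., 0::2, 1::2]  # even rows, odd cols
    c = x[..., 1::2, 0::2]  # odd rows, even cols
    d = x[..., 1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2  # low-frequency approximation
    hl = (a - b + c - d) / 2  # high-frequency detail along width
    lh = (a + b - c - d) / 2  # high-frequency detail along height
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def iwt_haar(ll, lh, hl, hh):
    """Inverse Haar transform: reconstructs the original map exactly."""
    n, ch, h, w = ll.shape
    x = ll.new_zeros(n, ch, 2 * h, 2 * w)
    x[..., 0::2, 0::2] = (ll + hl + lh + hh) / 2
    x[..., 0::2, 1::2] = (ll - hl + lh - hh) / 2
    x[..., 1::2, 0::2] = (ll + hl - lh - hh) / 2
    x[..., 1::2, 1::2] = (ll - hl - lh + hh) / 2
    return x

# In a WCNN-style block, the four sub-bands (concatenated along the channel axis)
# would feed the next convolution instead of a pooled map, and the IWT restores
# spatial resolution before the detection head.
feat = torch.randn(2, 64, 248, 216)  # assumed example shape of a pillar pseudo-image
ll, lh, hl, hh = dwt_haar(feat)
assert torch.allclose(iwt_haar(ll, lh, hl, hh), feat, atol=1e-5)  # no information loss
```

Because the four H/2 × W/2 sub-bands together hold exactly as many values as the original H × W map, stacking them along the channel dimension downsamples spatially without discarding information, unlike max pooling.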
3. Wavelet-Based Convolutional Neural Network
4. WCNN3D Network Architecture and Implementation Details
4.1. Pillar FeatureNet
4.2. Base Network
4.3. Detection Head
4.4. Loss
4.5. Dataset and Data Augmentation
4.5.1. Dataset
4.5.2. 3D Data Augmentation
5. Results and Discussions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
CNN | Convolutional Neural Network
DL | Deep Learning
DWT | Discrete Wavelet Transform
IOU | Intersection Over Union
IWT | Inverse Wavelet Transform
mAP | Mean Average Precision
MWCNN | Multi-level Wavelet Convolutional Neural Network
NMS | Non-maximal Suppression
ReLU | Rectified Linear Unit
RCNN | Region-based Convolutional Neural Network
SSD | Single Shot MultiBox Detector
References
- Waymo. Waymo Driver-Waymo; Technical Report; WAYMO: Mountain View, CA, USA, 2022. [Google Scholar]
- Dechant, M. Self-Driving Car Technology—Between Man and Machine; Technical Report; Bosch: Gerlingen, Germany, 2022. [Google Scholar]
- Dong, Y.; Su, H.; Zhu, J.; Zhang, B. Improving interpretability of deep neural networks with semantic information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4306–4314. [Google Scholar]
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833. [Google Scholar]
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
- Rippel, O.; Snoek, J.; Adams, R.P. Spectral Representations for Convolutional Neural Networks. arXiv 2015, arXiv:1506.03767. Available online: http://xxx.lanl.gov/abs/1506.03767 (accessed on 28 August 2022).
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
- Alaba, S.; Ball, J. Deep Learning-based Image 3D Object Detection for Autonomous Driving: Review. TechRxiv 2022, Preprint. [Google Scholar] [CrossRef]
- Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
- Ma, X.; Wang, Z.; Li, H.; Zhang, P.; Ouyang, W.; Fan, X. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6851–6860. [Google Scholar]
- Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals for accurate object class detection. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 424–432. [Google Scholar]
- Shi, Y.; Mi, Z.; Guo, Y. Stereo CenterNet based 3D Object Detection for Autonomous Driving. arXiv 2021, arXiv:2103.11071. [Google Scholar] [CrossRef]
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082. [Google Scholar]
- Liu, Z.; Wu, Z.; Tóth, R. Smoke: Single-stage monocular 3D object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 996–997. [Google Scholar]
- Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
- Graham, B.; van der Maaten, L. Submanifold sparse convolutional networks. arXiv 2017, arXiv:1706.01307. [Google Scholar]
- Kuang, H.; Wang, B.; An, J.; Zhang, M.; Zhang, Z. Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LIDAR point clouds. Sensors 2020, 20, 704. [Google Scholar] [CrossRef] [PubMed]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
- Barrera, A.; Guindel, C.; Beltrán, J.; García, F. Birdnet+: End-to-end 3D object detection in lidar bird’s eye view. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6. [Google Scholar]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10529–10538. [Google Scholar]
- Caine, B.; Roelofs, R.; Vasudevan, V.; Ngiam, J.; Chai, Y.; Chen, Z.; Shlens, J. Pseudo-labeling for Scalable 3D Object Detection. arXiv 2021, arXiv:2103.02093. [Google Scholar]
- Xu, Q.; Zhou, Y.; Wang, W.; Qi, C.R.; Anguelov, D. Spg: Unsupervised domain adaptation for 3D object detection via semantic point generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15446–15456. [Google Scholar]
- Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3D object detection in lidar point clouds. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020; pp. 923–932. [Google Scholar]
- Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet convolutional neural networks. arXiv 2018, arXiv:1805.08620. [Google Scholar]
- Li, Q.; Shen, L. Wavesnet: Wavelet integrated deep networks for image segmentation. arXiv 2020, arXiv:2005.14461. [Google Scholar]
- Liu, P.; Zhang, H.; Lian, W.; Zuo, W. Multi-level wavelet convolutional neural networks. IEEE Access 2019, 7, 74973–74985. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Alaba, S.; Gurbuz, A.; Ball, J. A Comprehensive Survey of Deep Learning Multisensor Fusion-based 3D Object Detection for Autonomous Driving: Methods, Challenges, Open Issues, and Future Directions. TechRxiv 2022, Preprint. [Google Scholar] [CrossRef]
- Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet convolutional neural networks for texture classification. arXiv 2017, arXiv:1707.07394. [Google Scholar]
- Kanchana, M.; Varalakshmi, P. Texture classification using discrete shearlet transform. Int. J. Sci. Res. 2013, 5, 3. [Google Scholar] [CrossRef]
- Mallat, S.G. Multifrequency channel decompositions of images and wavelet models. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 2091–2110. [Google Scholar] [CrossRef]
- Hien, D. A Guide to Receptive Field Arithmetic for Convolutional Neural Networks. 2017. Available online: https://medium.com/syncedreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-42f33d4378e0 (accessed on 28 August 2022).
- Araujo, A.; Norris, W.; Sim, J. Computing Receptive Fields of Convolutional Neural Networks. Distill 2019. Available online: https://distill.pub/2019/computing-receptive-fields (accessed on 28 August 2022). [CrossRef]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the ICML, Haifa, Israel, 21–24 June 2010. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Yang, B.; Luo, W.; Urtasun, R. Pixor: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7652–7660. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
- Yang, B.; Liang, M.; Urtasun, R. Hdnet: Exploiting hd maps for 3D object detection. In Proceedings of the Conference on Robot Learning, Zurich, Switzerland, 29–31 October 2018; pp. 146–155. [Google Scholar]

Performance comparison on the KITTI bird's-eye-view (BEV) detection benchmark (AP, %); mAP is the mean over the three classes at the moderate difficulty level.

Methods | mAP | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard
---|---|---|---|---|---|---|---|---|---|---
MV3D [24] | - | 86.62 | 78.93 | 69.80 | - | - | - | - | - | - |
F-pointNet [49] | 65.20 | 91.17 | 84.67 | 74.77 | 57.13 | 49.57 | 45.48 | 77.26 | 61.37 | 53.78 |
PIXOR++ [50] | - | 89.38 | 83.70 | 77.97 | - | - | - | - | - | - |
VoxelNet [17] | 58.25 | 89.35 | 79.26 | 77.39 | 46.13 | 40.74 | 38.11 | 66.70 | 54.76 | 50.55 |
SECOND [18] | 60.56 | 88.07 | 79.37 | 77.95 | 55.10 | 46.27 | 44.76 | 73.67 | 56.04 | 48.78 |
PointPillars [9] | 66.19 | 88.35 | 86.10 | 79.83 | 58.66 | 50.23 | 47.19 | 79.19 | 62.25 | 56.00 |
PV-RCNN [26] | 70.04 | 94.98 | 90.65 | 86.14 | 59.86 | 50.57 | 46.74 | 82.49 | 68.89 | 62.41 |
WCNN3D (Haar-2L) (ours) | 70.99 | 89.77 | 87.76 | 86.32 | 67.12 | 62.71 | 59.04 | 80.74 | 62.49 | 59.28
WCNN3D (Haar-3L) (ours) | 71.02 | 90.05 | 87.51 | 86.11 | 69.20 | 63.29 | 59.26 | 81.54 | 62.26 | 58.32 |
WCNN3D (Haar-4L) (ours) | - | 90.02 | 87.90 | 86.35 | - | - | - | - | - | - |
WCNN3D (Db4-2L) (ours) | 71.66 | 90.11 | 87.83 | 86.27 | 69.48 | 64.52 | 60.07 | 83.69 | 62.63 | 59.45 |
WCNN3D (Db4-3L) (ours) | 71.84 | 90.12 | 87.97 | 86.46 | 68.40 | 63.20 | 59.36 | 82.78 | 64.34 | 60.29 |
WCNN3D (Db4-4L) (ours) | - | 90.20 | 88.04 | 86.31 | - | - | - | - | - | -

Performance comparison on the KITTI 3D detection benchmark (AP, %); mAP is the mean over the three classes at the moderate difficulty level.

Methods | mAP | Car Easy | Car Mod. | Car Hard | Pedestrian Easy | Pedestrian Mod. | Pedestrian Hard | Cyclist Easy | Cyclist Mod. | Cyclist Hard
---|---|---|---|---|---|---|---|---|---|---
MV3D [24] | - | 74.97 | 63.63 | 54.00 | - | - | - | - | - | -
F-pointNet [49] | 56.04 | 82.19 | 69.79 | 60.59 | 50.53 | 42.15 | 38.08 | 72.27 | 56.17 | 49.01 |
VoxelNet [17] | 49.05 | 77.47 | 65.11 | 57.73 | 39.48 | 33.69 | 31.5 | 61.22 | 48.36 | 44.37 |
SECOND [18] | 56.69 | 83.13 | 73.66 | 66.20 | 41.07 | 42.56 | 37.29 | 70.51 | 53.85 | 46.90 |
PointPillars [9] | 59.20 | 79.05 | 74.99 | 68.30 | 52.08 | 43.53 | 41.49 | 75.78 | 59.07 | 52.92 |
PV-RCNN [26] | 62.81 | 90.25 | 81.43 | 76.82 | 52.17 | 43.29 | 40.29 | 78.60 | 63.71 | 57.65 |
WCNN3D (Haar-2L) (ours) | 64.36 | 87.09 | 77.39 | 75.49 | 58.62 | 54.57 | 50.02 | 79.56 | 61.13 | 57.37
WCNN3D (Haar-3L) (ours) | 63.16 | 87.80 | 77.61 | 75.71 | 58.86 | 53.41 | 48.94 | 79.74 | 58.47 | 54.71 |
WCNN3D (Haar-4L) (ours) | - | 87.84 | 77.67 | 76.00 | - | - | - | - | - | - |
WCNN3D (Db4-2L) (ours) | 65.41 | 87.75 | 77.56 | 75.40 | 61.93 | 57.67 | 52.06 | 82.74 | 61.01 | 57.66 |
WCNN3D (Db4-3L) (ours) | 64.16 | 88.57 | 78.04 | 76.16 | 57.88 | 53.75 | 49.82 | 80.14 | 60.70 | 56.70 |
WCNN3D (Db4-4L) (ours) | - | 87.83 | 77.75 | 75.74 | - | - | - | - | - | - |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).