DOI: 10.1007/978-981-96-0972-7_14

OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

Published: 10 December 2024

Abstract

3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods relied on depth estimation to process 2D image features. However, depth estimation is an ill-posed problem, which limits the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. To address these issues, we propose OccFusion, a depth-estimation-free multi-modal fusion framework. Additionally, we introduce a generalizable active training method and an active decoder that can be applied to any occupancy prediction model, with the potential to enhance its performance. Experiments conducted on the nuScenes-Occupancy and nuScenes-Occ3D benchmarks demonstrate our framework’s superior performance. Detailed ablation studies highlight the effectiveness of each proposed method.
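This page does not reproduce the paper’s architecture, but the central idea stated in the abstract — replacing depth-based lifting of 2D image features with a fusion scheme that needs no depth estimate — can be illustrated with a common projection-and-sample pattern. The sketch below is a minimal, hypothetical PyTorch example, not the authors’ implementation: it projects 3D voxel centers into the image plane, bilinearly samples 2D features at those pixels (so no per-pixel depth distribution is ever predicted), and fuses the result with voxelized LiDAR features. All names, shapes, and the fusion head (`DepthFreeFusion`, `project_voxels_to_image`) are assumptions for illustration.

```python
# Minimal, hypothetical sketch of depth-estimation-free camera-LiDAR fusion
# for 3D occupancy prediction. Rather than predicting per-pixel depth to
# "lift" image features into 3D, each 3D voxel center is projected into the
# image and 2D features are sampled at that pixel, then fused with LiDAR
# voxel features. Names and shapes are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


def project_voxels_to_image(voxel_centers, K, E):
    """Project (N, 3) world-frame voxel centers to pixel coordinates.

    K: (3, 3) camera intrinsics; E: (4, 4) world-to-camera extrinsics.
    Returns (N, 2) pixel coords and an (N,) mask of points in front of
    the camera.
    """
    ones = torch.ones_like(voxel_centers[:, :1])
    pts_cam = (E @ torch.cat([voxel_centers, ones], dim=1).T).T[:, :3]
    valid = pts_cam[:, 2] > 0.1                              # in front of camera
    pts_img = (K @ pts_cam.T).T
    pix = pts_img[:, :2] / pts_img[:, 2:3].clamp(min=1e-5)   # perspective divide
    return pix, valid


class DepthFreeFusion(nn.Module):
    """Fuse sampled image features with LiDAR voxel features (concat + MLP)."""

    def __init__(self, img_dim=256, lidar_dim=256, hidden=256, num_classes=17):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + lidar_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        )
        self.occ_head = nn.Linear(hidden, num_classes)  # per-voxel semantic logits

    def forward(self, img_feats, voxel_centers, lidar_feats, K, E, img_hw):
        # img_feats: (C, Hf, Wf) backbone feature map; lidar_feats: (N, C_l)
        pix, valid = project_voxels_to_image(voxel_centers, K, E)
        h, w = img_hw
        # Normalize pixel coords to [-1, 1] for grid_sample (x first, then y).
        grid = torch.stack(
            [pix[:, 0] / (w - 1) * 2 - 1, pix[:, 1] / (h - 1) * 2 - 1], dim=1
        )
        sampled = F.grid_sample(
            img_feats[None], grid[None, :, None, :], align_corners=True
        )[0, :, :, 0].T                                  # (N, C)
        sampled = sampled * valid[:, None].float()       # zero out-of-view voxels
        return self.occ_head(self.fuse(torch.cat([sampled, lidar_feats], dim=1)))
```

Because every voxel query samples image features at a deterministic reprojection, this pattern sidesteps the ill-posed monocular depth problem the abstract identifies, at the cost of smearing a pixel’s features along its entire camera ray; in a multi-sensor setting, the LiDAR branch is what disambiguates occupancy along that ray.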

Published In

Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part X
Dec 2024
505 pages
ISBN: 978-981-96-0971-0
DOI: 10.1007/978-981-96-0972-7
Editors: Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, Hongbin Zha

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 10 December 2024

Author Tags

  1. 3D feature learning
  2. 3D occupancy prediction
  3. Multi-modal learning
  4. Depth estimation free
  5. Multi-sensor fusion

Qualifiers

  • Article
