
3D Data Augmentation for Driving Scenes on Camera

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15036)

Abstract

Driving scenes are so diverse and complicated that it is impossible to collect all cases with human effort alone. While data augmentation is an effective technique to enrich the training data, existing methods for camera data in autonomous driving applications are confined to the 2D image plane, which may not optimally increase data diversity in 3D real-world scenarios. To this end, we propose a 3D data augmentation approach termed Drive-3DAug, aiming at augmenting driving scenes on camera in 3D space. We first utilize Neural Radiance Fields (NeRF) to reconstruct 3D models of background and foreground objects. Augmented driving scenes can then be obtained by placing the 3D objects, with adapted locations and orientations, in pre-defined valid regions of the backgrounds. As such, the training database can be effectively scaled up. However, 3D object modeling is constrained by image quality and limited viewpoints. To overcome these problems, we modify the original NeRF by introducing a geometric rectified loss and a symmetric-aware training strategy. We evaluate our method on the camera-only monocular 3D detection task on the Waymo and nuScenes datasets. The proposed data augmentation approach contributes gains of \(1.7\%\) and \(1.4\%\) in detection accuracy on Waymo and nuScenes, respectively. Furthermore, the constructed 3D models serve as digital driving assets and can be recycled for different detectors or other 3D perception tasks.

Equal contribution.


References

  1. Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: ICCV, pp. 9287–9296 (2019)
  2. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: a multimodal dataset for autonomous driving. In: CVPR, pp. 11621–11631 (2020)
  3. Choi, J., Song, Y., Kwak, N.: Part-aware data augmentation for 3D object detection in point cloud. In: IROS, pp. 3391–3397 (2021)
  4. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: AutoAugment: learning augmentation policies from data (2018). arXiv:1805.09501
  5. Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: fewer views and faster training for free. In: CVPR, pp. 12882–12891 (2022)
  6. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: CoRL, pp. 1–16 (2017)
  7. Fang, J., Zuo, X., Zhou, D., Jin, S., Wang, S., Zhang, L.: LiDAR-Aug: a general rendering-based augmentation framework for 3D object detection. In: CVPR, pp. 4710–4720 (2021)
  8. Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks. In: CVPR, pp. 5501–5510 (2022)
  9. Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR, pp. 2918–2928 (2021)
  10. Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3D packing for self-supervised monocular depth estimation. In: CVPR, pp. 2485–2494 (2020)
  11. Hung, W.C., Kretzschmar, H., Casser, V., Hwang, J.J., Anguelov, D.: LET-3D-AP: longitudinal error tolerant 3D average precision for camera-only 3D detection (2022). arXiv:2206.07705
  12. Hung, W.C., Kretzschmar, H., Casser, V., Hwang, J.J., Anguelov, D.: LET-3D-AP: longitudinal error tolerant 3D average precision for camera-only 3D detection (2022). arXiv:2206.07705
  13. Kundu, A., Genova, K., Yin, X., Fathi, A., Pantofaru, C., Guibas, L.J., Tagliasacchi, A., Dellaert, F., Funkhouser, T.A.: Panoptic neural fields: a semantic object-aware neural scene representation. In: CVPR, pp. 12861–12871 (2022)
  14. Li, H., Li, Y., Wang, H., Zeng, J., Xu, H., Cai, P., Chen, L., Yan, J., Xu, F., Xiong, L., Wang, J., Zhu, F., Yan, K., Xu, C., Wang, T., Xia, F., Mu, B., Peng, Z., Lin, D., Qiao, Y.: Open-sourced data ecosystem in autonomous driving: the present and future (2024)
  15. Li, H., Sima, C., Dai, J., Wang, W., Lu, L., Wang, H., Zeng, J., Li, Z., Yang, J., Deng, H., Tian, H., Xie, E., Xie, J., Chen, L., Li, T., Li, Y., Gao, Y., Jia, X., Liu, S., Shi, J., Lin, D., Qiao, Y.: Delving into the devils of bird's-eye-view perception: a review, evaluation and recipe. IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–20 (2023). https://doi.org/10.1109/TPAMI.2023.3333838
  16. Li, P., Zhao, H., Liu, P., Cao, F.: RTM3D: real-time monocular 3D detection from object keypoints for autonomous driving. In: ECCV, pp. 644–660. Springer (2020)
  17. Li, Z., Li, L., Ma, Z., Zhang, P., Chen, J., Zhu, J.: READ: large-scale neural scene rendering for autonomous driving (2022). arXiv:2205.05509
  18. Lian, Q., Ye, B., Xu, R., Yao, W., Zhang, T.: Exploring geometric consistency for monocular 3D object detection. In: CVPR, pp. 1685–1694 (2022)
  19. Liu, Z., Wu, Z., Tóth, R.: SMOKE: single-stage monocular 3D object detection via keypoint estimation. In: CVPRW, pp. 996–997 (2020)
  20. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 99–106 (2020)
  21. Müller, N., Simonelli, A., Porzi, L., Bulò, S.R., Nießner, M., Kontschieder, P.: AutoRF: learning 3D object radiance fields from single view observations. In: CVPR, pp. 3971–3980 (2022)
  22. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15 (2022). https://doi.org/10.1145/3528223.3530127
  23. Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: deformable neural radiance fields. In: ICCV, pp. 5865–5874 (2021)
  24. Reuse, M., Simon, M., Sick, B.: About the ambiguity of data augmentation for 3D object detection in autonomous driving. In: ICCVW, pp. 979–987 (2021)
  25. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR, pp. 4104–4113 (2016)
  26. Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: CVPR, pp. 5459–5469 (2022)
  27. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR, pp. 2446–2454 (2020)
  28. Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-NeRF: scalable large scene neural view synthesis. In: CVPR, pp. 8248–8258 (2022)
  29. Wang, T., Zhu, X., Pang, J., Lin, D.: FCOS3D: fully convolutional one-stage monocular 3D object detection. In: CVPR (2021)
  30. Wang, X., Kong, T., Shen, C., Jiang, Y., Li, L.: SOLO: segmenting objects by locations. In: ECCV, pp. 649–665 (2020)
  31. Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In: CoRL, pp. 180–191. PMLR (2022)
  32. Weng, X., Kitani, K.: Monocular 3D object detection with pseudo-LiDAR point cloud. In: ICCVW (2019)
  33. Yang, J., Gao, S., Qiu, Y., Chen, L., Li, T., Dai, B., Chitta, K., Wu, P., Zeng, J., Luo, P., et al.: Generalized predictive model for autonomous driving (2024). arXiv:2403.09630
  34. Yang, Z., Chen, L., Sun, Y., Li, H.: Visual point cloud forecasting enables scalable autonomous driving (2023). arXiv:2312.17655
  35. Zhang, W., Wang, Z., Loy, C.C.: Exploring data augmentation for multi-modality 3D object detection (2020). arXiv:2012.12741
  36. Zoph, B., Cubuk, E.D., Ghiasi, G., Lin, T.Y., Shlens, J., Le, Q.V.: Learning data augmentation strategies for object detection. In: ECCV, pp. 566–583. Springer (2020)


Author information

Corresponding author

Correspondence to Hongyang Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3259 KB)

Appendix

A Experiments

A.1 Implementation Details

Building Digital Driving Asset. We use SOLO v2 [30] trained on COCO as the instance segmentation model for scene decomposition. We consider vehicle (Waymo) or car (nuScenes) as the foreground objects, since they are the most important components in driving scenes. For 3D model reconstruction, we use the same model configuration as DVGO [26] together with our proposed techniques. We use 30–40 consecutive frames spanning an area of about 100–200 m as one background and train each background model for 40,000 iterations. For the background voxel grid, we set the resolution to \(330^3\) with a voxel size of 0.25–0.3 m. The object models are trained with 20–60 consecutive frames, and we set the voxel size to 0.25 m, consistent with the background voxel size. The number of grid points is about 1,000. Considering the reconstruction cost and the limitations of NeRF under extreme illumination conditions, we select a subset of 100 sunny scenes from each dataset and use them for data augmentation.
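For concreteness, the reconstruction settings above can be gathered into a small configuration object, as in the minimal sketch below; the class and field names are hypothetical and only mirror the values stated in this section.

```python
from dataclasses import dataclass

@dataclass
class BackgroundConfig:
    """Hypothetical settings for one background model (values from Sec. A.1)."""
    num_frames: int = 40          # 30-40 consecutive frames per background
    span_m: float = 150.0         # each background covers roughly 100-200 m
    grid_resolution: int = 330    # background voxel grid of 330^3
    voxel_size_m: float = 0.275   # voxel size in the 0.25-0.3 m range
    train_iters: int = 40_000     # training iterations per background model

@dataclass
class ObjectConfig:
    """Hypothetical settings for one foreground object model."""
    num_frames: int = 40          # 20-60 consecutive frames per object
    voxel_size_m: float = 0.25    # kept consistent with the background voxels
    approx_grid_points: int = 1_000
```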

Applying Data Augmentation. Our method generates new data by rendering recomposed scenes from randomly selected 3D models. We manipulate each object in 3D space by jittering its 3D location and orientation and placing it in the valid region of an arbitrary background. The location jittering is defined by the maximum translation \((T_x, T_y)\) along the x-direction and the y-direction, and the orientation jittering is given by the maximum rotation angle \(T_{\theta }\). We consider two data augmentation strategies to validate the effectiveness of adding rotation and translation for 3D perception. The first is \({{\boldsymbol{Drive-3DAug w/o RT}}}\), in which we set translation \(T_x=0\), \(T_y=0\), and rotation \(T_{\theta }=0\), and 1–2 new objects are arbitrarily pasted into the background scene on average. The second is \({{\boldsymbol{Drive-3DAug w/ RT}}}\) with \(T_x=20\) m, \(T_y=5\) m, and \(T_{\theta }=30^{\circ }\). For determining the valid region in the background scene, we set the pillar \(Z_p\) resolution to 2 m \(\times \) 2 m, \(\delta _1=30\), and \(\delta _2=15\) in Eq. 6. We generate 12 new images for every background model. This process is offline data augmentation, and these images are repeatedly used by different detectors.
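The placement step can be pictured as sampling a random translation and rotation within the budgets above and rejecting poses that fall outside the pre-computed valid region. The sketch below is only illustrative; `valid_region.contains` and the object attributes are hypothetical helpers, not the released implementation.

```python
import math
import random

def jitter_pose(x, y, yaw, t_x=20.0, t_y=5.0, t_theta_deg=30.0):
    """Sample a pose within the maximum translation/rotation budget (w/ RT setting)."""
    new_x = x + random.uniform(-t_x, t_x)
    new_y = y + random.uniform(-t_y, t_y)
    new_yaw = yaw + math.radians(random.uniform(-t_theta_deg, t_theta_deg))
    return new_x, new_y, new_yaw

def place_objects(objects, valid_region, max_tries=50):
    """Assign each selected object a jittered pose that lands inside the valid region."""
    placements = []
    for obj in objects:
        for _ in range(max_tries):
            x, y, yaw = jitter_pose(obj.x, obj.y, obj.yaw)
            if valid_region.contains(x, y):  # hypothetical membership test on the pillar grid
                placements.append((obj, (x, y, yaw)))
                break
    return placements
```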

Training Detectors. Two typical camera-based monocular 3D object detectors, FCOS3D [29] and SMOKE [19], are used to investigate the performance of our proposed method, as they are among the most popular and commonly used mono3D detectors. We keep the detector hyperparameters identical across all experiments; they are given in the supplementary material. We use all scenes in the training set to train the detectors and evaluate them on the entire validation set. However, due to limited computational resources, we sample training data every 3 frames on Waymo during training. In addition, although our method can be applied to cameras of any view, we only consider images taken by the front camera on Waymo and nuScenes in this work, again because of computational resources. For the usage of the generated images, we randomly replace an image in a batch with the augmented data if it belongs to a scene in the digital driving asset.
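The way augmented images enter training can be sketched as a per-sample swap in the data loader; the replacement probability and the `augmented_pool` structure below are assumptions for illustration, not values reported in the paper.

```python
import random

def maybe_replace_with_augmented(sample, augmented_pool, p_replace=0.5):
    """Swap a real training sample for a pre-rendered augmented one when its scene
    belongs to the digital driving asset. p_replace is an assumed probability."""
    scene_id = sample["scene_id"]
    if scene_id in augmented_pool and random.random() < p_replace:
        aug = random.choice(augmented_pool[scene_id])
        sample = dict(sample, image=aug["image"], boxes_3d=aug["boxes_3d"])
    return sample
```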

A.2 Benchmark Details

Waymo Dataset. The Waymo Open Dataset [27] is a large-scale dataset for autonomous driving that contains 798 training scenes and 202 validation scenes. The image resolution of the front camera is \(1920 \times 1280\). Waymo uses LET-AP, the average precision with longitudinal error tolerance, to evaluate detection models. Waymo also adopts the LET-APL and LET-APH metrics, which are LET-AP weighted by longitudinal affinity and by heading accuracy, respectively.
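To make the matching idea concrete, the sketch below illustrates a longitudinal-error-tolerant affinity in the spirit of LET-3D-AP [11]; the tolerance fraction, the minimum tolerance, and the exact affinity form are assumptions based on the cited metric, not values defined in this paper.

```python
import numpy as np

def longitudinal_affinity(pred_center, gt_center, tol_frac=0.10, min_tol=0.5):
    """Longitudinal-error-tolerant affinity sketch (cf. LET-3D-AP [11]).
    tol_frac and min_tol are assumptions, not values from this paper."""
    gt_range = float(np.linalg.norm(gt_center))
    unit_los = np.asarray(gt_center) / max(gt_range, 1e-6)  # line of sight to the GT box
    err = np.asarray(pred_center) - np.asarray(gt_center)
    lon_err = abs(float(np.dot(err, unit_los)))              # error along the line of sight
    tol = max(tol_frac * gt_range, min_tol)
    if lon_err > tol:
        return 0.0                                           # outside tolerance: not matched
    return 1.0 - lon_err / tol                               # affinity used to weight LET-APL
```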

nuScenes Dataset. nuScenes [2] is a widely used benchmark for 3D object detection. It contains 700 training scenes and 150 validation scenes. The resolution of each image is 1600 \(\times \) 900. As for the metrics, nuScenes computes mAP using the center distance on the ground plane to match predicted boxes to the ground truth. It also defines several true positive (TP) metrics. We use ATE, ASE and AOE in this paper, which measure the errors in translation, scale and orientation, respectively.
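For illustration, a minimal sketch of the center-distance matching and the per-match TP errors is given below; the 0.5/1/2/4 m thresholds follow the standard nuScenes protocol, while the helper names and dictionary layout are hypothetical.

```python
import numpy as np

def center_distance_match(pred_center_bev, gt_center_bev, threshold_m=2.0):
    """A prediction matches a ground-truth box if their centers on the ground plane
    are within the threshold; nuScenes mAP averages over 0.5, 1, 2 and 4 m."""
    d = np.linalg.norm(np.asarray(pred_center_bev) - np.asarray(gt_center_bev))
    return d <= threshold_m

def size_iou(size_a, size_b):
    """3D IoU of two boxes after aligning centers and headings (sizes = [l, w, h])."""
    inter = float(np.prod(np.minimum(size_a, size_b)))
    union = float(np.prod(size_a)) + float(np.prod(size_b)) - inter
    return inter / union

def tp_errors(pred, gt):
    """Per-match TP errors: ATE in metres, ASE = 1 - aligned IoU, AOE = yaw gap in radians."""
    ate = float(np.linalg.norm(np.asarray(pred["center_bev"]) - np.asarray(gt["center_bev"])))
    ase = 1.0 - size_iou(np.asarray(pred["size"]), np.asarray(gt["size"]))
    aoe = abs((pred["yaw"] - gt["yaw"] + np.pi) % (2 * np.pi) - np.pi)
    return ate, ase, aoe
```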

A.3 Ablative Studies

Depth Supervision. We qualitatively and quantitatively investigate the effect of depth supervision on background model training and 3D augmentation. Table 4 shows that 3D augmentation based on a background model trained with depth supervision performs better (LET-AP 0.590 vs. 0.585). Figure 8 shows that NeRF can reconstruct the background with high quality given depth supervision, whereas the quality of the 3D background model degrades without depth information. Accordingly, LET-AP, LET-APH and LET-APL on car decrease slightly, by about 0.001, when 3D augmentation uses a background model trained without depth supervision.
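As a rough illustration of how depth supervision enters background training, a common recipe (in the spirit of depth-supervised NeRF [5]) adds a depth term on rays with valid LiDAR returns; the loss weight and the L1 form below are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def depth_supervised_loss(pred_rgb, gt_rgb, pred_depth, lidar_depth, lambda_depth=0.1):
    """Photometric loss plus an L1 depth term on rays with a valid LiDAR return.
    lambda_depth and the L1 form are assumptions for illustration."""
    rgb_loss = torch.mean((pred_rgb - gt_rgb) ** 2)
    valid = lidar_depth > 0                         # rays that actually hit a LiDAR point
    if valid.any():
        depth_loss = torch.mean(torch.abs(pred_depth[valid] - lidar_depth[valid]))
    else:
        depth_loss = pred_depth.new_zeros(())
    return rgb_loss + lambda_depth * depth_loss
```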

Table 4. Ablation study of Drive-3DAug for FCOS3D on the Waymo validation set. 3DAug means we use DVGO [26] for data augmentation; DS means depth supervision.

Fig. 8. Visualization of the rendered depth map. The model with depth supervision shows better performance.

Reconstruction Cost. Table 5 compares the reconstruction speed of previous methods and our method on an NVIDIA V100 GPU. Unlike the MLP-based NeRF [20], which needs more than 20 hours of training for one background, the voxel-based NeRF with depth supervision takes about 0.5 h. For the object models, the reconstruction time is within minutes. Since our NeRF model is rather small, we can run multiple reconstructions in parallel. Moreover, once these models are trained, they can be recycled for different detectors as digital driving assets.

Table 5. Reconstruction cost of different methods for one background, measured on an NVIDIA V100 GPU with 16 GB memory.

B Discussion

Driving scenes are extremely complicated, covering many object categories under different illumination and weather conditions. Currently, our method only augments a limited set of object classes under good illumination conditions. It is worthwhile to include more situations in the digital driving asset in future work. Besides, our method only alters objects with geometric transformations. More augmentation strategies, such as manipulating the background or changing the appearance of objects, are also worth further exploration.

Fig. 9. Corner cases generated by Drive-3DAug: (a) and (b) illustrate occlusion situations, (c) demonstrates an unusual vehicle heading direction, and (d) shows a vehicle placed on a slope road. Cars with 3D bounding boxes are rendered by NeRF.

Corner Case Generation. Drive-3DAug is able to generate a large amount of photo-realistic data for various corner cases with little effort for autonomous driving systems. As shown in Fig. 9, we use our method to simulate several corner cases, including a car occluded by the environment, a car appearing on the road with an unusual position and heading, and a car on a slope, all of which are hard to collect in the real world. This shows that 3D data augmentation can help alleviate issues in autonomous driving caused by the abundance of corner cases.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Tong, W. et al. (2025). 3D Data Augmentation for Driving Scenes on Camera. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_4

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-8508-7_4

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8507-0

  • Online ISBN: 978-981-97-8508-7

  • eBook Packages: Computer Science, Computer Science (R0)
