Abstract
Driving scenes are so diverse and complicated that it is impossible to collect all cases through human effort alone. While data augmentation is an effective technique for enriching training data, existing methods for camera data in autonomous driving applications are confined to the 2D image plane, which may not optimally increase data diversity in 3D real-world scenarios. To this end, we propose a 3D data augmentation approach, termed Drive-3DAug, that augments camera driving scenes in 3D space. We first utilize Neural Radiance Fields (NeRF) to reconstruct 3D models of background and foreground objects. Augmented driving scenes are then obtained by placing the 3D objects, with adapted location and orientation, in pre-defined valid regions of the backgrounds, so that the training database can be effectively scaled up. However, 3D object modeling is constrained by image quality and limited viewpoints. To overcome these problems, we modify the original NeRF by introducing a geometric rectified loss and a symmetric-aware training strategy. We evaluate our method on the camera-only monocular 3D detection task on the Waymo and nuScenes datasets. The proposed data augmentation approach contributes gains of \(1.7\%\) and \(1.4\%\) in detection accuracy on Waymo and nuScenes, respectively. Furthermore, the constructed 3D models serve as digital driving assets and can be recycled for different detectors or other 3D perception tasks.
References
Brazil, G., Liu, X.: M3d-rpn: monocular 3d region proposal network for object detection. In: ICCV, pp. 9287–9296 (2019)
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: a multimodal dataset for autonomous driving. In: CVPR, pp. 11621–11631 (2020)
Choi, J., Song, Y., Kwak, N.: Part-aware data augmentation for 3d object detection in point cloud. In: IROS, pp. 3391–3397 (2021)
Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation policies from data (2018). arXiv:1805.09501
Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised nerf: fewer views and faster training for free. In: CVPR, pp. 12882–12891 (2022)
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: CoRL, pp. 1–16 (2017)
Fang, J., Zuo, X., Zhou, D., Jin, S., Wang, S., Zhang, L.: Lidar-aug: a general rendering-based augmentation framework for 3d object detection. In: CVPR, pp. 4710–4720 (2021)
Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks. In: CVPR, pp. 5501–5510 (2022)
Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR, pp. 2918–2928 (2021)
Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: CVPR, pp. 2485–2494 (2020)
Hung, W.C., Kretzschmar, H., Casser, V., Hwang, J.J., Anguelov, D.: Let-3d-ap: longitudinal error tolerant 3d average precision for camera-only 3d detection (2022). arXiv:2206.07705
Hung, W.C., Kretzschmar, H., Casser, V., Hwang, J.J., Anguelov, D.: Let-3d-ap: longitudinal error tolerant 3d average precision for camera-only 3d detection (2022). arXiv:2206.07705
Kundu, A., Genova, K., Yin, X., Fathi, A., Pantofaru, C., Guibas, L.J., Tagliasacchi, A., Dellaert, F., Funkhouser, T.A.: Panoptic neural fields: a semantic object-aware neural scene representation. In: CVPR, pp. 12861–12871 (2022)
Li, H., Li, Y., Wang, H., Zeng, J., Xu, H., Cai, P., Chen, L., Yan, J., Xu, F., Xiong, L., Wang, J., Zhu, F., Yan, K., Xu, C., Wang, T., Xia, F., Mu, B., Peng, Z., Lin, D., Qiao, Y.: Open-sourced data ecosystem in autonomous driving: the present and future (2024)
Li, H., Sima, C., Dai, J., Wang, W., Lu, L., Wang, H., Zeng, J., Li, Z., Yang, J., Deng, H., Tian, H., Xie, E., Xie, J., Chen, L., Li, T., Li, Y., Gao, Y., Jia, X., Liu, S., Shi, J., Lin, D., Qiao, Y.: Delving into the devils of bird’s-eye-view perception: a review, evaluation and recipe. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–20 (2023). https://doi.org/10.1109/TPAMI.2023.3333838
Li, P., Zhao, H., Liu, P., Cao, F.: Rtm3d: real-time monocular 3d detection from object keypoints for autonomous driving. In: ECCV, pp. 644–660. Springer (2020)
Li, Z., Li, L., Ma, Z., Zhang, P., Chen, J., Zhu, J.: Read: large-scale neural scene rendering for autonomous driving (2022). arXiv:2205.05509
Lian, Q., Ye, B., Xu, R., Yao, W., Zhang, T.: Exploring geometric consistency for monocular 3d object detection. In: CVPR, pp. 1685–1694 (2022)
Liu, Z., Wu, Z., Tóth, R.: Smoke: single-stage monocular 3d object detection via keypoint estimation. In: CVPRW, pp. 996–997 (2020)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 99–106 (2020)
Müller, N., Simonelli, A., Porzi, L., Bulò, S.R., Nießner, M., Kontschieder, P.: Autorf: learning 3d object radiance fields from single view observations. In: CVPR, pp. 3971–3980 (2022)
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15 (2022). https://doi.org/10.1145/3528223.3530127
Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: deformable neural radiance fields. In: ICCV, pp. 5865–5874 (2021)
Reuse, M., Simon, M., Sick, B.: About the ambiguity of data augmentation for 3d object detection in autonomous driving. In: ICCVW, pp. 979–987 (2021)
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR, pp. 4104–4113 (2016)
Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: CVPR, pp. 5459–5469 (2022)
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR, pp. 2446–2454 (2020)
Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-nerf: scalable large scene neural view synthesis. In: CVPR, pp. 8248–8258 (2022)
Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: fully convolutional one-stage monocular 3d object detection. In: CVPR (2021)
Wang, X., Kong, T., Shen, C., Jiang, Y., Li, L.: Solo: segmenting objects by locations. In: ECCV, pp. 649–665 (2020)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: CoRL, pp. 180–191. PMLR (2022)
Weng, X., Kitani, K.: Monocular 3d object detection with pseudo-lidar point cloud. In: ICCVW (2019)
Yang, J., Gao, S., Qiu, Y., Chen, L., Li, T., Dai, B., Chitta, K., Wu, P., Zeng, J., Luo, P., et al.: Generalized predictive model for autonomous driving (2024). arXiv:2403.09630
Yang, Z., Chen, L., Sun, Y., Li, H.: Visual point cloud forecasting enables scalable autonomous driving (2023). arXiv:2312.17655
Zhang, W., Wang, Z., Loy, C.C.: Exploring data augmentation for multi-modality 3d object detection (2020). arXiv:2012.12741
Zoph, B., Cubuk, E.D., Ghiasi, G., Lin, T.Y., Shlens, J., Le, Q.V.: Learning data augmentation strategies for object detection. In: ECCV, pp. 566–583. Springer (2020)
Appendix
A Experiments
A.1 Implementation Details
Building the Digital Driving Asset. We use SOLOv2 [30] trained on COCO as the instance segmentation model for scene decomposition. We treat vehicle (Waymo) or car (nuScenes) as the foreground objects, since they are the most important components of driving scenes. For 3D model reconstruction, we use the same model configuration as DVGO [26] together with our proposed techniques. We use 30–40 consecutive frames, spanning an area of about 100–200 m, as one background and train each background model for 40,000 iterations. For the background voxel grid, we set the resolution to \(330^3\) with a voxel size of 0.25–0.3 m. The object models are trained with 20–60 consecutive frames, and we set the voxel size to 0.25 m, consistent with the background voxel size; the number of grid points is about 1,000. Considering the construction cost and the limitations of NeRF under extreme illumination conditions, we select a subset of 100 sunny scenes from each dataset for data augmentation.
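For concreteness, the sketch below summarizes the reconstruction settings described above as a configuration dictionary. All field names (e.g., background_nerf, obj voxel settings) are hypothetical and only illustrate how the background and object models are parameterized; this is not an actual Drive-3DAug or DVGO API.

```python
# Hypothetical configuration sketch for building the digital driving asset.
# Values mirror the settings reported above; field names are illustrative.

ASSET_CONFIG = {
    "segmentation": {
        "model": "SOLOv2",                   # instance segmentation for scene decomposition
        "pretrained_on": "COCO",
        "foreground_classes": ["vehicle"],   # "car" on nuScenes
    },
    "background_nerf": {
        "frames_per_scene": (30, 40),        # consecutive frames per background
        "covered_range_m": (100, 200),
        "train_iterations": 40_000,
        "grid_resolution": (330, 330, 330),
        "voxel_size_m": (0.25, 0.30),
    },
    "object_nerf": {
        "frames_per_object": (20, 60),
        "voxel_size_m": 0.25,                # kept consistent with the background
        "approx_grid_points": 1_000,
    },
    "scene_selection": {
        "scenes_per_dataset": 100,
        "weather": "sunny",                  # NeRF struggles under extreme illumination
    },
}
```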
Applying Data Augmentation. Our method generates new data by rendering recomposed scenes from randomly selected 3D models. We manipulate each object in 3D space via location jittering and orientation jittering and place it in the valid region of an arbitrary background. Location jittering is defined by the maximum translations \((T_x, T_y)\) along the x- and y-directions, and orientation jittering is given by the maximum rotation angle \(T_{\theta }\). We consider two data augmentation strategies to validate the effectiveness of adding rotation and translation for 3D perception. The first is \({{\boldsymbol{Drive-3DAug w/o RT}}}\), in which we set the translation \(T_x=0\), \(T_y=0\) and the rotation \(T_{\theta }=0\), and 1–2 new objects are arbitrarily pasted into each background scene on average. The second is \({{\boldsymbol{Drive-3DAug w/ RT}}}\) with \(T_x=20\)m, \(T_y=5\)m, and \(T_{\theta }=30^{\circ }\). To determine the valid region in the background scene, we set the pillar \(Z_p\) resolution to 2m\(\times \)2m, \(\delta _1=30\), and \(\delta _2=15\) in Eq. 6. We generate 12 new images for every background model. This is performed as offline data augmentation, and the generated images are reused by different detectors.
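As a minimal sketch of this step, the functions below sample a jittered object pose within the bounds above and test it against a pillar-based valid-region mask, assuming the mask has already been derived from the background via Eq. 6. The function and variable names (sample_jittered_pose, is_valid_placement, valid_mask) are hypothetical and only illustrate the procedure.

```python
import numpy as np

PILLAR_SIZE_M = 2.0  # pillar resolution of the valid-region mask (2 m x 2 m)

def sample_jittered_pose(x, y, yaw, t_x=20.0, t_y=5.0, t_theta_deg=30.0, rng=None):
    """Jitter a 3D object pose within the configured translation/rotation bounds."""
    rng = rng or np.random.default_rng()
    new_x = x + rng.uniform(-t_x, t_x)                      # longitudinal jitter (m)
    new_y = y + rng.uniform(-t_y, t_y)                      # lateral jitter (m)
    new_yaw = yaw + np.deg2rad(rng.uniform(-t_theta_deg, t_theta_deg))
    return new_x, new_y, new_yaw

def is_valid_placement(x, y, valid_mask, origin_xy=(0.0, 0.0)):
    """Check whether (x, y) falls inside a pillar marked as valid for placement."""
    ix = int((x - origin_xy[0]) / PILLAR_SIZE_M)
    iy = int((y - origin_xy[1]) / PILLAR_SIZE_M)
    inside = 0 <= ix < valid_mask.shape[0] and 0 <= iy < valid_mask.shape[1]
    return inside and bool(valid_mask[ix, iy])
```

In this sketch, a Drive-3DAug w/ RT sample would be rendered only when is_valid_placement returns True for the jittered pose, while Drive-3DAug w/o RT corresponds to setting t_x = t_y = t_theta_deg = 0.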
Training Detectors. Two typical camera-based monocular 3D object detectors, FCOS3D [29] and SMOKE [19], are used to investigate the performance of our proposed method, as they are among the most widely adopted mono3D detectors. We keep the detector hyperparameters identical across all experiments, as detailed in the supplementary material. We use all scenes in the training set to train the detectors and evaluate on the entire validation set, although we sample the training data every 3 frames on Waymo due to limited computational resources. In addition, while our method can be applied to cameras of any view, we only consider images taken by the front camera on Waymo and nuScenes in this work, again due to computational constraints. When using the generated images, we randomly replace an image in a batch with its augmented counterpart if it belongs to a scene in the digital driving asset.
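The snippet below is a hedged sketch of this batch-level replacement logic. AUGMENTED_POOL, replace_with_augmented, and the replacement probability p_replace are assumptions for illustration; they are not names or values reported in the paper.

```python
import random

# Hypothetical mapping from scene id to its offline-rendered augmented images.
AUGMENTED_POOL = {}  # scene_id -> list of augmented image paths

def replace_with_augmented(batch, p_replace=0.5):
    """Randomly swap images whose scene has offline-generated augmented frames."""
    for sample in batch:
        candidates = AUGMENTED_POOL.get(sample["scene_id"])
        if candidates and random.random() < p_replace:
            sample["image_path"] = random.choice(candidates)
            # Assumes the 3D boxes of pasted objects were merged into the labels
            # during offline augmentation, so no extra label handling is done here.
    return batch
```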
A.2 Benchmark Details
Waymo Dataset. The Waymo Open Dataset [27] is a large-scale dataset for autonomous driving that contains 798 training scenes and 202 validation scenes. The image resolution of the front camera is \(1920 \times 1280\). Waymo uses LET-AP, the average precision with longitudinal error tolerance, to evaluate detection models. It also adopts the LET-APL and LET-APH metrics, which weight LET-AP by longitudinal affinity and heading accuracy, respectively.
nuScenes Dataset. nuScenes [2] is a widely used benchmark for 3D object detection. It contains 700 training scenes and 150 validation scenes. The resolution of each image is 1600 \(\times \) 900. For metrics, nuScenes computes mAP using the center distance on the ground plane to match predicted boxes with the ground truth. It also defines several true positive (TP) metrics; we use ATE, ASE, and AOE in this paper to measure translation, scale, and orientation errors, respectively.
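To make the matching criterion concrete, the snippet below sketches center-distance matching on the ground plane for a single distance threshold (the official nuScenes protocol averages AP over thresholds of 0.5, 1, 2, and 4 m). It is a simplified illustration with hypothetical function names, not the official devkit implementation.

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, gt_centers, thresh_m=2.0):
    """Greedily match predictions to ground truth by BEV (x, y) center distance."""
    pred_centers = np.asarray(pred_centers, dtype=float)
    gt_centers = np.asarray(gt_centers, dtype=float)
    order = np.argsort(-np.asarray(pred_scores))       # highest-scoring predictions first
    unmatched_gt = set(range(len(gt_centers)))
    tp = fp = 0
    for i in order:
        best, dists = None, {}
        if unmatched_gt:
            dists = {j: np.linalg.norm(pred_centers[i, :2] - gt_centers[j, :2])
                     for j in unmatched_gt}
            best = min(dists, key=dists.get)
        if best is not None and dists[best] <= thresh_m:
            tp += 1
            unmatched_gt.discard(best)
        else:
            fp += 1
    fn = len(gt_centers) - tp
    return tp, fp, fn
```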
A.3 Ablative Studies
Depth Supervision. We qualitatively and quantitatively investigate the effect of depth supervision on background model training and 3D augmentation. Table 4 shows that 3D augmentation based on background models trained with depth supervision performs better, achieving a higher LET-AP (0.590 vs. 0.585). Figure 8 shows that NeRF reconstructs the background with high quality given depth supervision, whereas the quality of the 3D background model degrades without depth information. Accordingly, LET-AP, LET-APH, and LET-APL on the car class decrease slightly, by 0.001, when the augmentation uses background models trained without depth supervision.
Reconstruction Cost. Table 5 compares the reconstruction speed of previous methods and ours on an NVIDIA V100 GPU. Unlike the MLP-based NeRF [20], which needs more than 20 hours of training for one background, our voxel-based NeRF with depth supervision takes about 0.5 h. For the object models, the reconstruction time is within minutes. Since our NeRF model is rather small, multiple reconstructions can be run in parallel. Moreover, once these models are trained, they can be recycled by different detectors as digital driving assets.
B Discussion
Driving scenes are extremely complicated, involving many object categories under different illumination and weather conditions. Currently, our method only augments a limited set of object classes under good illumination conditions. Including more situations in the digital driving asset is worthwhile future work. In addition, our method only alters objects with geometric transformations; further augmentation strategies, such as manipulating the background or changing the appearance of objects, are also worth exploring.
Corner Case Generation. Drive-3DAug can generate abundant photo-realistic data for various corner cases with little effort for autonomous driving systems. As shown in Fig. 9, we use our method to simulate several corner cases, including a car occluded by the environment, cars appearing on the road with unusual positions and headings, and a car on a slope, all of which are hard to collect in the real world. This demonstrates that 3D data augmentation can help alleviate the issues of autonomous driving caused by the abundance of corner cases.