Abstract
Driving scenes are so diverse and complicated that it is impossible to collect all cases through human effort alone. While data augmentation is an effective technique for enriching training data, existing methods for camera data in autonomous driving applications are confined to the 2D image plane, which may not optimally increase data diversity in 3D real-world scenarios. To this end, we propose a 3D data augmentation approach, termed Drive-3DAug, that augments camera driving scenes in 3D space. We first utilize Neural Radiance Fields (NeRF) to reconstruct 3D models of background and foreground objects. Augmented driving scenes are then obtained by placing the 3D objects, with adapted location and orientation, in pre-defined valid regions of the backgrounds, so that the training database can be effectively scaled up. However, 3D object modeling is constrained by image quality and limited viewpoints. To overcome these problems, we modify the original NeRF by introducing a geometric rectified loss and a symmetric-aware training strategy. We evaluate our method on the camera-only monocular 3D detection task on the Waymo and nuScenes datasets. The proposed data augmentation approach contributes gains of \(1.7\%\) and \(1.4\%\) in detection accuracy on Waymo and nuScenes, respectively. Furthermore, the constructed 3D models serve as digital driving assets and can be recycled for different detectors or other 3D perception tasks.
References
Brazil, G., Liu, X.: M3d-rpn: monocular 3d region proposal network for object detection. In: ICCV, pp. 9287–9296 (2019)
Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: a multimodal dataset for autonomous driving. In: CVPR, pp. 11621–11631 (2020)
Choi, J., Song, Y., Kwak, N.: Part-aware data augmentation for 3d object detection in point cloud. In: IROS, pp. 3391–3397 (2021)
Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation policies from data (2018). arXiv:1805.09501
Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised nerf: fewer views and faster training for free. In: CVPR, pp. 12882–12891 (2022)
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: CoRL, pp. 1–16 (2017)
Fang, J., Zuo, X., Zhou, D., Jin, S., Wang, S., Zhang, L.: Lidar-aug: a general rendering-based augmentation framework for 3d object detection. In: CVPR, pp. 4710–4720 (2021)
Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks. In: CVPR, pp. 5501–5510 (2022)
Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR, pp. 2918–2928 (2021)
Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: CVPR, pp. 2485–2494 (2020)
Hung, W.C., Kretzschmar, H., Casser, V., Hwang, J.J., Anguelov, D.: Let-3d-ap: longitudinal error tolerant 3d average precision for camera-only 3d detection (2022). arXiv:2206.07705
Hung, W.C., Kretzschmar, H., Casser, V., Hwang, J.J., Anguelov, D.: Let-3d-ap: longitudinal error tolerant 3d average precision for camera-only 3d detection (2022). arXiv:2206.07705
Kundu, A., Genova, K., Yin, X., Fathi, A., Pantofaru, C., Guibas, L.J., Tagliasacchi, A., Dellaert, F., Funkhouser, T.A.: Panoptic neural fields: a semantic object-aware neural scene representation. In: CVPR, pp. 12861–12871 (2022)
Li, H., Li, Y., Wang, H., Zeng, J., Xu, H., Cai, P., Chen, L., Yan, J., Xu, F., Xiong, L., Wang, J., Zhu, F., Yan, K., Xu, C., Wang, T., Xia, F., Mu, B., Peng, Z., Lin, D., Qiao, Y.: Open-sourced data ecosystem in autonomous driving: the present and future (2024)
Li, H., Sima, C., Dai, J., Wang, W., Lu, L., Wang, H., Zeng, J., Li, Z., Yang, J., Deng, H., Tian, H., Xie, E., Xie, J., Chen, L., Li, T., Li, Y., Gao, Y., Jia, X., Liu, S., Shi, J., Lin, D., Qiao, Y.: Delving into the devils of bird’s-eye-view perception: a review, evaluation and recipe. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–20 (2023). https://doi.org/10.1109/TPAMI.2023.3333838
Li, P., Zhao, H., Liu, P., Cao, F.: Rtm3d: real-time monocular 3d detection from object keypoints for autonomous driving. In: ECCV, pp. 644–660. Springer (2020)
Li, Z., Li, L., Ma, Z., Zhang, P., Chen, J., Zhu, J.: Read: large-scale neural scene rendering for autonomous driving (2022). arXiv:2205.05509
Lian, Q., Ye, B., Xu, R., Yao, W., Zhang, T.: Exploring geometric consistency for monocular 3d object detection. In: CVPR, pp. 1685–1694 (2022)
Liu, Z., Wu, Z., Tóth, R.: Smoke: single-stage monocular 3d object detection via keypoint estimation. In: CVPRW, pp. 996–997 (2020)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 99–106 (2020)
Müller, N., Simonelli, A., Porzi, L., Bulò, S.R., Nießner, M., Kontschieder, P.: Autorf: learning 3d object radiance fields from single view observations. In: CVPR, pp. 3971–3980 (2022)
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15 (2022). https://doi.org/10.1145/3528223.3530127
Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: deformable neural radiance fields. In: ICCV, pp. 5865–5874 (2021)
Reuse, M., Simon, M., Sick, B.: About the ambiguity of data augmentation for 3d object detection in autonomous driving. In: ICCVW, pp. 979–987 (2021)
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR, pp. 4104–4113 (2016)
Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: CVPR, pp. 5459–5469 (2022)
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR, pp. 2446–2454 (2020)
Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-nerf: scalable large scene neural view synthesis. In: CVPR, pp. 8248–8258 (2022)
Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: fully convolutional one-stage monocular 3d object detection. In: CVPR (2021)
Wang, X., Kong, T., Shen, C., Jiang, Y., Li, L.: Solo: segmenting objects by locations. In: ECCV, pp. 649–665 (2020)
Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: CoRL, pp. 180–191. PMLR (2022)
Weng, X., Kitani, K.: Monocular 3d object detection with pseudo-lidar point cloud. In: ICCVW (2019)
Yang, J., Gao, S., Qiu, Y., Chen, L., Li, T., Dai, B., Chitta, K., Wu, P., Zeng, J., Luo, P., et al.: Generalized predictive model for autonomous driving (2024). arXiv:2403.09630
Yang, Z., Chen, L., Sun, Y., Li, H.: Visual point cloud forecasting enables scalable autonomous driving (2023). arXiv:2312.17655
Zhang, W., Wang, Z., Loy, C.C.: Exploring data augmentation for multi-modality 3d object detection (2020). arXiv:2012.12741
Zoph, B., Cubuk, E.D., Ghiasi, G., Lin, T.Y., Shlens, J., Le, Q.V.: Learning data augmentation strategies for object detection. In: ECCV, pp. 566–583. Springer (2020)
Appendix
A Experiments
A.1 Implementation Details
Building the Digital Driving Asset. We use SOLOv2 [30] trained on COCO as the instance segmentation model for scene decomposition. We treat vehicle (Waymo) or car (nuScenes) as the foreground objects, since they are the most important components of driving scenes. For 3D model reconstruction, we use the same model configuration as DVGO [26] together with our proposed techniques. We use 30–40 consecutive frames, spanning an area of about 100–200 m, as one background and train each background model for 40,000 iterations. For the background voxel grid, we set the resolution to \(330^3\) with a voxel size of 0.25–0.3 m. The object models are trained with 20–60 consecutive frames, and we set the voxel size to 0.25 m, consistent with the background voxel size; the number of grid points is about 1,000. Considering the construction cost and the limitations of NeRF under extreme illumination conditions, we select a subset of 100 sunny scenes from each dataset for data augmentation.
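For concreteness, the sketch below summarizes the reconstruction settings described above as a configuration dictionary. All field names (e.g., background_nerf, obj voxel settings) are hypothetical and only illustrate how the background and object models are parameterized; this is not an actual Drive-3DAug or DVGO API.

```python
# Hypothetical configuration sketch for building the digital driving asset.
# Values mirror the settings reported above; field names are illustrative.

ASSET_CONFIG = {
    "segmentation": {
        "model": "SOLOv2",                   # instance segmentation for scene decomposition
        "pretrained_on": "COCO",
        "foreground_classes": ["vehicle"],   # "car" on nuScenes
    },
    "background_nerf": {
        "frames_per_scene": (30, 40),        # consecutive frames per background
        "covered_range_m": (100, 200),
        "train_iterations": 40_000,
        "grid_resolution": (330, 330, 330),
        "voxel_size_m": (0.25, 0.30),
    },
    "object_nerf": {
        "frames_per_object": (20, 60),
        "voxel_size_m": 0.25,                # kept consistent with the background
        "approx_grid_points": 1_000,
    },
    "scene_selection": {
        "scenes_per_dataset": 100,
        "weather": "sunny",                  # NeRF struggles under extreme illumination
    },
}
```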
Applying Data Augmentation. Our method generates new data by rendering recomposed scenes from randomly selected 3D models. We manipulate each object in 3D space via location jittering and orientation jittering and place it in the valid region of an arbitrary background. Location jittering is defined by the maximum translations \((T_x, T_y)\) along the x- and y-directions, and orientation jittering is given by the maximum rotation angle \(T_{\theta }\). We consider two data augmentation strategies to validate the effectiveness of adding rotation and translation for 3D perception. The first is \({{\boldsymbol{Drive-3DAug w/o RT}}}\), in which we set the translation \(T_x=0\), \(T_y=0\) and the rotation \(T_{\theta }=0\), and 1–2 new objects are arbitrarily pasted into each background scene on average. The second is \({{\boldsymbol{Drive-3DAug w/ RT}}}\) with \(T_x=20\)m, \(T_y=5\)m, and \(T_{\theta }=30^{\circ }\). To determine the valid region in the background scene, we set the pillar \(Z_p\) resolution to 2m\(\times \)2m, \(\delta _1=30\), and \(\delta _2=15\) in Eq. 6. We generate 12 new images for every background model. This is performed as offline data augmentation, and the generated images are reused by different detectors.
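As a minimal sketch of this step, the functions below sample a jittered object pose within the bounds above and test it against a pillar-based valid-region mask, assuming the mask has already been derived from the background via Eq. 6. The function and variable names (sample_jittered_pose, is_valid_placement, valid_mask) are hypothetical and only illustrate the procedure.

```python
import numpy as np

PILLAR_SIZE_M = 2.0  # pillar resolution of the valid-region mask (2 m x 2 m)

def sample_jittered_pose(x, y, yaw, t_x=20.0, t_y=5.0, t_theta_deg=30.0, rng=None):
    """Jitter a 3D object pose within the configured translation/rotation bounds."""
    rng = rng or np.random.default_rng()
    new_x = x + rng.uniform(-t_x, t_x)                      # longitudinal jitter (m)
    new_y = y + rng.uniform(-t_y, t_y)                      # lateral jitter (m)
    new_yaw = yaw + np.deg2rad(rng.uniform(-t_theta_deg, t_theta_deg))
    return new_x, new_y, new_yaw

def is_valid_placement(x, y, valid_mask, origin_xy=(0.0, 0.0)):
    """Check whether (x, y) falls inside a pillar marked as valid for placement."""
    ix = int((x - origin_xy[0]) / PILLAR_SIZE_M)
    iy = int((y - origin_xy[1]) / PILLAR_SIZE_M)
    inside = 0 <= ix < valid_mask.shape[0] and 0 <= iy < valid_mask.shape[1]
    return inside and bool(valid_mask[ix, iy])
```

In this sketch, a Drive-3DAug w/ RT sample would be rendered only when is_valid_placement returns True for the jittered pose, while Drive-3DAug w/o RT corresponds to setting t_x = t_y = t_theta_deg = 0.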
Training Detectors. Two typical camera-based monocular 3D object detectors, FCOS3D [29] and SMOKE [19], are used to investigate the performance of our proposed method, as they are among the most widely adopted mono3D detectors. We keep the detector hyperparameters identical across all experiments, as detailed in the supplementary material. We use all scenes in the training set to train the detectors and evaluate on the entire validation set, although we sample the training data every 3 frames on Waymo due to limited computational resources. In addition, while our method can be applied to cameras of any view, we only consider images taken by the front camera on Waymo and nuScenes in this work, again due to computational constraints. When using the generated images, we randomly replace an image in a batch with its augmented counterpart if it belongs to a scene in the digital driving asset.
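The snippet below is a hedged sketch of this batch-level replacement logic. AUGMENTED_POOL, replace_with_augmented, and the replacement probability p_replace are assumptions for illustration; they are not names or values reported in the paper.

```python
import random

# Hypothetical mapping from scene id to its offline-rendered augmented images.
AUGMENTED_POOL = {}  # scene_id -> list of augmented image paths

def replace_with_augmented(batch, p_replace=0.5):
    """Randomly swap images whose scene has offline-generated augmented frames."""
    for sample in batch:
        candidates = AUGMENTED_POOL.get(sample["scene_id"])
        if candidates and random.random() < p_replace:
            sample["image_path"] = random.choice(candidates)
            # Assumes the 3D boxes of pasted objects were merged into the labels
            # during offline augmentation, so no extra label handling is done here.
    return batch
```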
A.2 Benchmark Details
Waymo Dataset. The Waymo Open Dataset [27] is a large-scale dataset for autonomous driving that contains 798 training scenes and 202 validation scenes. The image resolution of the front camera is \(1920 \times 1280\). Waymo uses LET-AP, the average precision with longitudinal error tolerance, to evaluate detection models. It also adopts the LET-APL and LET-APH metrics, which weight LET-AP by longitudinal affinity and heading accuracy, respectively.
nuScenes Dataset. nuScenes [2] is a widely used benchmark for 3D object detection. It contains 700 training scenes and 150 validation scenes. The resolution of each image is 1600 \(\times \) 900. For metrics, nuScenes computes mAP using the center distance on the ground plane to match predicted boxes with the ground truth. It also defines several true positive (TP) metrics; we use ATE, ASE, and AOE in this paper to measure translation, scale, and orientation errors, respectively.
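To make the matching criterion concrete, the snippet below sketches center-distance matching on the ground plane for a single distance threshold (the official nuScenes protocol averages AP over thresholds of 0.5, 1, 2, and 4 m). It is a simplified illustration with hypothetical function names, not the official devkit implementation.

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, gt_centers, thresh_m=2.0):
    """Greedily match predictions to ground truth by BEV (x, y) center distance."""
    pred_centers = np.asarray(pred_centers, dtype=float)
    gt_centers = np.asarray(gt_centers, dtype=float)
    order = np.argsort(-np.asarray(pred_scores))       # highest-scoring predictions first
    unmatched_gt = set(range(len(gt_centers)))
    tp = fp = 0
    for i in order:
        best, dists = None, {}
        if unmatched_gt:
            dists = {j: np.linalg.norm(pred_centers[i, :2] - gt_centers[j, :2])
                     for j in unmatched_gt}
            best = min(dists, key=dists.get)
        if best is not None and dists[best] <= thresh_m:
            tp += 1
            unmatched_gt.discard(best)
        else:
            fp += 1
    fn = len(gt_centers) - tp
    return tp, fp, fn
```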
A.3 Ablative Studies
Depth Supervision. We qualitatively and quantitatively investigate the effect of depth supervision on background model training and 3D augmentation. Table 4 shows that 3D augmentation based on background models trained with depth supervision performs better, achieving a higher LET-AP (0.590 vs. 0.585). Figure 8 shows that NeRF reconstructs the background with high quality given depth supervision, whereas the quality of the 3D background model degrades without depth information. Accordingly, LET-AP, LET-APH, and LET-APL on the car class decrease slightly, by 0.001, when the augmentation uses background models trained without depth supervision.
Reconstruction Cost. Table 5 compares the reconstruction speed of previous methods and ours on an NVIDIA V100 GPU. Unlike the MLP-based NeRF [20], which needs more than 20 hours of training for one background, our voxel-based NeRF with depth supervision takes about 0.5 h. For the object models, the reconstruction time is within minutes. Since our NeRF model is rather small, multiple reconstructions can be run in parallel. Moreover, once these models are trained, they can be recycled by different detectors as digital driving assets.
B Discussion
Driving scenes are extremely complicated, involving many object categories under different illumination and weather conditions. Currently, our method only augments a limited set of object classes under good illumination conditions. Including more situations in the digital driving asset is worthwhile future work. In addition, our method only alters objects with geometric transformations; further augmentation strategies, such as manipulating the background or changing the appearance of objects, are also worth exploring.
Corner Case Generation. Drive-3DAug can generate abundant photo-realistic data for various corner cases with little effort for autonomous driving systems. As shown in Fig. 9, we use our method to simulate several corner cases, including a car occluded by the environment, cars appearing on the road with unusual positions and headings, and a car on a slope, all of which are hard to collect in the real world. This demonstrates that 3D data augmentation can help alleviate the issues of autonomous driving caused by the abundance of corner cases.