DOI: 10.1007/978-3-031-73030-6_25
Article

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

Published: 24 November 2024

Abstract

The correct insertion of virtual objects into images of real-world scenes requires a deep understanding of the scene’s lighting, geometry, and materials, as well as of the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently “understand” the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance for a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects into single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic material and tone-mapping refinement.
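
The abstract describes an optimization in which a differentiable, physically based renderer recovers lighting and tone-mapping parameters under a loss supplied by a personalized diffusion model. The sketch below illustrates only that optimization pattern; it is a minimal PyTorch example, not the authors’ pipeline. The toy Lambertian renderer, the L2 stand-in for the diffusion guidance term, and all function and variable names (render, tonemap, guidance_loss, and so on) are assumptions introduced for illustration.

```python
# Minimal sketch of diffusion-guided inverse rendering (illustrative only).
# The paper's pipeline uses a physically based path tracer and a personalized
# diffusion model; a toy Lambertian renderer and an L2 loss stand in for both
# here, so that the optimization structure can be shown end to end.
import torch

def render(normals, albedo, light_dir, light_rgb, ambient):
    """Toy differentiable renderer: Lambertian shading under a single
    directional light plus an ambient term (stand-in for a path tracer)."""
    d = light_dir / light_dir.norm()
    ndotl = (normals * d).sum(-1, keepdim=True).clamp(min=0.0)  # H x W x 1
    return albedo * (ndotl * light_rgb + ambient)               # H x W x 3

def tonemap(hdr, log_exposure, gamma):
    """Differentiable tone mapping with learnable exposure and gamma."""
    return (hdr * log_exposure.exp()).clamp(min=1e-6) ** (1.0 / gamma)

def guidance_loss(image, target):
    """Placeholder for the diffusion guidance term. In the paper's setting
    this gradient would come from a personalized diffusion model; an L2
    loss against a fixed target image stands in here."""
    return ((image - target) ** 2).mean()

H = W = 64
normals = torch.zeros(H, W, 3); normals[..., 2] = 1.0  # flat scene facing camera
albedo  = torch.full((H, W, 3), 0.7)                   # known/estimated material
target  = torch.rand(H, W, 3)                          # stand-in "guided" image

# Scene parameters recovered by optimization: lighting and tone mapping.
light_dir    = torch.tensor([0.3, 0.5, 0.8], requires_grad=True)
light_rgb    = torch.ones(3, requires_grad=True)
ambient      = torch.tensor(0.1, requires_grad=True)
log_exposure = torch.zeros(1, requires_grad=True)
gamma        = torch.tensor(2.2, requires_grad=True)

opt = torch.optim.Adam([light_dir, light_rgb, ambient, log_exposure, gamma], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    hdr = render(normals, albedo, light_dir, light_rgb, ambient)
    ldr = tonemap(hdr, log_exposure, gamma)
    loss = guidance_loss(ldr, target)
    loss.backward()
    opt.step()
```

In the paper’s setting, the gradient of the guidance term would instead come from the denoising residual of the personalized diffusion model, backpropagated through the rendered composite, and the renderer would be a differentiable path tracer; the same differentiable structure is what enables the automatic material and tone-mapping refinement mentioned in the abstract.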



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXI
Sep 2024
590 pages
ISBN: 978-3-031-73029-0
DOI: 10.1007/978-3-031-73030-6
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Inverse rendering
  2. Diffusion models
  3. Personalization
  4. Virtual object insertion
  5. Physically based rendering

