DOI: 10.1007/978-3-031-73030-6_25
Article

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

Published: 24 November 2024

Abstract

The correct insertion of virtual objects into images of real-world scenes requires a deep understanding of the scene’s lighting, geometry, and materials, as well as of the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently “understand” the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance for a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects into single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic material and tone-mapping refinement.
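
The abstract describes an optimization in which a differentiable, physically based renderer recovers lighting and tone-mapping parameters under a loss supplied by a personalized diffusion model. The sketch below illustrates only that optimization pattern; it is a minimal PyTorch example, not the authors’ pipeline. The toy Lambertian renderer, the L2 stand-in for the diffusion guidance term, and all function and variable names (render, tonemap, guidance_loss, and so on) are assumptions introduced for illustration.

```python
# Minimal sketch of diffusion-guided inverse rendering (illustrative only).
# The paper's pipeline uses a physically based path tracer and a personalized
# diffusion model; a toy Lambertian renderer and an L2 loss stand in for both
# here, so that the optimization structure can be shown end to end.
import torch

def render(normals, albedo, light_dir, light_rgb, ambient):
    """Toy differentiable renderer: Lambertian shading under a single
    directional light plus an ambient term (stand-in for a path tracer)."""
    d = light_dir / light_dir.norm()
    ndotl = (normals * d).sum(-1, keepdim=True).clamp(min=0.0)  # H x W x 1
    return albedo * (ndotl * light_rgb + ambient)               # H x W x 3

def tonemap(hdr, log_exposure, gamma):
    """Differentiable tone mapping with learnable exposure and gamma."""
    return (hdr * log_exposure.exp()).clamp(min=1e-6) ** (1.0 / gamma)

def guidance_loss(image, target):
    """Placeholder for the diffusion guidance term. In the paper's setting
    this gradient would come from a personalized diffusion model; an L2
    loss against a fixed target image stands in here."""
    return ((image - target) ** 2).mean()

H = W = 64
normals = torch.zeros(H, W, 3); normals[..., 2] = 1.0  # flat scene facing camera
albedo  = torch.full((H, W, 3), 0.7)                   # known/estimated material
target  = torch.rand(H, W, 3)                          # stand-in "guided" image

# Scene parameters recovered by optimization: lighting and tone mapping.
light_dir    = torch.tensor([0.3, 0.5, 0.8], requires_grad=True)
light_rgb    = torch.ones(3, requires_grad=True)
ambient      = torch.tensor(0.1, requires_grad=True)
log_exposure = torch.zeros(1, requires_grad=True)
gamma        = torch.tensor(2.2, requires_grad=True)

opt = torch.optim.Adam([light_dir, light_rgb, ambient, log_exposure, gamma], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    hdr = render(normals, albedo, light_dir, light_rgb, ambient)
    ldr = tonemap(hdr, log_exposure, gamma)
    loss = guidance_loss(ldr, target)
    loss.backward()
    opt.step()
```

In the paper’s setting, the gradient of the guidance term would instead come from the denoising residual of the personalized diffusion model, backpropagated through the rendered composite, and the renderer would be a differentiable path tracer; the same differentiable structure is what enables the automatic material and tone-mapping refinement mentioned in the abstract.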



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXI
Sep 2024
590 pages
ISBN: 978-3-031-73029-0
DOI: 10.1007/978-3-031-73030-6
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. Inverse rendering
  2. Diffusion models
  3. Personalization
  4. Virtual object insertion
  5. Physically based rendering

