Abstract
While neural fields have made significant strides in view synthesis and scene reconstruction, editing them poses a formidable challenge because they implicitly encode geometry and texture from multi-view inputs. In this paper, we introduce LatentEditor, a framework that empowers users to perform precise, locally controlled editing of neural fields using text prompts. Leveraging denoising diffusion models, we embed real-world scenes into the latent space, yielding a faster and more adaptable NeRF backbone for editing than traditional approaches. To enhance editing precision, we introduce a delta score that computes a 2D mask in the latent space to guide local modifications while preserving irrelevant regions. Our pixel-level scoring approach harnesses InstructPix2Pix (IP2P) to measure the disparity between IP2P conditional and unconditional noise predictions in the latent space. The edited latents, conditioned on the 2D masks, are then iteratively updated in the training set to achieve 3D local editing. Our approach achieves faster editing and higher output quality than existing 3D editing models, bridging the gap between textual instructions and high-quality 3D scene editing in latent space. We demonstrate the superiority of our approach on four benchmark 3D datasets: LLFF [26], IN2N [8], NeRFStudio [44], and NeRF-Art [47]. Project Page: https://latenteditor.github.io/.
U. Khalid, H. Iqbal, and N. Karim contributed equally.
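To make the delta-score masking step concrete, below is a minimal sketch of one plausible realization: the gap between IP2P's conditional and unconditional noise predictions is averaged over a few diffusion timesteps, normalized, and thresholded into a binary 2D mask over the latent grid. It assumes a diffusers-style IP2P UNet (8-channel input: noisy latent concatenated with the source-image latent) and scheduler; the function name, arguments, timesteps, and threshold `tau` are illustrative assumptions, not the paper's released code.

```python
import torch

@torch.no_grad()
def delta_score_mask(unet, z0, img_latent, cond_emb, uncond_emb,
                     scheduler, timesteps=(200, 400, 600), tau=0.6):
    """Hypothetical sketch of the delta score described in the abstract:
    the disparity between IP2P conditional and unconditional noise
    predictions is thresholded into a latent-space editing mask."""
    deltas = []
    for t in timesteps:
        t_batch = torch.tensor([t], device=z0.device)
        # Noise the clean latent z0 to timestep t.
        zt = scheduler.add_noise(z0, torch.randn_like(z0), t_batch)
        # IP2P conditions the UNet on the source-image latent via channel concat.
        unet_in = torch.cat([zt, img_latent], dim=1)
        # Conditional vs. unconditional (null-text) noise predictions.
        eps_c = unet(unet_in, t_batch, encoder_hidden_states=cond_emb).sample
        eps_u = unet(unet_in, t_batch, encoder_hidden_states=uncond_emb).sample
        # Pixel-level disparity, averaged over latent channels.
        deltas.append((eps_c - eps_u).abs().mean(dim=1, keepdim=True))
    delta = torch.stack(deltas, dim=0).mean(dim=0)
    delta = (delta - delta.min()) / (delta.max() - delta.min() + 1e-8)
    return (delta > tau).float()  # 1 = edit region, 0 = preserved region
```

In this reading, the mask would then gate the blend between edited and original latents before each training-set update, so only the prompted region of the scene changes.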
References
Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208–18218 (2022)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)
Chen, J., et al.: Animatable neural radiance fields from monocular RGB videos. arXiv preprint arXiv:2106.13629 (2021)
Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: DiffEdit: diffusion-based semantic image editing with mask guidance. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=3lge0p5o-M-
Frakes, E., Khalid, U., Chen, C.: Efficient and consistent zero-shot video generation with diffusion models. In: Kehtarnavaz, N., Shirvaikar, M.V. (eds.) Real-Time Image Processing and Deep Learning 2024, vol. 13034, p. 1303407. International Society for Optics and Photonics, SPIE (2024). https://doi.org/10.1117/12.3013575
Gordon, O., Avrahami, O., Lischinski, D.: Blended-NeRF: zero-shot object generation and blending in existing neural radiance fields. arXiv preprint arXiv:2306.12760 (2023)
Gu, J., Liu, L., Wang, P., Theobalt, C.: StyleNeRF: a style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985 (2021)
Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19740–19750 (2023)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851 (2020)
Iqbal, H., Khalid, U., Chen, C., Hua, J.: Unsupervised anomaly detection in medical images using masked diffusion model. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds.) MLMI 2023. LNCS, vol. 14348, pp. 372–381. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-45673-2_37
Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 867–876 (2022)
Karim, N., Khalid, U., Iqbal, H., Hua, J., Chen, C.: Free-editor: zero-shot text-driven 3D scene editing. arXiv preprint arXiv:2312.13663 (2023)
Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6007–6017 (2023)
Kim, H., Lee, G., Choi, Y., Kim, J.H., Zhu, J.Y.: 3D-aware blending with generative NeRFs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22906–22918 (2023)
Krishnamoorthy, A., Menon, D.: Matrix inversion using Cholesky decomposition. In: 2013 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pp. 70–72. IEEE (2013)
Kuang, Z., Luan, F., Bi, S., Shu, Z., Wetzstein, G., Sunkavalli, K.: PaletteNeRF: palette-based appearance editing of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20691–20700 (2023)
Li, L.H., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975 (2022)
Liu, H.K., Shen, I., Chen, B.Y., et al.: NeRF-In: free-form NeRF inpainting with RGB-D priors. arXiv preprint arXiv:2206.04901 (2022)
Liu, L., Gu, J., Zaw Lin, K., et al.: Neural sparse voxel fields. In: Advances in Neural Information Processing Systems, vol. 33, pp. 15651–15663 (2020)
Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: zero-shot one image to 3D object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9298–9309 (2023)
Liu, S., Zhang, X., Zhang, Z., Zhang, R., Zhu, J.Y., Russell, B.: Editing conditional radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5773–5783 (2021)
Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2021)
Metzer, G., Richardson, E., Patashnik, O., Giryes, R., Cohen-Or, D.: Latent-NeRF for shape-guided generation of 3D shapes and textures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12663–12673 (2023)
Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: SKED: sketch-guided text-based 3D editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14607–14619 (2023)
Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph. (TOG) 38(4), 1–14 (2019)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65(1), 99–106 (2021)
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (ToG) 41(4), 1–15 (2022)
Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
Park, H.S., Jun, C.H.: A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36(2), 3336–3341 (2009)
Park, K., Henzler, P., Mildenhall, B., Barron, J.T., Martin-Brualla, R.: CamP: camera preconditioning for neural radiance fields. ACM Trans. Graph. (TOG) 42(6), 1–11 (2023)
Ponimatkin, G., Labbé, Y., Russell, B., Aubry, M., Sivic, J.: Focal length and object pose estimation via render and compare. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3825–3834 (2022)
Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: text-to-3D using 2D diffusion. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=FjNys5c7VyY
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Raj, A., et al.: DreamBooth3D: subject-driven text-to-3D generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2349–2359 (2023)
Rajič, F., Ke, L., Tai, Y.W., Tang, C.K., Danelljan, M., Yu, F.: Segment anything meets point tracking. arXiv preprint arXiv:2307.01197 (2023)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Saharia, C., Chan, W., Saxena, S., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494 (2022)
Sella, E., Fiebelman, G., Hedman, P., Averbuch-Elor, H.: Vox-E: text-guided voxel editing of 3D objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 430–440 (2023)
Shao, R., et al.: Control4D: efficient 4D portrait editing with text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4556–4567 (2024)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Suvorov, R., et al.: Resolution-robust large mask inpainting with Fourier convolutions. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2149–2159 (2022)
Tancik, M., et al.: Nerfstudio: a modular framework for neural radiance field development. In: ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–12 (2023)
Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Wang, C., Chai, M., He, M., et al.: CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3835–3844 (2022)
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: NeRF-Art: text-driven neural radiance fields stylization. IEEE Trans. Vis. Comput. Graph. (2023)
Wang, C., Wu, X., Guo, Y.C., et al.: NeRF-SR: high quality neural radiance fields using supersampling. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 6445–6454 (2022)
Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: NeuS: learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689 (2021)
Xiang, F., Xu, Z., Hasan, M., et al.: NeuTex: neural texture mapping for volumetric neural rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7119–7128 (2021)
Yang, B., Bao, C., Zeng, J., et al.: NeuMesh: learning disentangled neural mesh-based implicit field for geometry and texture editing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13676, pp. 597–614. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19787-1_34
Ye, M., Danelljan, M., Yu, F., Ke, L.: Gaussian grouping: segment and edit anything in 3D scenes. arXiv preprint arXiv:2312.00732 (2023)
Zhang, K., et al.: ARF: artistic radiance fields. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13691, pp. 717–733. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_41
Zhang, K., Riegler, G., Snavely, N., Koltun, V.: NeRF++: analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492 (2020)
Zhuang, J., Wang, C., Lin, L., Liu, L., Li, G.: DreamEditor: text-driven 3D scene editing with neural fields. In: SIGGRAPH Asia 2023 Conference Papers, pp. 1–10 (2023)
Acknowledgement
This work was partially supported by the NSF under Grant Numbers OAC-1910469 and OAC-2311245.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Khalid, U., Iqbal, H., Karim, N., Tayyab, M., Hua, J., Chen, C. (2025). LatentEditor: Text Driven Local Editing of 3D Scenes. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15122. Springer, Cham. https://doi.org/10.1007/978-3-031-73039-9_21
Print ISBN: 978-3-031-73038-2
Online ISBN: 978-3-031-73039-9