SIGGRAPH Asia Conference Proceedings · Research article
DOI: 10.1145/3550469.3555392

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

Published: 30 November 2022

Abstract

We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision, our method deforms the control shape of a limit subdivision surface, along with its texture map and normal map, to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed into games or modeling applications. We rely only on a pretrained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models, we optimize mesh parameters directly to generate shape, texture, or both. To constrain the optimization toward plausible meshes and textures, we introduce a number of techniques, including image augmentations and a pretrained prior that generates CLIP image embeddings given a text embedding.
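
To make the optimization loop concrete, below is a minimal Python sketch of CLIP-guided optimization of mesh parameters as the abstract describes it. This is an illustration under stated assumptions, not the authors' implementation: render_views is a placeholder for a real differentiable renderer (the actual method rasterizes the limit subdivision surface with a differentiable rasterizer such as nvdiffrast), the prompt and hyperparameters are arbitrary, and the augmentations and text-to-image-embedding prior are reduced to comments.

# Minimal sketch of CLIP-guided mesh optimization (illustrative only).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.float().requires_grad_(False)  # fp32, frozen: only mesh parameters are optimized

prompt = "a wooden chair"  # example prompt, not from the paper
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)
    # The full method maps this text embedding to a CLIP *image* embedding
    # with a pretrained prior; that step is omitted here.

# Optimizable parameters: control-cage vertices, texture map, normal map.
vertices = (0.5 * torch.randn(162, 3, device=device)).requires_grad_()
texture = torch.rand(1, 3, 512, 512, device=device, requires_grad=True)
normal_map = torch.rand(1, 3, 512, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([vertices, texture, normal_map], lr=1e-2)

def render_views(verts, tex, nrm, n_views=8):
    # Placeholder so the sketch runs end to end. A real renderer would
    # subdivide the control cage to its limit surface, shade it with the
    # texture and normal map, and rasterize n_views random camera views
    # with a differentiable rasterizer, keeping gradients w.r.t. all
    # inputs (real use would also apply CLIP's input normalization).
    images = F.interpolate(tex, size=(224, 224), mode="bilinear", align_corners=False)
    images = images.expand(n_views, -1, -1, -1)
    return (images + 0.0 * (verts.sum() + nrm.sum())).clamp(0.0, 1.0)

for step in range(500):
    images = render_views(vertices, texture, normal_map)
    # Random crop/perspective augmentations would be applied here before
    # encoding; they help avoid degenerate, view-specific optima.
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    loss = 1.0 - (img_emb @ text_emb.T).mean()  # maximize CLIP similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()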

Supplemental Material

• Presentation (MP4 file)
• Supplementary Video (MP4 file)




      Published In

      SA '22: SIGGRAPH Asia 2022 Conference Papers
      November 2022, 482 pages
      ISBN: 9781450394703
      DOI: 10.1145/3550469
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. CLIP
      2. geometric modeling
      3. machine learning
      4. neural networks

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • NSERC

      Conference

      SA '22: SIGGRAPH Asia 2022
      December 6-9, 2022
      Daegu, Republic of Korea

      Acceptance Rates

      Overall acceptance rate: 178 of 869 submissions (20%)

      Article Metrics

      • Downloads (last 12 months): 477
      • Downloads (last 6 weeks): 25
      Reflects downloads up to 18 Aug 2024.

      Cited By

      • (2024) DreamCraft: Text-Guided Generation of Functional 3D Environments in Minecraft. Proceedings of the 19th International Conference on the Foundations of Digital Games, 1-15. DOI: 10.1145/3649921.3649943. Online: 21 May 2024.
      • (2024) Revisit MTN: High-resolution Features Deserve More Attention. Proceedings of the 1st ICMR Workshop on Multimedia Object Re-Identification, 1-4. DOI: 10.1145/3643490.3661808. Online: 10 Jun 2024.
      • (2024) DreamFont3D: Personalized Text-to-3D Artistic Font Generation. ACM SIGGRAPH 2024 Conference Papers, 1-11. DOI: 10.1145/3641519.3657476. Online: 13 Jul 2024.
      • (2024) Generative Escher Meshes. ACM SIGGRAPH 2024 Conference Papers, 1-11. DOI: 10.1145/3641519.3657452. Online: 13 Jul 2024.
      • (2024) Text-to-3D Shape Generation. Computer Graphics Forum 43, 2. DOI: 10.1111/cgf.15061. Online: 30 Apr 2024.
      • (2024) FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications. Computer Graphics Forum 43, 2. DOI: 10.1111/cgf.15043. Online: 30 Apr 2024.
      • (2024) Surface-aware Mesh Texture Synthesis with Pre-trained 2D CNNs. Computer Graphics Forum 43, 2. DOI: 10.1111/cgf.15016. Online: 23 Apr 2024.
      • (2024) StyleAvatar: Stylizing Animatable Head Avatars. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 8663-8672. DOI: 10.1109/WACV57701.2024.00848. Online: 3 Jan 2024.
      • (2024) DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation. 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 650-660. DOI: 10.1109/VR58804.2024.00085. Online: 16 Mar 2024.
      • (2024) NeuroCLIP: Neuromorphic Data Understanding by CLIP and SNN. IEEE Signal Processing Letters 31, 246-250. DOI: 10.1109/LSP.2023.3348667. Online: 2024.
