SIGGRAPH Asia Conference Proceedings · Research article
DOI: 10.1145/3550469.3555392

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

Published: 30 November 2022

Abstract

We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision, our method deforms the control shape of a limit subdivision surface, along with its texture map and normal map, to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed into games or modeling applications. We rely only on a pretrained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models, we optimize mesh parameters directly to generate shape, texture, or both. To constrain the optimization toward plausible meshes and textures, we introduce a number of techniques, including image augmentations and a pretrained prior that generates CLIP image embeddings given a text embedding.
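
To make the optimization loop concrete, below is a minimal Python sketch of CLIP-guided optimization of mesh parameters as the abstract describes it. This is an illustration under stated assumptions, not the authors' implementation: render_views is a placeholder for a real differentiable renderer (the actual method rasterizes the limit subdivision surface with a differentiable rasterizer such as nvdiffrast), the prompt and hyperparameters are arbitrary, and the augmentations and text-to-image-embedding prior are reduced to comments.

# Minimal sketch of CLIP-guided mesh optimization (illustrative only).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.float().requires_grad_(False)  # fp32, frozen: only mesh parameters are optimized

prompt = "a wooden chair"  # example prompt, not from the paper
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)
    # The full method maps this text embedding to a CLIP *image* embedding
    # with a pretrained prior; that step is omitted here.

# Optimizable parameters: control-cage vertices, texture map, normal map.
vertices = (0.5 * torch.randn(162, 3, device=device)).requires_grad_()
texture = torch.rand(1, 3, 512, 512, device=device, requires_grad=True)
normal_map = torch.rand(1, 3, 512, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([vertices, texture, normal_map], lr=1e-2)

def render_views(verts, tex, nrm, n_views=8):
    # Placeholder so the sketch runs end to end. A real renderer would
    # subdivide the control cage to its limit surface, shade it with the
    # texture and normal map, and rasterize n_views random camera views
    # with a differentiable rasterizer, keeping gradients w.r.t. all
    # inputs (real use would also apply CLIP's input normalization).
    images = F.interpolate(tex, size=(224, 224), mode="bilinear", align_corners=False)
    images = images.expand(n_views, -1, -1, -1)
    return (images + 0.0 * (verts.sum() + nrm.sum())).clamp(0.0, 1.0)

for step in range(500):
    images = render_views(vertices, texture, normal_map)
    # Random crop/perspective augmentations would be applied here before
    # encoding; they help avoid degenerate, view-specific optima.
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    loss = 1.0 - (img_emb @ text_emb.T).mean()  # maximize CLIP similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()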

Supplemental Material

• Presentation (MP4 file)
• Supplementary Video (MP4 file)




      Published In

      SA '22: SIGGRAPH Asia 2022 Conference Papers
      November 2022, 482 pages
      ISBN: 9781450394703
      DOI: 10.1145/3550469
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. CLIP
      2. geometric modeling
      3. machine learning
      4. neural networks

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • NSERC

      Conference

      SA '22: SIGGRAPH Asia 2022
      December 6-9, 2022
      Daegu, Republic of Korea

      Acceptance Rates

      Overall acceptance rate: 178 of 869 submissions (20%)

      Article Metrics

      • Downloads (last 12 months): 477
      • Downloads (last 6 weeks): 25
      Reflects downloads up to 18 Aug 2024.

      Cited By

      • (2024) DreamCraft: Text-Guided Generation of Functional 3D Environments in Minecraft. Proceedings of the 19th International Conference on the Foundations of Digital Games, 1-15. DOI: 10.1145/3649921.3649943. Online: 21 May 2024.
      • (2024) Revisit MTN: High-resolution Features Deserve More Attention. Proceedings of the 1st ICMR Workshop on Multimedia Object Re-Identification, 1-4. DOI: 10.1145/3643490.3661808. Online: 10 Jun 2024.
      • (2024) DreamFont3D: Personalized Text-to-3D Artistic Font Generation. ACM SIGGRAPH 2024 Conference Papers, 1-11. DOI: 10.1145/3641519.3657476. Online: 13 Jul 2024.
      • (2024) Generative Escher Meshes. ACM SIGGRAPH 2024 Conference Papers, 1-11. DOI: 10.1145/3641519.3657452. Online: 13 Jul 2024.
      • (2024) Text-to-3D Shape Generation. Computer Graphics Forum 43, 2. DOI: 10.1111/cgf.15061. Online: 30 Apr 2024.
      • (2024) FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications. Computer Graphics Forum 43, 2. DOI: 10.1111/cgf.15043. Online: 30 Apr 2024.
      • (2024) Surface-aware Mesh Texture Synthesis with Pre-trained 2D CNNs. Computer Graphics Forum 43, 2. DOI: 10.1111/cgf.15016. Online: 23 Apr 2024.
      • (2024) StyleAvatar: Stylizing Animatable Head Avatars. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 8663-8672. DOI: 10.1109/WACV57701.2024.00848. Online: 3 Jan 2024.
      • (2024) DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation. 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 650-660. DOI: 10.1109/VR58804.2024.00085. Online: 16 Mar 2024.
      • (2024) NeuroCLIP: Neuromorphic Data Understanding by CLIP and SNN. IEEE Signal Processing Letters 31, 246-250. DOI: 10.1109/LSP.2023.3348667. Online: 2024.
