research-article

Open access

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

Authors:

David Rozenberszki,

Stefan Leutenegger,

Angela DaiAuthors Info & Claims

ACM Transactions on Graphics (TOG), Volume 43, Issue 4

Article No.: 106, Pages 1 - 15

https://doi.org/10.1145/3658236

Published: 19 July 2024 Publication History

Abstract

Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive yet imperfect annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task - both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We learn a probabilistic model through diffusion, modeling likely distributions of shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.

Supplementary Material

MP4 File (DiffCAD_video.mp4)

Supplemental video

Download
87.16 MB

References

[1]

Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X Chang, and Matthias Nießner. 2019a. Scan2cad: Learning cad model alignment in rgb-d scans. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 2614--2623.

[2]

Armen Avetisyan, Angela Dai, and Matthias Nießner. 2019b. End-to-end cad model retrieval and 9dof alignment in 3d scans. In Proceedings of the IEEE/CVF International Conference on computer vision. 2551--2560.

[3]

Armen Avetisyan, Tatiana Khanova, Christopher Choy, Denver Dash, Angela Dai, and Matthias Nießner. 2020. Scenecad: Predicting object alignments and layouts in rgb-d scans. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXII 16. Springer, 596--612.

[4]

Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. 2022. IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty. ArXiv abs/2210.03676 (2022). https://api.semanticscholar.org/CorpusID:252762221

[5]

Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. 2021. Label-Efficient Semantic Segmentation with Diffusion Models. ArXiv abs/2112.03126 (2021). https://api.semanticscholar.org/CorpusID:244908617

[6]

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. 2021. ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https://openreview.net/forum?id=tjZjv_qh_CE

[7]

Tim Beyer and Angela Dai. 2022. Weakly-Supervised End-to-End CAD Retrieval to Scan Objects. ArXiv abs/2203.12873 (2022). https://api.semanticscholar.org/CorpusID:247627889

[8]

S. Bhat, Ibraheem Alhashim, and Peter Wonka. 2020. AdaBins: Depth Estimation Using Adaptive Bins. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 4008--4017. https://api.semanticscholar.org/CorpusID:227227779

[9]

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. 2023. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023).

[10]

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. 2022. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems 35 (2022), 15309--15324.

[11]

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-End Object Detection with Transformers. ArXiv abs/2005.12872 (2020). https://api.semanticscholar.org/CorpusID:218889832

[12]

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015).

[13]

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. 2022. Re-Imagen: Retrieval-Augmented Text-to-Image Generator. ArXiv abs/2209.14491 (2022). https://api.semanticscholar.org/CorpusID:252596087

[14]

Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. 2023. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4456--4465.

[15]

Gene Chou, Yuval Bahat, and Felix Heide. 2022. DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions. ArXiv abs/2211.13757 (2022). https://api.semanticscholar.org/CorpusID:254017862

[16]

Gene Chou, Yuval Bahat, and Felix Heide. 2023. Diffusion-sdf: Conditional generative modeling of signed distance functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2262--2272.

[17]

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part VIII 14. Springer, 628--644.

[18]

Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio De Rezende, Yannis Kalantidis, and Diane Larlus. 2021. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8415--8424.

[19]

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5828--5839.

[20]

Congyue Deng, Chiyu Max Jiang, C. Qi, Xinchen Yan, Yin Zhou, Leonidas J. Guibas, and Drago Anguelov. 2022. NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 20637--20647. https://api.semanticscholar.org/CorpusID:254366717

[21]

Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. 2023. Blender-Proc2: A Procedural Pipeline for Photorealistic Rendering. Journal of Open Source Software 8, 82 (2023), 4901.

[22]

Yan Di, Chenyangguang Zhang, Ruida Zhang, Fabian Manhardt, Yongzhi Su, Jason Rambach, Didier Stricker, Xiangyang Ji, and Federico Tombari. 2023. U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8884--8895.

[23]

Christian Diller and Angela Dai. 2023. CG-HOI: Contact-Guided 3D Human-Object Interaction Generation. arXiv preprint arXiv:2311.16097 (2023).

[24]

Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. 2021. Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10786--10796.

[25]

Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. 2023. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015 (2023).

[26]

Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition. 605--613.

[27]

Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2022. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 19358--19369. https://api.semanticscholar.org/CorpusID:253510587

[28]

Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (1981), 381--395.

Digital Library

[29]

Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 2021. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10933--10942.

[30]

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. 2021. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).

[31]

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. 2022. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. In ECCV.

[32]

Georgia Gkioxari, Jitendra Malik, and Justin Johnson. 2019. Mesh r-cnn. In Proceedings of the IEEE/CVF international conference on computer vision. 9785--9795.

[33]

Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rares Ambrus, and Adrien Gaidon. 2023. Towards Zero-Shot Scale-Aware Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[34]

Can Gümeli, Angela Dai, and Matthias Nießner. 2022. Roca: Robust cad model retrieval and alignment from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4022--4031.

[35]

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. 2022. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5152--5161.

[36]

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961--2969.

[37]

Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 2023. Unsupervised Semantic Correspondence Using Stable Diffusion. arXiv preprint arXiv:2305.15581 (2023).

[38]

Jonathan Ho, Ajay Jain, and P. Abbeel. 2020. Denoising Diffusion Probabilistic Models. ArXiv abs/2006.11239 (2020). https://api.semanticscholar.org/CorpusID:219955663

[39]

Muhammad Zubair Irshad, Thomas Kollar, Michael Laskey, Kevin Stone, and Zsolt Kira. 2022. CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation. 2022 International Conference on Robotics and Automation (ICRA) (2022), 10632--10640. https://api.semanticscholar.org/CorpusID:247222831

Digital Library

[40]

Vladislav Ishimtsev, Alexey Bokhovkin, Alexey Artemov, Savva Ignatyev, Matthias Niessner, Denis Zorin, and Evgeny Burnaev. 2020. Cad-deform: Deformable fitting of cad models to 3d scans. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XIII 16. Springer, 599--628.

[41]

Hamid Izadinia, Qi Shan, and Steven M Seitz. 2017. Im2cad. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5134--5143.

[42]

Young Min Kim, Niloy J Mitra, Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring 3d indoor environments with variability and repetition. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1--11.

Digital Library

[43]

Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. 2021. Variational Diffusion Models. ArXiv abs/2107.00630 (2021). https://api.semanticscholar.org/CorpusID:235694314

[44]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. 2023. Segment anything. arXiv preprint arXiv:2304.02643 (2023).

[45]

Juil Koo, Seungwoo Yoo, Minh Hoai Nguyen, and Minhyuk Sung. 2023. SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation. ArXiv abs/2303.12236 (2023). https://api.semanticscholar.org/CorpusID:257663544

[46]

Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. 2020. Mask2cad: 3d shape prediction by learning to segment and retrieve. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16. Springer, 260--277.

[47]

Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. 2021. Patch2cad: Patchwise embedding learning for in-the-wild shape retrieval from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12589--12599.

[48]

Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. 2022. SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image. arXiv preprint arXiv:2210.01044 (2022).

[49]

Florian Langer, Ignas Budvytis, and Roberto Cipolla. 2023. Sparse Multi-Object Render-and-Compare. arXiv preprint arXiv:2310.11184 (2023).

[50]

Boyi Li, Kilian Q. Weinberger, Serge J. Belongie, Vladlen Koltun, and René Ranftl. 2022c. Language-driven Semantic Segmentation. ArXiv abs/2201.03546 (2022). https://api.semanticscholar.org/CorpusID:245836975

[51]

Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. 2022b. Diffusion-SDF: Text-to-Shape via Voxelized Diffusion. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 12642--12651. https://api.semanticscholar.org/CorpusID:254366593

[52]

Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nießner. 2015. Database-assisted object retrieval for real-time 3d reconstruction. In Computer graphics forum, Vol. 34. Wiley Online Library, 435--446.

[53]

Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. 2022a. DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation. Machine Intelligence Research 20 (2022), 837 -- 854. https://api.semanticscholar.org/CorpusID:247762153

[54]

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Péter Vajda, and Diana Marculescu. 2022. Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 7061--7070. https://api.semanticscholar.org/CorpusID:252780581

[55]

Kai-En Lin, Yen-Chen Lin, Wei-Sheng Lai, Tsung-Yi Lin, Yichang Shih, and Ravi Ramamoorthi. 2022. Vision Transformer for NeRF-Based View Synthesis from a Single Input Image. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2022), 806--815. https://api.semanticscholar.org/CorpusID:250450901

[56]

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2117--2125.

[57]

Zhi-Hao Lin, Sheng-Yu Huang, and Yu-Chiang Frank Wang. 2020. Convolution in the cloud: Learning deformable kernels in 3d graph convolution networks for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1800--1809.

[58]

Haolin Liu, Yujian Zheng, Guanying Chen, Shuguang Cui, and Xiaoguang Han. 2022. Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes. ArXiv abs/2207.08656 (2022). https://api.semanticscholar.org/CorpusID:250627520

[59]

Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431--3440.

[60]

Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. 2023. Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence. In Advances in Neural Information Processing Systems.

[61]

Priyanka Mandikal, L. NavaneetK., Mayank Agarwal, and R. Venkatesh Babu. 2018. 3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image. ArXiv abs/1807.07796 (2018). https://api.semanticscholar.org/CorpusID:49905039

[62]

Kevis-Kokitsi Maninis, Stefan Popov, Matthias Nießner, and Vittorio Ferrari. 2022. Vid2cad: Cad model alignment using multi-view constraints from videos. IEEE transactions on pattern analysis and machine intelligence 45, 1 (2022), 1320--1327.

[63]

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4460--4470.

[64]

Gimin Nam, Mariem Khlifi, Andrew Rodriguez, Alberto Tono, Linqi Zhou, and Paul Guerrero. 2022. 3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models. ArXiv abs/2212.00842 (2022). https://api.semanticscholar.org/CorpusID:254220714

[65]

Liangliang Nan, Ke Xie, and Andrei Sharf. 2012. A search-classify approach for cluttered indoor scene understanding. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1--10.

Digital Library

[66]

Andrei Neculai, Yanbei Chen, and Zeynep Akata. 2022. Probabilistic compositional embeddings for multimodal image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4547--4557.

[67]

Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, and Jianjun Zhang. 2020. Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 52--61. https://api.semanticscholar.org/CorpusID:211532831

[68]

Junyi Pan, Xiaoguang Han, Weikai Chen, Jiapeng Tang, and Kui Jia. 2019. Deep Mesh Reconstruction From Single RGB Images via Topology Modification Networks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), 9963--9972. https://api.semanticscholar.org/CorpusID:202121070

[69]

William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195--4205.

[70]

Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16. Springer, 523--540.

[71]

Michael Ramamonjisoa and Vincent Lepetit. 2019. SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) (2019), 2109--2118. https://api.semanticscholar.org/CorpusID:160009795

[72]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3.

[73]

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision. 12179--12188.

[74]

Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501 (2020).

[75]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]

[76]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (2015), 211--252.

[77]

Manuel Schwonberg, Joshua Niemeijer, Jan-Aike Termöhlen, Jörg P Schäfer, Nico M Schmidt, Hanno Gottschalk, and Tim Fingscheidt. 2023. Survey on unsupervised domain adaptation for semantic segmentation for visual perception in automated driving. IEEE Access (2023).

[78]

Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and Baining Guo. 2012. An interactive approach to semantic modeling of indoor scenes with an rgbd camera. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1--11.

Digital Library

[79]

Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. 2022. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849 (2022).

[80]

J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 2023. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20875--20886.

[81]

Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ArXiv abs/1503.03585 (2015). https://api.semanticscholar.org/CorpusID:14888175

[82]

Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32 (2019).

[83]

Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. Score-Based Generative Modeling through Stochastic Differential Equations. ArXiv abs/2011.13456 (2020). https://api.semanticscholar.org/CorpusID:227209335

[84]

David Stutz and Andreas Geiger. 2020. Learning 3d shape completion under weak supervision. International Journal of Computer Vision 128 (2020), 1162--1181.

Digital Library

[85]

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. 2023. Emergent Correspondence from Image Diffusion. arXiv preprint arXiv:2306.03881 (2023).

[86]

Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. 2022. Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022).

[87]

Mikaela Angelina Uy, Jingwei Huang, Minhyuk Sung, Tolga Birdal, and Leonidas Guibas. 2020. Deformation-aware 3d model embedding and retrieval. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part VII 16. Springer, 397--413.

[88]

Mikaela Angelina Uy, Vladimir G Kim, Minhyuk Sung, Noam Aigerman, Siddhartha Chaudhuri, and Leonidas J Guibas. 2021. Joint learning of 3d shape retrieval and deformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11713--11722.

[89]

Qun Wan, Yidong Li, Haidong Cui, and Zheng Mao Feng. 2019. 3D-Mask-GAN:Unsupervised Single-View 3D Object Reconstruction. 2019 6th International Conference on Behavioral, Economic and Socio-Cultural Computing (BESC) (2019), 1--6. https://api.semanticscholar.org/CorpusID:210888106

[90]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. 2019. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2642--2651.

[91]

Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, W. Liu, and Yu-Gang Jiang. 2018. Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. ArXiv abs/1804.01654 (2018). https://api.semanticscholar.org/CorpusID:4633214

[92]

Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. 2022. CaTGrasp: Learning Category-Level Task-Relevant Grasping in Clutter from Simulation. ICRA 2022 (2022).

Digital Library

[93]

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. 2023. Point transformer v3: Simpler, faster, stronger. arXiv preprint arXiv:2312.10035 (2023).

[94]

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. 2023. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2955--2966.

[95]

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. 2020. pixelNeRF: Neural Radiance Fields from One or Few Images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 4576--4585. https://api.semanticscholar.org/CorpusID:227254854

[96]

Yanjie Ze and Xiaolong Wang. 2022. Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset. Advances in Neural Information Processing Systems 35 (2022), 27469--27483.

[97]

Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022. LION: Latent Point Diffusion Models for 3D Shape Generation. ArXiv abs/2210.06978 (2022). https://api.semanticscholar.org/CorpusID:252872881

[98]

Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 2023. 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models. ACM Transactions on Graphics (TOG) 42 (2023), 1 -- 16. https://api.semanticscholar.org/CorpusID:256358401

Digital Library

[99]

Cheng Zhang, Zhaopeng Cui, Yinda Zhang, Bing Zeng, Marc Pollefeys, and Shuaicheng Liu. 2021. Holistic 3D Scene Understanding from a Single Image with Implicit Representation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 8829--8838. https://api.semanticscholar.org/CorpusID:232185507

[100]

Jiyao Zhang, Mingdong Wu, and Hao Dong. 2024. Generative Category-level Object Pose Estimation via Diffusion Models. Advances in Neural Information Processing Systems 36 (2024).

[101]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586--595.

[102]

Linqi Zhou, Yilun Du, and Jiajun Wu. 2021. 3D Shape Generation and Completion through Point-Voxel Diffusion. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 5806--5815. https://api.semanticscholar.org/CorpusID:233182041

Cited By

Zhang GWang YLuo CXu SMing YPeng JZhang M(2024)Visual Harmony: LLM’s Power in Crafting Coherent Indoor Scenes from ImagesPattern Recognition and Computer Vision10.1007/978-981-97-8508-7_1(3-17)Online publication date: 3-Nov-2024
https://doi.org/10.1007/978-981-97-8508-7_1

Index Terms

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding

Recommendations

WeLSA: Learning to Predict 6D Pose from Weakly Labeled Data Using Shape Alignment
Computer Vision – ECCV 2022
Abstract
Object pose estimation is a crucial task in computer vision and augmented reality. One of its key challenges is the difficulty of annotation of real training data and the lack of textured CAD models. Therefore, pipelines which do not require CAD ...
Ground Truth Inference for Weakly Supervised Entity Matching
PACMMOD

Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching ...
SPL-LDP: a label distribution propagation method for semi-supervised partial label learning
Abstract
Partial label learning learns from examples represented by a single instance while associated with multiple candidate labels, among which only one valid label resides. However, in real-world applications, collecting candidate label sets for all ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Graphics

ACM Transactions on Graphics Volume 43, Issue 4

July 2024

1774 pages

EISSN:1557-7368

DOI:10.1145/3675116

Issue’s Table of Contents

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2024

Published in TOG Volume 43, Issue 4

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Bavarian State Ministry of Science and the Arts
ERC Starting Grant
German Research Foundation (DFG) Grant

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
356
Total Downloads

Downloads (Last 12 months)356
Downloads (Last 6 weeks)144

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang GWang YLuo CXu SMing YPeng JZhang M(2024)Visual Harmony: LLM’s Power in Crafting Coherent Indoor Scenes from ImagesPattern Recognition and Computer Vision10.1007/978-981-97-8508-7_1(3-17)Online publication date: 3-Nov-2024
https://doi.org/10.1007/978-981-97-8508-7_1

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents