ShapeCaptioner: Generative Caption Network for 3D Shapes by Learning a Mapping from Parts Detected in Multiple Views to Sentences

Authors:

Matthias Zwicker

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 1018 - 1027

https://doi.org/10.1145/3394171.3413889

Published: 12 October 2020

Abstract

3D shape captioning is a challenging application in 3D shape understanding. Captions produced by recent multi-view based methods show that these methods fail to capture part-level characteristics of 3D shapes. As a result, their captions lack the detailed part-level descriptions that humans tend to focus on. To resolve this issue, we propose ShapeCaptioner, a generative caption network that performs 3D shape captioning from semantic parts detected in multiple views. Our novelty lies in learning part detection in multiple views from 3D shape segmentations and transferring this knowledge to facilitate learning the mapping from 3D shapes to sentences. Specifically, ShapeCaptioner aggregates the parts detected in multiple colored views using our novel part class specific aggregation to represent a 3D shape, and then employs a sequence-to-sequence model to generate the caption. Our results show that ShapeCaptioner outperforms previous work, learning 3D shape features with more detailed part characteristics that facilitate better 3D shape captioning.
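The aggregation step described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the part class names, the 3-dimensional toy features, the function name aggregate_by_part_class, and the use of element-wise max pooling as the pooling operator. It only shows the general idea of grouping parts detected across views by part class and producing one feature per class, in a fixed order, that a sequence-to-sequence model could then consume.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical part classes for illustration; the abstract does not list them.
PART_CLASSES: Tuple[str, ...] = ("back", "seat", "leg", "arm")

# One detection = (part class label, feature vector of that detected part).
Detection = Tuple[str, Sequence[float]]


def aggregate_by_part_class(
    views: Sequence[Sequence[Detection]],
    part_classes: Sequence[str] = PART_CLASSES,
    dim: int = 3,
) -> List[List[float]]:
    """Pool detected part features per part class across all views.

    Element-wise max pooling is an assumption made for this sketch; the
    abstract does not specify the pooling operator.
    """
    pooled: Dict[str, List[float]] = {}
    for detections in views:
        for part_class, feature in detections:
            if part_class not in part_classes:
                continue  # ignore labels outside the known part classes
            if len(feature) != dim:
                raise ValueError(f"expected {dim}-d feature, got {len(feature)}")
            if part_class in pooled:
                # Keep the strongest response per dimension seen so far.
                pooled[part_class] = [
                    max(a, b) for a, b in zip(pooled[part_class], feature)
                ]
            else:
                pooled[part_class] = list(feature)
    # One vector per class in a fixed order; undetected classes get zeros.
    return [pooled.get(name, [0.0] * dim) for name in part_classes]


# Two toy views of one shape, each with a few detected parts.
views = [
    [("leg", [0.1, 0.9, 0.0]), ("seat", [0.5, 0.2, 0.3])],
    [("leg", [0.4, 0.3, 0.2])],
]
shape_sequence = aggregate_by_part_class(views)
```

Here shape_sequence holds one vector per part class. Keeping a fixed class order means the downstream caption decoder always sees a given part class at the same position, which is one plausible reason to aggregate per class rather than pooling all detections together.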

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

October 2020

4889 pages

ISBN: 9781450379885

DOI: 10.1145/3394171

General Chairs: Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China), Rita Cucchiara (UNIMORE, Italy), Xian-Sheng Hua (Alibaba Group, China)

Program Chairs: Guo-Jun Qi (Futurewei Technologies, USA), Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy), Zhengyou Zhang (Tencent, China), Roger Zimmermann (National University of Singapore, Singapore)
Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Key R&D Program of China

Conference

MM '20: The 28th ACM International Conference on Multimedia

Sponsor: SIGMM

October 12 - 16, 2020

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
