DOI: 10.1145/3394171.3413889
research-article

ShapeCaptioner: Generative Caption Network for 3D Shapes by Learning a Mapping from Parts Detected in Multiple Views to Sentences

Published: 12 October 2020

Abstract

3D shape captioning is a challenging application in 3D shape understanding. Captions produced by recent multi-view based methods reveal that these methods cannot capture part-level characteristics of 3D shapes. As a result, their captions lack the detailed part-level descriptions that humans tend to focus on. To resolve this issue, we propose ShapeCaptioner, a generative caption network that performs 3D shape captioning from semantic parts detected in multiple views. Our novelty lies in learning the knowledge of part detection in multiple views from 3D shape segmentations and transferring this knowledge to facilitate learning the mapping from 3D shapes to sentences. Specifically, ShapeCaptioner represents a 3D shape by aggregating the parts detected in multiple colored views using our novel part class specific aggregation, and then employs a sequence-to-sequence model to generate the caption. Our results show that ShapeCaptioner outperforms previous work: it learns 3D shape features with more detailed part characteristics, which leads to better 3D shape captions.
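As a rough illustration of what "part class specific aggregation" could look like, the sketch below groups part features detected across all views by their part class and pools each group into one vector. The function name `aggregate_parts_by_class`, the toy part classes, and the choice of element-wise max pooling are all assumptions made for this example; the abstract does not state which pooling operator or feature dimensions the paper uses.

```python
from collections import defaultdict


def aggregate_parts_by_class(detections, num_classes, feat_dim):
    """Pool per-part features into one vector per part class.

    detections: list of (part_class, feature) pairs gathered from all views,
    where part_class is an int in [0, num_classes) and feature is a list of
    feat_dim floats. Element-wise max pooling is an ASSUMPTION; the abstract
    does not specify the operator.
    """
    grouped = defaultdict(list)
    for part_class, feature in detections:
        if len(feature) != feat_dim:
            raise ValueError("feature has wrong dimensionality")
        grouped[part_class].append(feature)

    descriptor = []
    for c in range(num_classes):
        feats = grouped.get(c)
        if feats:
            # Element-wise max over every detection of this part class.
            pooled = [max(column) for column in zip(*feats)]
        else:
            # No part of this class was detected in any view.
            pooled = [0.0] * feat_dim
        descriptor.append(pooled)
    return descriptor


# Toy input: hypothetical part classes 0=leg, 1=seat, 2=back.
detections = [
    (0, [0.2, 0.9]),  # a leg detected in one view
    (0, [0.7, 0.1]),  # a leg detected in another view
    (1, [0.5, 0.5]),  # a seat detected in one view
]
descriptor = aggregate_parts_by_class(detections, num_classes=3, feat_dim=2)
print(descriptor)
```

Under these assumptions, the resulting list of per-class vectors has a fixed length regardless of how many views or detections there are, which is what would let it serve as the input sequence to a sequence-to-sequence caption decoder.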

Supplementary Material

MP4 File (3394171.3413889.mp4)

    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN:9781450379885
    DOI:10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3D shape captioning
    2. RNN
    3. description generation
    4. multiple views
    5. semantic part detection

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 27
    • Downloads (Last 6 weeks): 3
    Reflects downloads up to 22 Jan 2025

    Cited By

    • (2025) NeuralTPS: Learning Signed Distance Functions Without Priors From Single Sparse Point Clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence 47:1 (565-582). DOI: 10.1109/TPAMI.2024.3476349. Online publication date: Jan-2025
    • (2025) T2TD: Text-3D Generation Model Based on Prior Knowledge Guidance. IEEE Transactions on Pattern Analysis and Machine Intelligence 47:1 (172-189). DOI: 10.1109/TPAMI.2024.3463753. Online publication date: Jan-2025
    • (2024) Leveraging VLM-based pipelines to annotate 3D objects. Proceedings of the 41st International Conference on Machine Learning (22710-22747). DOI: 10.5555/3692070.3692984. Online publication date: 21-Jul-2024
    • (2024) TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (5803-5813). DOI: 10.1109/WACV57701.2024.00571. Online publication date: 3-Jan-2024
    • (2024) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (3512-3522). DOI: 10.1109/WACV57701.2024.00349. Online publication date: 3-Jan-2024
    • (2024) Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning. IEEE Transactions on Visualization and Computer Graphics 30:8 (4867-4880). DOI: 10.1109/TVCG.2023.3279204. Online publication date: Aug-2024
    • (2024) MuSic-UDF: Learning Multi-Scale dynamic grid representation for high-fidelity surface reconstruction from point clouds. Computers & Graphics 124 (104081). DOI: 10.1016/j.cag.2024.104081. Online publication date: Nov-2024
    • (2024) TopologyFormer: structure transformer assisted topology reconstruction for point cloud completion. Multimedia Tools and Applications 83:26 (68743-68771). DOI: 10.1007/s11042-024-18136-9. Online publication date: 26-Jan-2024
    • (2024) Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes. Computer Vision – ECCV 2024 (305-323). DOI: 10.1007/978-3-031-73195-2_18. Online publication date: 27-Nov-2024
    • (2023) Learning a More Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (3158-3169). DOI: 10.1109/ICCV51070.2023.00295. Online publication date: 1-Oct-2023
