A Recurrent Point Clouds Selection Method for 3D Dense Captioning

Hayashi, Shinko; Zhang, Zhiqiang; Zhou, Jinja

doi:10.1007/978-3-031-30111-7_23

Shinko Hayashi¹²,
Zhiqiang Zhang¹² &
Jinja Zhou¹²

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13625)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

3D dense captioning provides descriptions for corresponding objects in 3D scenes represented as RGB-D scans and point clouds. However, when generating a description, existing methods select points randomly from a point cloud regardless of their importance, which degrades the quality of the description by removing important points or including low-value points. To solve this problem, we propose a recurrent point clouds selection (RPCS) method that mitigates descriptive deficiencies in 3D dense captioning by iteratively checking the caption results of different point clouds. Our method is divided into two steps. In step 1, this work randomly selects cloud points and uses the objectness score to evaluate the generated description. The objectness score indicates whether the proposed object is close to the ground truth; the higher the score, the closer the proposed object is to the ground truth. In step 2, if the objectness score is lower than the threshold, step 1 is repeated to generate another group of cloud points and evaluate the results. This loop stops when the objectness score is no longer reduced. The loop termination conditions are configurable according to the requirements for accuracy and processing time. As a result, our work decreases the number of deficient descriptions and outperforms previous state-of-the-art methods by a large margin (6.58% to 35.70% improvement in CIDEr, BLEU-4, METEOR, and ROUGE).
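
The two-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `recurrent_selection`, `caption_fn`, and `toy_caption_fn`, and the parameters `num_samples`, `threshold`, and `max_iters`, are all hypothetical. The stopping rule (stop once the score reaches the threshold, or after a fixed number of attempts) is only one possible reading of the configurable termination conditions. The toy scorer stands in for a real captioning model and its objectness score.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
# Hypothetical interface: takes a point subset, returns (caption, objectness score).
CaptionFn = Callable[[List[Point]], Tuple[str, float]]


def recurrent_selection(
    points: List[Point],
    caption_fn: CaptionFn,
    num_samples: int,
    threshold: float,
    max_iters: int,
    seed: int = 0,
) -> Tuple[str, float, int]:
    """Resample points until the score reaches `threshold` or `max_iters` is hit.

    Returns the best caption seen, its score, and the number of iterations run.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    rng = random.Random(seed)
    best_caption, best_score = "", float("-inf")
    iterations = 0
    for iterations in range(1, max_iters + 1):
        # Step 1: randomly select a group of points and score the resulting caption.
        subset = rng.sample(points, num_samples)
        caption, score = caption_fn(subset)
        if score > best_score:
            best_caption, best_score = caption, score
        # Step 2: a score below the threshold triggers another round of step 1.
        if score >= threshold:
            break
    return best_caption, best_score, iterations


def toy_caption_fn(subset: List[Point]) -> Tuple[str, float]:
    """Placeholder scorer (assumption): fraction of points inside the unit sphere."""
    inside = sum(1 for x, y, z in subset if x * x + y * y + z * z <= 1.0)
    return f"object described from {len(subset)} points", inside / len(subset)
```

Calling `recurrent_selection(points, toy_caption_fn, num_samples=10, threshold=0.5, max_iters=5)` on a list of at least ten points returns the best-scoring caption found. Raising `max_iters` trades processing time for a better chance of reaching the threshold, which mirrors the accuracy and time trade-off the abstract mentions.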

Author information

Authors and Affiliations

Hosei University, Tokyo, Japan
Shinko Hayashi, Zhiqiang Zhang & Jinja Zhou

Corresponding author

Correspondence to Jinja Zhou.

Editor information

Editors and Affiliations

Indian Institute of Technology Indore, Indore, India
Mohammad Tanveer
Indian Institute of Information Technology - Allahabad, Prayagraj, India
Sonali Agarwal
Kobe University, Kobe, Japan
Seiichi Ozawa
Indian Institute of Technology Patna, Patna, India
Asif Ekbal
University of Innsbruck, Innsbruck, Austria
Adam Jatowt

About this paper

Cite this paper

Hayashi, S., Zhang, Z., Zhou, J. (2023). A Recurrent Point Clouds Selection Method for 3D Dense Captioning. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_23

DOI: https://doi.org/10.1007/978-3-031-30111-7_23
Published: 13 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30110-0
Online ISBN: 978-3-031-30111-7
eBook Packages: Computer Science, Computer Science (R0)
