Abstract
3D dense captioning generates descriptions for objects in 3D scenes represented as RGB-D scans and point clouds. However, when generating a description, existing methods select points randomly from a point cloud regardless of their importance, which degrades the quality of the description by removing important points or including low-value points. To solve this problem, we propose a recurrent point clouds selection (RPCS) method that mitigates descriptive deficiencies in 3D dense captioning by iteratively checking the caption results obtained from different point selections. Our method is divided into two steps. In step 1, we randomly select points from the cloud and use the objectness score to evaluate the generated description. The objectness score indicates whether the proposed object is close to the ground truth; among positive values, a higher score means the proposed object is closer to the ground truth. In step 2, if the objectness score is lower than the threshold, step 1 is repeated to generate another group of points and evaluate the results. This loop stops when the objectness score is no longer reduced. The loop termination conditions are configurable according to the required accuracy and processing time. As a result, our method reduces deficient descriptions and outperforms previous state-of-the-art methods by a large margin (6.58% to 35.70% improvement across CIDEr, BLEU-4, METEOR, and ROUGE).
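The two-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `recurrent_point_selection`, `score_fn`, `sample_size`, `threshold`, and `max_rounds` are hypothetical names, `score_fn` stands in for the captioning model's objectness score, and the stopping rule (threshold reached or a round limit hit) is only one possible choice of the configurable termination conditions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


def recurrent_point_selection(
    points: Sequence[Point],
    sample_size: int,
    score_fn: Callable[[List[Point]], float],
    threshold: float,
    max_rounds: int = 10,
    seed: int = 0,
) -> Tuple[List[Point], float, int]:
    """Resample points until the score reaches the threshold (hypothetical sketch).

    Returns the best-scoring subset seen, its score, and the rounds used.
    """
    rng = random.Random(seed)
    pool = list(points)
    best_subset: List[Point] = []
    best_score = float("-inf")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Step 1: draw a random group of points and score the result.
        # score_fn is an assumed stand-in for the objectness score.
        subset = rng.sample(pool, sample_size)
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
        # Step 2: repeat step 1 only while the score is below the threshold.
        if score >= threshold:
            break
    return best_subset, best_score, rounds


# Toy usage: the "score" is the fraction of sampled points with x < 50.
def toy_score(subset: List[Point]) -> float:
    return sum(1 for p in subset if p[0] < 50) / len(subset)


toy_points = [(float(i), 0.0, 0.0) for i in range(100)]
subset, score, rounds = recurrent_point_selection(
    toy_points, sample_size=10, score_fn=toy_score, threshold=0.6, max_rounds=50
)
print(len(subset), score, rounds)
```

Raising `threshold` or `max_rounds` in this sketch trades processing time for a better-scoring selection, which mirrors the accuracy versus time trade-off the abstract mentions.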
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hayashi, S., Zhang, Z., Zhou, J. (2023). A Recurrent Point Clouds Selection Method for 3D Dense Captioning. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_23
DOI: https://doi.org/10.1007/978-3-031-30111-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30110-0
Online ISBN: 978-3-031-30111-7
eBook Packages: Computer Science (R0)