Image Captioning Using Motion-CNN with Object Detection
Abstract
1. Introduction
- We introduce a motion-CNN with object detection, which automatically extracts motion features from object regions in the image.
- We analyze our model, in particular the effect of using only motion features relevant to the object regions, by comparing the error obtained with all motion features against that obtained with only the motion features around the object regions.
- We achieve higher accuracy than earlier methods on the MSR-VTT2016-Image and MSCOCO datasets.
2. Related Work
3. Method
3.1. Concept
3.2. Overall Model Architecture
3.3. Object Detection
3.4. Motion CNN with Object Detection Architecture
4. Experiments
4.1. Implementation Details
- Experiment 1 examined the accuracy of the motion features in object regions compared with other image regions. Specifically, MSR-VTT2016-Image was used to measure the error between the estimated motion features and the ground-truth motion features. The motion features consist of three channels: the angle in the x direction, the angle in the y direction, and the magnitude. The error was calculated as the mean squared error (MSE); a minimal sketch of this region-masked error computation is given after this list.
- Experiment 2 performed image captioning on copyright-free images that are freely available on the internet. We verified the operation of our model and compared it with other models.
- Experiment 3 analyzed the performance of our model on MSR-VTT2016-Image. We compared our model with other models.
- Experiment 4 performed image captioning on MSCOCO. We compared our model with other models.
- Previous method [1]: This model includes image feature extraction, caption generation, and object detection components. The model was re-implemented as described in the original paper. For the activation functions, the ReLU function was used. For Experiment 3, ResNet-101 [12] was used for image feature extraction; for Experiment 4, Faster R-CNN [10] with ResNet-101 [12] was used for image feature extraction.
- Previous method [7]: This model includes image feature extraction and caption generation components. The model has an adaptive attention mechanism, which decides whether to rely on visual features or the LSTM’s memory according to the words in the caption. The results are quoted from the original paper.
- Proposed method: Our proposed model includes motion-CNN-based image feature extraction, caption generation, motion estimation, and object detection components. In the “with motion estimation” variant, the motion features are optical flow estimated by a neural network from a single image. In the “with optical flow” variant, the motion features are optical flow calculated from two consecutive images, which provides high-quality motion features. For Experiment 3, ResNet-101 [12] was used for image feature extraction; for Experiment 4, Faster R-CNN [10] with ResNet-101 [12] was used for image feature extraction. Minimal illustrative sketches of the feature extraction and of the Experiment 1 error computation follow this list.
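The image feature extraction described above can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal example, assuming PyTorch and torchvision, that strips the classification head from a pretrained ResNet-101 [12] to obtain a spatial feature map (the Faster R-CNN [10] setting of Experiment 4 would instead use region-level features from its detector).

```python
import torch
import torchvision

# Illustrative sketch (not the authors' code): obtain a spatial feature map
# from ResNet-101 by dropping the average-pooling and classification layers.
# (Newer torchvision versions use the `weights=` argument instead of `pretrained=`.)
backbone = torchvision.models.resnet101(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

# Dummy batch with one 224x224 RGB image; in practice the input is resized
# and normalized with the ImageNet statistics.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = feature_extractor(image)

print(features.shape)  # torch.Size([1, 2048, 7, 7])
```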
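For the region-masked error of Experiment 1, the following NumPy sketch shows one way to compute the MSE over the whole image and over the detected object regions only. It is illustrative: the array shapes, the bounding-box format, and the helper names are assumptions, and in practice the ground-truth motion features would be obtained from optical flow computed between two consecutive frames (e.g., with the optical-flow CNN of [16]).

```python
import numpy as np

def mse(pred, gt, mask=None):
    """Mean squared error between motion-feature maps, optionally restricted
    to a binary spatial mask."""
    err = (pred.astype(np.float64) - gt.astype(np.float64)) ** 2
    if mask is None:
        return err.mean()
    return err[mask.astype(bool)].mean()  # average over masked pixels and channels

def boxes_to_mask(boxes, height, width):
    """Union of detected object boxes (x1, y1, x2, y2) as a binary mask."""
    mask = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True
    return mask

# Hypothetical example: H x W x 3 motion features (the three channels above).
H, W = 256, 256
est_motion = np.random.rand(H, W, 3)   # estimated motion features
gt_motion = np.random.rand(H, W, 3)    # ground-truth motion features
object_boxes = [(40, 60, 120, 200)]    # one detected object region

mse_overall = mse(est_motion, gt_motion)
mse_objects = mse(est_motion, gt_motion, boxes_to_mask(object_boxes, H, W))
print(mse_overall, mse_objects)
```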
4.2. Detailed Workflow
4.3. Datasets
4.4. Evaluation Metrics
4.5. Results of Experiment 1
4.6. Results of Experiment 2
4.7. Results of Experiment 3
4.8. Results of Experiment 4
4.9. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
2. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
3. Karpathy, A.; Li, F.-F. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137.
4. Iwamura, K.; Louhi Kasahara, J.Y.; Moro, A.; Yamashita, A.; Asama, H. Potential of Incorporating Motion Estimation for Image Captioning. In Proceedings of the IEEE/SICE International Symposium on System Integration, Fukushima, Japan, 11–14 January 2021.
5. Wang, C.; Yang, H.; Bartz, C.; Meinel, C. Image Captioning with Deep Bidirectional LSTMs. In Proceedings of the ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 988–997.
6. Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
7. Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383.
8. Johnson, J.; Karpathy, A.; Li, F.-F. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4565–4574.
9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
11. Gao, R.; Xiong, B.; Grauman, K. Im2Flow: Motion Hallucination from Static Images for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5937–5947.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
14. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
15. Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 5288–5296.
16. Hui, T.W.; Tang, X.; Loy, C.C. A Lightweight Optical Flow CNN - Revisiting Data Fidelity and Regularization. arXiv 2019, arXiv:1903.07414.
17. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
18. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
19. Denkowski, M.; Lavie, A. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 376–380.
20. Lin, C.Y.; Cao, G.; Gao, J.; Nie, J.Y. An Information-theoretic Approach to Automatic Evaluation of Summaries. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, New York, NY, USA, 4–9 June 2006; pp. 463–470.
21. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
22. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402.
MSE between the estimated motion features and the ground-truth motion features on MSR-VTT2016-Image (Experiment 1), computed over all image regions and over object regions only:

| Dataset | MSE (Overall Image Regions) | MSE (Object Image Regions) |
|---|---|---|
| MSR-VTT2016-Image | 3743.7 | 2307.8 |
Image captioning results of the compared methods (BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr scores):

| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| Previous method [6] | 49.4 | 31.5 | 21.0 | 14.2 | 15.8 | 39.2 | 31.4 |
| Previous method [1] | 48.8 | 31.5 | 21.2 | 14.4 | 15.8 | 39.3 | 32.1 |
| Previous method [4] | 49.5 | 31.8 | 21.4 | 14.5 | 15.8 | 39.3 | 32.5 |
| With optical flow | 49.3 | 31.8 | 21.4 | 14.6 | 16.0 | 39.3 | 32.7 |
| With motion estimation | 49.9 | 32.2 | 21.5 | 14.5 | 16.1 | 39.5 | 32.7 |
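The captioning scores in the table above (BLEU, METEOR, ROUGE-L, and CIDEr) are standard corpus-level metrics and are typically computed with the COCO caption evaluation toolkit. As a minimal, illustrative sketch only (not the evaluation code used in the paper), BLEU-1 through BLEU-4 can be approximated with NLTK on tokenized captions, assuming hypothetical example captions:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical tokenized captions: each hypothesis has a list of references.
references = [[["a", "man", "is", "riding", "a", "bike"],
               ["a", "person", "rides", "a", "bicycle"]]]
hypotheses = [["a", "man", "rides", "a", "bike"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights for BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.1f}")
```

Reported results aggregate over the full test set with multiple reference captions per image, as corpus_bleu does here.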