Abstract
Medical image report writing is a time-consuming and knowledge-intensive task, yet existing machine/deep learning models often generate overly similar reports and inaccurate descriptions. To address these issues, we propose a multi-view and multi-modal (MvMM) approach that combines visual features from multiple viewing angles with medical semantic features to generate diverse and accurate medical reports. First, we design a multi-view encoder with attention that extracts visual features from the frontal and lateral views. Second, we extract medical concepts from the radiology reports, use them as semantic features, and fuse them with the visual features through a two-layer decoder with attention. Third, we fine-tune the model parameters with self-critical training and a coverage reward so that the generated reports mention medical concepts more accurately. Experimental results show that our method achieves noticeable improvements over the baseline approaches, increasing the CIDEr score by 0.157.
Supported by Hangzhou Innovation Institution, Beihang University.
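To make the pipeline concrete, the following is a minimal PyTorch sketch of the three ingredients the abstract names: a multi-view encoder over frontal and lateral images, medical-concept embeddings as a second modality, and a two-layer decoder in which each layer attends over one modality. Everything here is an illustrative assumption rather than the authors' implementation: the class names, layer sizes, the use of LSTM cells, the tiny convolutional stand-in for the CNN backbone, and the split of visual versus semantic attention across the two decoder layers are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention over a set of feature vectors."""

    def __init__(self, feat_dim, query_dim, hidden=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hidden)
        self.w_query = nn.Linear(query_dim, hidden)
        self.v = nn.Linear(hidden, 1)

    def forward(self, feats, query):
        # feats: (B, N, feat_dim); query: (B, query_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_query(query).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                  # (B, N, 1) attention weights
        return (alpha * feats).sum(dim=1)                 # (B, feat_dim) context vector


class MultiViewEncoder(nn.Module):
    """Encodes the frontal and lateral views into one set of region features.
    A tiny conv stack stands in for the paper's CNN backbone."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=4, padding=1), nn.ReLU(),
        )

    def forward(self, frontal, lateral):
        regions = []
        for img in (frontal, lateral):
            f = self.cnn(img)                             # (B, C, H, W)
            regions.append(f.flatten(2).transpose(1, 2))  # (B, H*W, C) region features
        return torch.cat(regions, dim=1)                  # both views, concatenated


class TwoLayerAttnDecoder(nn.Module):
    """Two stacked LSTM cells: the first attends over visual regions, the
    second over embedded medical concepts (one plausible reading of the
    'two-layer decoder with attention')."""

    def __init__(self, vocab, n_concepts, feat_dim=512, emb=256, hid=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.concept_emb = nn.Embedding(n_concepts, feat_dim)
        self.vis_attn = AdditiveAttention(feat_dim, hid)
        self.sem_attn = AdditiveAttention(feat_dim, hid)
        self.lstm1 = nn.LSTMCell(emb + feat_dim, hid)
        self.lstm2 = nn.LSTMCell(hid + feat_dim, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, regions, concept_ids, tokens):
        # regions: (B, N, feat_dim); concept_ids: (B, K); tokens: (B, T)
        B, T = tokens.shape
        concepts = self.concept_emb(concept_ids)          # (B, K, feat_dim)
        h1 = c1 = h2 = c2 = regions.new_zeros(B, self.lstm1.hidden_size)
        logits = []
        for t in range(T):                                # teacher forcing
            vis = self.vis_attn(regions, h1)              # attended visual context
            h1, c1 = self.lstm1(torch.cat([self.word_emb(tokens[:, t]), vis], -1), (h1, c1))
            sem = self.sem_attn(concepts, h2)             # attended semantic context
            h2, c2 = self.lstm2(torch.cat([h1, sem], -1), (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)                 # (B, T, vocab)


# Usage on random data: two 256x256 single-channel views, 5 concept tags per
# study, teacher-forced over a 20-token report prefix.
enc = MultiViewEncoder()
dec = TwoLayerAttnDecoder(vocab=1000, n_concepts=100)
regions = enc(torch.randn(2, 1, 256, 256), torch.randn(2, 1, 256, 256))
logits = dec(regions, torch.randint(100, (2, 5)), torch.randint(1000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 1000])

On random inputs this prints per-token vocabulary logits, which would be trained with cross-entropy under teacher forcing. The self-critical fine-tuning stage with a coverage reward would then replace that objective with a REINFORCE-style update that rewards sampled reports for covering the extracted medical concepts; that stage is omitted from this sketch.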
Acknowledgment
This work has been supported by the National Natural Science Foundation of China (61772060, 61976012, 61602024), the Qianjiang Postdoctoral Foundation (2020-Y4-A-001), and the CERNET Innovation Project (NGII20170315).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, S., Niu, J., Wu, J., Liu, X. (2020). Automatic Medical Image Report Generation with Multi-view and Multi-modal Attention Mechanism. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol. 12454. Springer, Cham. https://doi.org/10.1007/978-3-030-60248-2_48
DOI: https://doi.org/10.1007/978-3-030-60248-2_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60247-5
Online ISBN: 978-3-030-60248-2
eBook Packages: Mathematics and Statistics (R0)