DOI: 10.1145/3474085.3475662

Similar Scenes Arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Published: 17 October 2021
  Abstract

    Stylized image captioning systems aim to generate a caption that is not only semantically related to a given image but also consistent with a given style description. One of the biggest challenges of this task is the lack of sufficient paired stylized data. Many studies focus on unsupervised approaches without considering the problem from the perspective of data augmentation. We begin with the observation that people tend to recall similar emotions when they are in similar scenes, and often express those emotions with similar style phrases, which underpins our data augmentation idea. In this paper, we propose a novel Extract-Retrieve-Generate data augmentation framework that extracts style phrases from small-scale stylized sentences and grafts them onto large-scale factual captions. First, we design an emotional signal extractor to extract style phrases from the small-scale stylized sentences. Second, we construct a pluggable multi-modal scene retriever to retrieve scenes, each represented as a pair of an image and its stylized caption, that are similar to the query image or caption from the large-scale factual data. Finally, based on the style phrases of similar scenes and the factual description of the current scene, we build an emotion-aware caption generator to produce fluent and diverse stylized captions for the current scene. Extensive experimental results show that our framework effectively alleviates the data scarcity problem. It also significantly boosts the performance of several existing image captioning models in both supervised and unsupervised settings, outperforming state-of-the-art stylized image captioning methods in terms of both sentence relevance and stylishness by a substantial margin.
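
    The abstract describes a three-stage pipeline (extract, retrieve, generate) for building pseudo-parallel stylized data. The toy Python sketch below only mirrors that structure under simplifying assumptions: the fixed style lexicon, the bag-of-words caption similarity, and the phrase-grafting generator are stand-ins invented for illustration, not the paper's actual emotional signal extractor, multi-modal scene retriever, or emotion-aware generator.

    from collections import Counter
    from dataclasses import dataclass
    from typing import List

    # Toy style lexicon; the paper learns style phrases rather than using a fixed list.
    STYLE_LEXICON = {"lovely", "adorable", "lonely", "gloomy", "happy", "terrible"}

    @dataclass
    class Scene:
        factual_caption: str
        stylized_caption: str = ""

    def extract_style_phrases(stylized_caption: str) -> List[str]:
        # Emotional signal extractor (toy): keep sentiment-bearing words from the lexicon.
        return [w for w in stylized_caption.lower().split() if w in STYLE_LEXICON]

    def similarity(a: str, b: str) -> float:
        # Scene similarity (toy): bag-of-words overlap between two captions.
        wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
        return sum((wa & wb).values()) / (1 + sum((wa | wb).values()))

    def retrieve_similar_scenes(query: Scene, stylized_pool: List[Scene], k: int = 3) -> List[Scene]:
        # Scene retriever (toy): only captions are compared here; the paper's retriever
        # is multi-modal and can also match on image similarity.
        ranked = sorted(stylized_pool,
                        key=lambda s: similarity(query.factual_caption, s.factual_caption),
                        reverse=True)
        return ranked[:k]

    def generate_stylized_caption(factual_caption: str, phrases: List[str]) -> str:
        # Emotion-aware generator (toy): graft the most frequent retrieved style phrase
        # onto the factual caption; the paper trains a generator for fluent output.
        if not phrases:
            return factual_caption
        top_phrase = Counter(phrases).most_common(1)[0][0]
        return f"{factual_caption}, looking {top_phrase}"

    def augment(factual_data: List[Scene], stylized_pool: List[Scene]) -> List[Scene]:
        # Assemble pseudo-parallel stylized captions for the large-scale factual data.
        for scene in factual_data:
            neighbours = retrieve_similar_scenes(scene, stylized_pool)
            phrases = [p for n in neighbours for p in extract_style_phrases(n.stylized_caption)]
            scene.stylized_caption = generate_stylized_caption(scene.factual_caption, phrases)
        return factual_data

    # Example (toy data): similar scenes contribute their style phrases to the query.
    pool = [Scene("a dog runs on the grass", "a lovely dog runs happily on the grass"),
            Scene("a man sits alone on a bench", "a lonely man sits on a cold bench")]
    queries = [Scene("a small dog plays on the grass")]
    print(augment(queries, pool)[0].stylized_caption)
    # e.g. "a small dog plays on the grass, looking lovely"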

      Published In

      MM '21: Proceedings of the 29th ACM International Conference on Multimedia
      October 2021
      5796 pages
      ISBN:9781450386517
      DOI:10.1145/3474085
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 October 2021


      Author Tags

      1. data augmentation
      2. stylized image captioning

      Qualifiers

      • Research-article

      Funding Sources

      • Alibaba-Zhejiang University Joint Institute of Frontier Technologies
      • MoE Engineering Research Center of Digital Library
      • Fundamental Research Funds for the Central Universities
      • NSFC
      • Chinese Knowledge Center for Engineering Sciences and Technology

      Conference

      MM '21: ACM Multimedia Conference
      October 20-24, 2021
      Virtual Event, China

      Acceptance Rates

      Overall acceptance rate: 995 of 4,171 submissions (24%)

      Article Metrics

      • Downloads (last 12 months): 59
      • Downloads (last 6 weeks): 3

      Cited By

      • (2024) Emotional Video Captioning With Vision-Based Emotion Interpretation Network. IEEE Transactions on Image Processing, 33, 1122-1135. DOI: 10.1109/TIP.2024.3359045. Online publication date: 1-Feb-2024.
      • (2024) TridentCap: Image-Fact-Style Trident Semantic Framework for Stylized Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 3563-3575. DOI: 10.1109/TCSVT.2023.3315133. Online publication date: May-2024.
      • (2023) Deep-learning-based image captioning: analysis and prospects. Journal of Image and Graphics, 28(9), 2788-2816. DOI: 10.11834/jig.220660. Online publication date: 2023.
      • (2023) Emotion Interpretational Caption of Art Visual based on Reinforcement Learning. 2023 2nd International Conference on Image Processing and Media Computing (ICIPMC), 82-88. DOI: 10.1109/ICIPMC58929.2023.00021. Online publication date: 26-May-2023.
      • (2023) Evolution of visual data captioning Methods, Datasets, and evaluation Metrics. Expert Systems with Applications: An International Journal, 221:C. DOI: 10.1016/j.eswa.2023.119773. Online publication date: 1-Jul-2023.
      • (2023) Cross-domain multi-style merge for image captioning. Computer Vision and Image Understanding, 228:C. DOI: 10.1016/j.cviu.2022.103617. Online publication date: 1-Feb-2023.
      • (2023) Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets. International Journal of Computer Vision, 132(5), 1701-1720. DOI: 10.1007/s11263-023-01949-w. Online publication date: 5-Dec-2023.
      • (2023) Sentimental Visual Captioning using Multimodal Transformer. International Journal of Computer Vision, 131(4), 1073-1090. DOI: 10.1007/s11263-023-01752-7. Online publication date: 6-Feb-2023.
      • (2022) A Review of Stylized Image Captioning Techniques, Evaluation Parameters, and Datasets. 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), 1-5. DOI: 10.1109/AIST55798.2022.10064842. Online publication date: 9-Dec-2022.
      • Cross-Domain Multi-Style Merge for Image Captioning. SSRN Electronic Journal. DOI: 10.2139/ssrn.4162675.
