DOI: 10.1145/3394171.3413777

IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning

Published: 12 October 2020 Publication History

Abstract

Conditional image generation is an active research topic that includes text-to-image synthesis and image translation. Recently, image manipulation with linguistic instruction has brought new challenges to multimodal conditional generation. However, traditional conditional image generation models mainly focus on producing high-quality, visually realistic images and fail to resolve the partial consistency between image and instruction. To address this issue, we propose an Increment Reasoning Generative Adversarial Network (IR-GAN), which aims to reason about the consistency between the visual increment in images and the semantic increment in instructions. First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment. Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image plays the role of a referring auxiliary. Finally, we propose a reasoning discriminator that measures the consistency between the visual increment and the semantic increment, which purifies the user's intention and guarantees the logical coherence of the generated target image. Extensive experiments and visualizations on two datasets show the effectiveness of IR-GAN.
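The core idea of the reasoning discriminator — checking that the change between source and target image matches the change the instruction requests — can be sketched in plain Python/NumPy. This is an illustrative stand-in, not the paper's implementation: the feature vectors, the `increment_consistency` function, and the use of cosine similarity in place of a learned discriminator are all assumptions for the sketch.

```python
import numpy as np

def increment_consistency(src_feat, tgt_feat, instr_feat):
    """Score how well the visual increment (target minus source features)
    aligns with the semantic increment (instruction embedding).

    IR-GAN learns this alignment adversarially; cosine similarity is
    used here only as a simple, fixed proxy for that judgment.
    """
    visual_inc = tgt_feat - src_feat  # visual increment between images
    num = float(np.dot(visual_inc, instr_feat))
    den = np.linalg.norm(visual_inc) * np.linalg.norm(instr_feat) + 1e-8
    return num / den

# Toy 4-d features: the instruction asks for exactly the change made,
# so the visual and semantic increments align and the score is high.
src = np.array([1.0, 0.0, 0.0, 0.0])
tgt = np.array([1.0, 1.0, 0.0, 0.0])
instr = np.array([0.0, 1.0, 0.0, 0.0])  # "add" the second attribute
print(round(increment_consistency(src, tgt, instr), 3))
```

A mismatched instruction (one pointing at an attribute that did not change) would score near zero under the same measure, which is the signal the reasoning discriminator exploits to reject logically inconsistent target images.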

Supplementary Material

M4V File (3394171.3413777.m4v)




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. adversarial networks
  2. image manipulation with linguistic instruction
  3. increment reasoning

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Key R&D Program of China

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (Last 12 months)28
  • Downloads (Last 6 weeks)4
Reflects downloads up to 12 Nov 2024


Cited By

  • (2024) Learning From Box Annotations for Referring Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems, 35(3): 3927-3937. DOI: 10.1109/TNNLS.2022.3201372. Online publication date: Mar-2024.
  • (2024) From Text to Canvas: Unleashing the Artistry of GAN-Driven Image Synthesis. 2024 IEEE International Conference on Information Technology, Electronics and Intelligent Communication Systems (ICITEICS), 1-5. DOI: 10.1109/ICITEICS61368.2024.10625170. Online publication date: 28-Jun-2024.
  • (2023) From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation. IEEE Transactions on Multimedia, 25: 7248-7261. DOI: 10.1109/TMM.2022.3219677. Online publication date: 1-Jan-2023.
  • (2023) Language-Based Image Manipulation Built on Language-Guided Ranking. IEEE Transactions on Multimedia, 25: 6219-6231. DOI: 10.1109/TMM.2022.3207000. Online publication date: 1-Jan-2023.
  • (2023) DSG-GAN: Multi-turn text-to-image synthesis via dual semantic-stream guidance with global and local linguistics. Intelligent Systems with Applications, article 200271. DOI: 10.1016/j.iswa.2023.200271. Online publication date: Aug-2023.
  • (2023) VTM-GAN: video-text matcher based generative adversarial network for generating videos from textual description. International Journal of Information Technology, 16(1): 221-236. DOI: 10.1007/s41870-023-01468-4. Online publication date: 16-Sep-2023.
  • (2022) GSAIC: GeoScience Articles Illustration and Caption Dataset. Highlights in Science, Engineering and Technology, 9: 289-297. DOI: 10.54097/hset.v9i.1858. Online publication date: 30-Sep-2022.
  • (2022) Towards Open-Ended Text-to-Face Generation, Combination and Manipulation. Proceedings of the 30th ACM International Conference on Multimedia, 5045-5054. DOI: 10.1145/3503161.3547758. Online publication date: 10-Oct-2022.
  • (2022) Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18092-18101. DOI: 10.1109/CVPR52688.2022.01758. Online publication date: Jun-2022.
  • (2022) Bidirectional difference locating and semantic consistency reasoning for change captioning. International Journal of Intelligent Systems, 37(5): 2969-2987. DOI: 10.1002/int.22821. Online publication date: 19-Jan-2022.
