DOI: 10.1145/3394171.3413551
research-article

Sequential Attention GAN for Interactive Image Editing

Published: 12 October 2020

Abstract

Most existing text-to-image synthesis tasks are static, single-turn generation based on pre-defined textual descriptions of images. To explore more practical and interactive real-life applications, we introduce a new task, Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. In each session, the agent takes a natural language description from the user as input and modifies the image generated in the previous turn into a new design that follows the user's description. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN), which applies a neural state tracker to encode the previous image and the textual description in each turn of the sequence, and uses a GAN framework to generate a modified version of the image that is consistent with the preceding images and coherent with the description. To achieve better region-specific refinement, we also introduce a sequential attention mechanism into the model. To benchmark the new task, we introduce two new datasets, Zap-Seq and DeepFashion-Seq, which contain multi-turn sessions with image-description sequences in the fashion domain. Experiments on both datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics, including visual quality, image sequence coherence, and text-image consistency.
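The per-turn update described in the abstract (a neural state tracker that fuses the previous image with the current textual command, plus attention over the command's words for region-specific refinement) can be sketched at turn level. The snippet below is a minimal NumPy illustration assuming a GRU-style state update and dot-product attention; the dimensions, the additive fusion of image and text features, and the random weight matrices are all hypothetical placeholders for exposition, not the paper's actual SeqAttnGAN components.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, word_feats):
    # Dot-product attention: score each word of the command against the query,
    # then return the attention-weighted text context.
    scores = word_feats @ query                # (num_words,)
    weights = softmax(scores)                  # attention distribution over words
    context = weights @ word_feats             # (dim,) weighted text context
    return context, weights

def gru_step(state, inp, Wz, Uz, Wh, Uh):
    # Minimal GRU-style update fusing the previous dialogue state
    # with the current turn's input.
    z = 1.0 / (1.0 + np.exp(-(Wz @ inp + Uz @ state)))  # update gate
    h = np.tanh(Wh @ inp + Uh @ state)                   # candidate state
    return (1 - z) * state + z * h

dim = 8                                        # toy feature dimension
state = np.zeros(dim)                          # dialogue state, empty at turn 0
Wz, Uz, Wh, Uh = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4))

# One editing turn: features of the previously generated image and of the
# words in the user's new command (both random stand-ins here).
img_feat = rng.normal(size=dim)
word_feats = rng.normal(size=(5, dim))         # a 5-word command

context, weights = attend(img_feat, word_feats)
turn_input = img_feat + context                # fuse image and attended text
state = gru_step(state, turn_input, Wz, Uz, Wh, Uh)

print(weights.sum())                           # attention weights sum to 1
```

In the full model, this updated state would condition the GAN generator for the next image in the sequence; here it only demonstrates how a tracker can carry cross-turn context while attention localizes the edit to the words that request it.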

Supplementary Material

MP4 File (3394171.3413551.mp4)
A video presentation of the paper: Sequential Attention GAN for Interactive Image Editing, published in MM 2020.



Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171


Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. generative adversarial network
  2. image editing with natural language
  3. sequential attention

Qualifiers

  • Research-article

Conference

MM '20
Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

