Article

Imagine This! Scripts to Compositions to Videos

Published: 08 September 2018

Abstract

Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. Craft explicitly predicts a temporal-layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database and fuses them to generate scene videos. Our contributions include sequential training of components of Craft while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate Craft on semantic fidelity to caption, composition consistency, and visual quality. Craft outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate Craft on Flintstones (Flintstones is available at https://prior.allenai.org/projects/craft), a new richly annotated video-caption dataset with over 25000 videos. For a glimpse of videos generated by Craft, see https://youtu.be/688Vv86n0z8.
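To make the retrieval component described above concrete, the sketch below shows one plausible reading of it: caption features and candidate entity-segment features are projected into a joint embedding space, scored by cosine similarity, and trained with a max-margin ranking loss so that matching caption-segment pairs outscore mismatched ones. This is a minimal illustration under assumed names and dimensions (EntityRetriever, text_dim, video_dim, joint_dim, and the specific hinge-loss form are all hypothetical), not the authors' implementation.

```python
# Illustrative sketch only: a joint text-video embedding with a max-margin
# retrieval loss, in the spirit of the "losses that encourage learning
# compositional representations for retrieval" mentioned in the abstract.
# All class names, feature dimensions, and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EntityRetriever(nn.Module):
    """Projects caption and video-segment features into a shared space."""

    def __init__(self, text_dim=128, video_dim=512, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.video_proj = nn.Linear(video_dim, joint_dim)

    def forward(self, caption_feats, segment_feats):
        # L2-normalize so the dot product below is cosine similarity.
        q = F.normalize(self.text_proj(caption_feats), dim=-1)   # (B, joint_dim)
        k = F.normalize(self.video_proj(segment_feats), dim=-1)  # (B, joint_dim)
        return q @ k.t()                                          # (B, B) scores


def max_margin_retrieval_loss(scores, margin=0.2):
    """Hinge loss: matched pairs (the diagonal) should beat mismatches by `margin`."""
    batch = scores.size(0)
    positives = scores.diag().view(batch, 1)
    cost_caption = (margin + scores - positives).clamp(min=0)      # caption -> wrong segment
    cost_segment = (margin + scores - positives.t()).clamp(min=0)  # segment -> wrong caption
    mask = torch.eye(batch, dtype=torch.bool, device=scores.device)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_segment = cost_segment.masked_fill(mask, 0)
    return (cost_caption + cost_segment).mean()


if __name__ == "__main__":
    retriever = EntityRetriever()
    captions = torch.randn(8, 128)   # stand-in caption embeddings
    segments = torch.randn(8, 512)   # stand-in entity-segment features
    loss = max_margin_retrieval_loss(retriever(captions, segments))
    loss.backward()
    print(f"retrieval loss: {loss.item():.3f}")
```

In the full pipeline the abstract describes, such a retriever would sit between the layout composer (which predicts where and when each mentioned entity appears) and the fusion step that composites the retrieved segments into the scene video; those stages are omitted from this sketch.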





        Published In

        Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII
        Sep 2018
        845 pages
        ISBN: 978-3-030-01236-6
        DOI: 10.1007/978-3-030-01237-3

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 08 September 2018

        Qualifiers

        • Article


        Cited By

        • (2024) Enhancing Multimodal Large Language Models on Demonstrative Multi-Image Instructions. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11429-11434. DOI: 10.1145/3664647.3688994. Online publication date: 28-Oct-2024.
        • (2024) DEMON24: ACM MM24 Demonstrative Instruction Following Challenge. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11426-11428. DOI: 10.1145/3664647.3688993. Online publication date: 28-Oct-2024.
        • (2024) CoIn: A Lightweight and Effective Framework for Story Visualization and Continuation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 10659-10668. DOI: 10.1145/3664647.3680873. Online publication date: 28-Oct-2024.
        • (2023) Story-to-Images Translation: Leveraging Diffusion Models and Large Language Models for Sequence Image Generation. Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos, pp. 57-63. DOI: 10.1145/3607540.3617144. Online publication date: 29-Oct-2023.
        • (2022) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation. Computer Vision – ECCV 2022, pp. 70-87. DOI: 10.1007/978-3-031-19836-6_5. Online publication date: 23-Oct-2022.
        • (2021) CAGAN: Text-To-Image Generation with Combined Attention Generative Adversarial Networks. Pattern Recognition, pp. 392-404. DOI: 10.1007/978-3-030-92659-5_25. Online publication date: 28-Sep-2021.
        • (2020) Generating need-adapted multimodal fragments. Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 335-346. DOI: 10.1145/3377325.3377487. Online publication date: 17-Mar-2020.
        • (2020) Sound2Sight: Generating Visual Dynamics from Sound and Context. Computer Vision – ECCV 2020, pp. 701-719. DOI: 10.1007/978-3-030-58583-9_42. Online publication date: 23-Aug-2020.
