Article

Imagine This! Scripts to Compositions to Videos

Published: 08 September 2018

Abstract

Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. Craft explicitly predicts a temporal-layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database and fuses them to generate scene videos. Our contributions include sequential training of components of Craft while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate Craft on semantic fidelity to caption, composition consistency, and visual quality. Craft outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate Craft on Flintstones (Flintstones is available at https://prior.allenai.org/projects/craft), a new richly annotated video-caption dataset with over 25000 videos. For a glimpse of videos generated by Craft, see https://youtu.be/688Vv86n0z8.
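To make the retrieval component described above concrete, the sketch below shows one plausible reading of it: caption features and candidate entity-segment features are projected into a joint embedding space, scored by cosine similarity, and trained with a max-margin ranking loss so that matching caption-segment pairs outscore mismatched ones. This is a minimal illustration under assumed names and dimensions (EntityRetriever, text_dim, video_dim, joint_dim, and the specific hinge-loss form are all hypothetical), not the authors' implementation.

```python
# Illustrative sketch only: a joint text-video embedding with a max-margin
# retrieval loss, in the spirit of the "losses that encourage learning
# compositional representations for retrieval" mentioned in the abstract.
# All class names, feature dimensions, and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EntityRetriever(nn.Module):
    """Projects caption and video-segment features into a shared space."""

    def __init__(self, text_dim=128, video_dim=512, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.video_proj = nn.Linear(video_dim, joint_dim)

    def forward(self, caption_feats, segment_feats):
        # L2-normalize so the dot product below is cosine similarity.
        q = F.normalize(self.text_proj(caption_feats), dim=-1)   # (B, joint_dim)
        k = F.normalize(self.video_proj(segment_feats), dim=-1)  # (B, joint_dim)
        return q @ k.t()                                          # (B, B) scores


def max_margin_retrieval_loss(scores, margin=0.2):
    """Hinge loss: matched pairs (the diagonal) should beat mismatches by `margin`."""
    batch = scores.size(0)
    positives = scores.diag().view(batch, 1)
    cost_caption = (margin + scores - positives).clamp(min=0)      # caption -> wrong segment
    cost_segment = (margin + scores - positives.t()).clamp(min=0)  # segment -> wrong caption
    mask = torch.eye(batch, dtype=torch.bool, device=scores.device)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_segment = cost_segment.masked_fill(mask, 0)
    return (cost_caption + cost_segment).mean()


if __name__ == "__main__":
    retriever = EntityRetriever()
    captions = torch.randn(8, 128)   # stand-in caption embeddings
    segments = torch.randn(8, 512)   # stand-in entity-segment features
    loss = max_margin_retrieval_loss(retriever(captions, segments))
    loss.backward()
    print(f"retrieval loss: {loss.item():.3f}")
```

In the full pipeline the abstract describes, such a retriever would sit between the layout composer (which predicts where and when each mentioned entity appears) and the fusion step that composites the retrieved segments into the scene video; those stages are omitted from this sketch.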





        Published In

        Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII
        Sep 2018
        845 pages
        ISBN: 978-3-030-01236-6
        DOI: 10.1007/978-3-030-01237-3

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 08 September 2018

        Qualifiers

        • Article


        Cited By

        • (2024) Enhancing Multimodal Large Language Models on Demonstrative Multi-Image Instructions. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11429-11434. DOI: 10.1145/3664647.3688994. Online publication date: 28-Oct-2024.
        • (2024) DEMON24: ACM MM24 Demonstrative Instruction Following Challenge. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11426-11428. DOI: 10.1145/3664647.3688993. Online publication date: 28-Oct-2024.
        • (2024) CoIn: A Lightweight and Effective Framework for Story Visualization and Continuation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 10659-10668. DOI: 10.1145/3664647.3680873. Online publication date: 28-Oct-2024.
        • (2023) Story-to-Images Translation: Leveraging Diffusion Models and Large Language Models for Sequence Image Generation. Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos, pp. 57-63. DOI: 10.1145/3607540.3617144. Online publication date: 29-Oct-2023.
        • (2022) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation. Computer Vision – ECCV 2022, pp. 70-87. DOI: 10.1007/978-3-031-19836-6_5. Online publication date: 23-Oct-2022.
        • (2021) CAGAN: Text-To-Image Generation with Combined Attention Generative Adversarial Networks. Pattern Recognition, pp. 392-404. DOI: 10.1007/978-3-030-92659-5_25. Online publication date: 28-Sep-2021.
        • (2020) Generating need-adapted multimodal fragments. Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 335-346. DOI: 10.1145/3377325.3377487. Online publication date: 17-Mar-2020.
        • (2020) Sound2Sight: Generating Visual Dynamics from Sound and Context. Computer Vision – ECCV 2020, pp. 701-719. DOI: 10.1007/978-3-030-58583-9_42. Online publication date: 23-Aug-2020.
