Sound-Guided Semantic Video Generation

Lee, Seung Hyun; Oh, Gyeongrok; Byeon, Wonmin; Kim, Chanyoung; Ryoo, Won Jeong; Yoon, Sang Ho; Cho, Hyunjun; Bae, Jihyun; Kim, Jinkyu; Kim, Sangpil

arXiv:2204.09273 (cs)

[Submitted on 20 Apr 2022 (v1), last revised 21 Oct 2022 (this version, v4)]

Abstract: The recent success of StyleGAN demonstrates that the pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful because it is difficult to determine the direction and magnitude of movement in the StyleGAN latent space. In this paper, we propose a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space. As sound provides the temporal context of the scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound and generates a video in a hierarchical manner. We provide a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task. The experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further show several applications, including image and video editing, to verify the effectiveness of our method.
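
To make the idea of a sound-driven trajectory in latent space concrete, here is a minimal toy sketch in Python with NumPy. It is not the authors' implementation: the function names (`latent_trajectory`, `cosine_similarity`), the random projection `proj`, the step size, and the dimensions are all assumptions chosen for illustration, and random vectors stand in for real audio features and StyleGAN latent codes. The sketch only shows the general pattern the abstract describes, where each audio frame nudges a latent code along a direction so that the sequence of codes forms a path, one code per video frame.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors, as used to compare embeddings."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def latent_trajectory(w0, audio_feats, proj, step=0.1):
    """Toy stand-in for a frame generator (hypothetical, not the paper's model).

    Starting from latent code w0, take one bounded step per audio frame and
    return all visited codes with shape (num_frames + 1, latent_dim).
    """
    w = np.asarray(w0, dtype=np.float64)
    trajectory = [w.copy()]
    for a_t in audio_feats:
        # Assumed mapping from an audio feature to a latent direction;
        # tanh keeps every coordinate of the direction in [-1, 1].
        direction = np.tanh(proj @ a_t)
        w = w + step * direction
        trajectory.append(w.copy())
    return np.stack(trajectory)


# Random placeholders for the initial latent code and per-frame audio features.
rng = np.random.default_rng(0)
latent_dim, audio_dim, num_frames = 512, 128, 16
w0 = rng.standard_normal(latent_dim)
audio_feats = rng.standard_normal((num_frames, audio_dim))
proj = rng.standard_normal((latent_dim, audio_dim)) / np.sqrt(audio_dim)

traj = latent_trajectory(w0, audio_feats, proj)
print(traj.shape)  # (17, 512)
```

In a real system each row of `traj` would be decoded into a frame by a pre-trained generator, and a similarity such as `cosine_similarity` would be computed between learned audio and image embeddings rather than between raw vectors.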

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2204.09273 [cs.CV]
	(or arXiv:2204.09273v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.09273

Submission history

From: Seung Hyun Lee
[v1] Wed, 20 Apr 2022 07:33:10 UTC (45,369 KB)
[v2] Thu, 21 Apr 2022 02:13:31 UTC (45,369 KB)
[v3] Tue, 30 Aug 2022 08:00:48 UTC (45,368 KB)
[v4] Fri, 21 Oct 2022 06:10:08 UTC (6,130 KB)
