Learning temporal coherence via self-supervision for GAN-based video generation

Published: 12 August 2020

Abstract

Our work explores temporal self-supervision for GAN-based video generation tasks. While adversarial training successfully yields generative models for a variety of areas, temporal relationships in the generated data are much less explored. Natural temporal changes are crucial for sequential generation tasks, e.g., video super-resolution and unpaired video translation. For the former, state-of-the-art methods often favor simpler norm losses such as L2 over adversarial training. However, their averaging nature easily leads to temporally smooth results with an undesirable lack of spatial detail. For unpaired video translation, existing approaches modify the generator networks to form spatio-temporal cycle consistencies. In contrast, we focus on improving learning objectives and propose a temporally self-supervised algorithm. For both tasks, we show that temporal adversarial learning is key to achieving temporally coherent solutions without sacrificing spatial detail. We also propose a novel Ping-Pong loss to improve long-term temporal consistency. It effectively prevents recurrent networks from accumulating artifacts temporally without suppressing detailed features. Additionally, we propose a first set of metrics to quantitatively evaluate the accuracy as well as the perceptual quality of the temporal evolution. A series of user studies confirm the rankings computed with these metrics. Code, data, models, and results are provided at https://github.com/thunil/TecoGAN.
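The Ping-Pong loss mentioned above can be illustrated with a small sketch. The idea is to extend an input sequence in ping-pong order (forward, then reversed), run the recurrent generator over the extended sequence, and penalize differences between the output produced for a frame on the forward pass and the output produced for the same frame on the backward pass. The NumPy sketch below is a simplified illustration of that idea only, not the paper's actual implementation (which operates on a GAN generator with flow-based warping); the `generate` callable and all names here are hypothetical.

```python
import numpy as np

def ping_pong_loss(generate, frames):
    """Illustrative sketch of the Ping-Pong loss idea.

    `generate(frame, prev_output)` stands in for a recurrent generator
    conditioned on its previous output. The sequence is extended in
    ping-pong order, f_1 .. f_n .. f_1, and the loss penalizes any
    mismatch between the forward-pass and backward-pass outputs for
    the same source frame, discouraging temporal drift.
    """
    # Ping-pong ordering: forward pass, then the reverse (last frame used once).
    pp = frames + frames[-2::-1]

    prev = np.zeros_like(frames[0])  # recurrent state starts empty
    outputs = []
    for f in pp:
        prev = generate(f, prev)
        outputs.append(prev)

    n = len(frames)
    loss = 0.0
    # Compare the forward output for frame t with the backward output
    # for the same frame (mirrored position in the extended sequence).
    for t in range(n - 1):
        fwd = outputs[t]
        bwd = outputs[len(pp) - 1 - t]
        loss += np.mean((fwd - bwd) ** 2)
    return loss / (n - 1)
```

If the generator is free of drift (its output for a frame does not depend on the pass direction), the loss is zero; any state-dependent artifact accumulation makes the forward and backward outputs diverge and is penalized.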

Supplemental Material

• MP4 File: presentation video (with transcript)
• ZIP File: supplemental files



      Published In

      cover image ACM Transactions on Graphics
      ACM Transactions on Graphics  Volume 39, Issue 4
      August 2020
      1732 pages
      ISSN:0730-0301
      EISSN:1557-7368
      DOI:10.1145/3386569

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. generative adversarial network
      2. self-supervision
      3. temporal cycle-consistency
      4. unpaired video translation
      5. video super-resolution

      Qualifiers

      • Research-article


Cited By

• BasicVSR Model Filtering Study Using Squeeze and Excitation Block. Journal of Broadcast Engineering 29:1 (105-108). DOI: 10.5909/JBE.2024.29.1.105. 31 Jan 2024.
• Physics-Informed Computer Vision: A Review and Perspectives. ACM Computing Surveys. DOI: 10.1145/3689037. 20 Aug 2024.
• Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3678008. 12 Jul 2024.
• Learning Images Across Scales Using Adversarial Training. ACM Transactions on Graphics 43:4 (1-13). DOI: 10.1145/3658190. 19 Jul 2024.
• Deepfakes, Phrenology, Surveillance, and More! A Taxonomy of AI Privacy Risks. Proceedings of the CHI Conference on Human Factors in Computing Systems (1-19). DOI: 10.1145/3613904.3642116. 11 May 2024.
• 3D-Aware Talking-Head Video Motion Transfer. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (4942-4952). DOI: 10.1109/WACV57701.2024.00488. 3 Jan 2024.
• Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (1506-1516). DOI: 10.1109/WACV57701.2024.00154. 3 Jan 2024.
• Coarse- and Fine-Grained Fusion Hierarchical Network for Hole Filling in View Synthesis. IEEE Transactions on Image Processing 33 (322-337). DOI: 10.1109/TIP.2023.3341303. 1 Jan 2024.
• 3DAttGAN: A 3D Attention-Based Generative Adversarial Network for Joint Space-Time Video Super-Resolution. IEEE Transactions on Emerging Topics in Computational Intelligence 8:4 (3117-3128). DOI: 10.1109/TETCI.2024.3369994. Aug 2024.
• Innovative Workflow for AI-Generated Video: Addressing Limitations, Impact and Implications. 2024 IEEE Symposium on Industrial Electronics & Applications (ISIEA) (1-7). DOI: 10.1109/ISIEA61920.2024.10607369. 6 Jul 2024.
