DOI: 10.1145/3343031.3350937
Research Article

Mocycle-GAN: Unpaired Video-to-Video Translation

Published: 15 October 2019

Abstract

Unsupervised image-to-image translation is the task of translating an image from one domain to another in the absence of any paired training examples, and is therefore broadly applicable in practice. Nevertheless, extending such synthesis from image-to-image to video-to-video is not trivial, especially when it comes to capturing the spatio-temporal structure of videos. The difficulty arises because not only the visual appearance of each frame but also the motion between consecutive frames must be realistic and consistent across the transformation. This motivates us to exploit both appearance structure and temporal continuity in video synthesis. In this paper, we present a new Motion-guided Cycle GAN, dubbed Mocycle-GAN, that integrates motion estimation into an unpaired video translator. Technically, Mocycle-GAN capitalizes on three types of constraints: an adversarial constraint discriminating between synthetic and real frames, cycle consistency encouraging an inverse translation of both frames and motion, and motion translation validating the transfer of motion between consecutive frames. Extensive experiments are conducted on video-to-labels and labels-to-video translation, and superior results are reported compared to state-of-the-art methods. More remarkably, we qualitatively demonstrate Mocycle-GAN on both flower-to-flower and ambient-condition transfer.
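The abstract describes three training constraints: an adversarial term on translated frames, cycle consistency on both frames and motion, and a motion-translation term that carries estimated motion across domains. As a rough, illustrative sketch only (not the authors' released code), the snippet below shows one plausible way these three losses could be wired together in PyTorch. All names (G_XY, G_YX, D_Y, the flow estimator flow, the warp helper) and the specific loss forms (binary cross-entropy, L1) are assumptions introduced for illustration.

```python
# Schematic sketch of the three Mocycle-GAN constraints named in the abstract.
# This is NOT the paper's implementation; every interface here is assumed.
import torch
import torch.nn.functional as F


def warp(frame, flow_field):
    """Backward-warp a frame (B, C, H, W) with a dense flow field (B, 2, H, W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W), (x, y) order
    coords = base + flow_field                         # absolute sampling coordinates
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0      # normalize x to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0      # normalize y to [-1, 1]
    grid = torch.stack((gx, gy), dim=-1)               # (B, H, W, 2) for grid_sample
    return F.grid_sample(frame, grid, align_corners=True)


def mocycle_losses(G_XY, G_YX, D_Y, flow, x_t, x_t1):
    """Compute the three constraints for consecutive source-domain frames x_t, x_t1."""
    fake_y_t, fake_y_t1 = G_XY(x_t), G_XY(x_t1)

    # 1) Adversarial constraint: translated frames should be scored as real in domain Y.
    logits = D_Y(fake_y_t)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # 2) Cycle consistency on frames and motion: translating back should recover
    #    both the original frames and the original inter-frame motion.
    rec_x_t, rec_x_t1 = G_YX(fake_y_t), G_YX(fake_y_t1)
    cyc_frame = F.l1_loss(rec_x_t, x_t) + F.l1_loss(rec_x_t1, x_t1)
    cyc_motion = F.l1_loss(flow(rec_x_t, rec_x_t1), flow(x_t, x_t1))

    # 3) Motion translation constraint: source-domain motion, applied to the
    #    translated frame at time t, should predict the translated frame at t+1.
    motion_trans = F.l1_loss(warp(fake_y_t, flow(x_t, x_t1)), fake_y_t1)

    return adv, cyc_frame + cyc_motion, motion_trans
```

In a full training loop one would typically add the symmetric losses for the Y-to-X direction, weight the individual terms, and alternate generator and discriminator updates; those details are omitted from this sketch.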




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019


Author Tags

  1. gans
  2. unsupervised learning
  3. video-to-video translation

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (Last 12 months)94
  • Downloads (Last 6 weeks)11
Reflects downloads up to 13 Jan 2025


Cited By

  • (2025) Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum. ACS Photonics. DOI: 10.1021/acsphotonics.4c01321. Online publication date: 2-Jan-2025.
  • (2025) Deepfakes in digital media forensics: Generation, AI-based detection and challenges. Journal of Information Security and Applications, 88:103935. DOI: 10.1016/j.jisa.2024.103935. Online publication date: Feb-2025.
  • (2024) DBSF-Net: Infrared Image Colorization Based on the Generative Adversarial Model with Dual-Branch Feature Extraction and Spatial-Frequency-Domain Discrimination. Remote Sensing, 16(20):3766. DOI: 10.3390/rs16203766. Online publication date: 10-Oct-2024.
  • (2024) Dynamic Fashion Video Synthesis from Static Imagery. Future Internet, 16(8):287. DOI: 10.3390/fi16080287. Online publication date: 8-Aug-2024.
  • (2024) Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models. Proceedings of the 32nd ACM International Conference on Multimedia, 6870-6879. DOI: 10.1145/3664647.3681634. Online publication date: 28-Oct-2024.
  • (2024) Depth-Aware Unpaired Video Dehazing. IEEE Transactions on Image Processing, 33:2388-2403. DOI: 10.1109/TIP.2024.3378472. Online publication date: 2024.
  • (2024) Multinetwork Algorithm for Coastal Line Segmentation in Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, 62:1-12. DOI: 10.1109/TGRS.2024.3435963. Online publication date: 2024.
  • (2024) Exploring Spatiotemporal Consistency of Features for Video Translation in Consumer Internet of Things. IEEE Transactions on Consumer Electronics, 70(1):3077-3087. DOI: 10.1109/TCE.2023.3331009. Online publication date: Feb-2024.
  • (2024) Temporally Consistent Unpaired Multi-domain Video Translation by Contrastive Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650014. Online publication date: 30-Jun-2024.
  • (2024) VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4896-4905. DOI: 10.1109/CVPR52733.2024.00468. Online publication date: 16-Jun-2024.
