DOI: 10.1145/3343031.3350937
Research Article

Mocycle-GAN: Unpaired Video-to-Video Translation

Published: 15 October 2019

Abstract

Unsupervised image-to-image translation is the task of translating an image from one domain to another in the absence of any paired training examples, and is therefore broadly applicable in practice. Nevertheless, extending such synthesis from image-to-image to video-to-video is not trivial, especially when it comes to capturing the spatio-temporal structure of videos. The difficulty arises because not only the visual appearance of each frame but also the motion between consecutive frames must be realistic and consistent across the transformation. This motivates us to exploit both appearance structure and temporal continuity in video synthesis. In this paper, we present a new Motion-guided Cycle GAN, dubbed Mocycle-GAN, that integrates motion estimation into an unpaired video translator. Technically, Mocycle-GAN capitalizes on three types of constraints: an adversarial constraint discriminating between synthetic and real frames, cycle consistency encouraging an inverse translation of both frames and motion, and motion translation validating the transfer of motion between consecutive frames. Extensive experiments are conducted on video-to-labels and labels-to-video translation, and superior results are reported compared to state-of-the-art methods. More remarkably, we qualitatively demonstrate Mocycle-GAN on both flower-to-flower and ambient-condition transfer.
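The abstract describes three training constraints: an adversarial term on translated frames, cycle consistency on both frames and motion, and a motion-translation term that carries estimated motion across domains. As a rough, illustrative sketch only (not the authors' released code), the snippet below shows one plausible way these three losses could be wired together in PyTorch. All names (G_XY, G_YX, D_Y, the flow estimator flow, the warp helper) and the specific loss forms (binary cross-entropy, L1) are assumptions introduced for illustration.

```python
# Schematic sketch of the three Mocycle-GAN constraints named in the abstract.
# This is NOT the paper's implementation; every interface here is assumed.
import torch
import torch.nn.functional as F


def warp(frame, flow_field):
    """Backward-warp a frame (B, C, H, W) with a dense flow field (B, 2, H, W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W), (x, y) order
    coords = base + flow_field                         # absolute sampling coordinates
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0      # normalize x to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0      # normalize y to [-1, 1]
    grid = torch.stack((gx, gy), dim=-1)               # (B, H, W, 2) for grid_sample
    return F.grid_sample(frame, grid, align_corners=True)


def mocycle_losses(G_XY, G_YX, D_Y, flow, x_t, x_t1):
    """Compute the three constraints for consecutive source-domain frames x_t, x_t1."""
    fake_y_t, fake_y_t1 = G_XY(x_t), G_XY(x_t1)

    # 1) Adversarial constraint: translated frames should be scored as real in domain Y.
    logits = D_Y(fake_y_t)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # 2) Cycle consistency on frames and motion: translating back should recover
    #    both the original frames and the original inter-frame motion.
    rec_x_t, rec_x_t1 = G_YX(fake_y_t), G_YX(fake_y_t1)
    cyc_frame = F.l1_loss(rec_x_t, x_t) + F.l1_loss(rec_x_t1, x_t1)
    cyc_motion = F.l1_loss(flow(rec_x_t, rec_x_t1), flow(x_t, x_t1))

    # 3) Motion translation constraint: source-domain motion, applied to the
    #    translated frame at time t, should predict the translated frame at t+1.
    motion_trans = F.l1_loss(warp(fake_y_t, flow(x_t, x_t1)), fake_y_t1)

    return adv, cyc_frame + cyc_motion, motion_trans
```

In a full training loop one would typically add the symmetric losses for the Y-to-X direction, weight the individual terms, and alternate generator and discriminator updates; those details are omitted from this sketch.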




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019


Author Tags

  1. gans
  2. unsupervised learning
  3. video-to-video translation

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (Last 12 months)94
  • Downloads (Last 6 weeks)11
Reflects downloads up to 13 Jan 2025


Cited By

  • (2025) Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum. ACS Photonics. DOI: 10.1021/acsphotonics.4c01321. Online publication date: 2-Jan-2025.
  • (2025) Deepfakes in digital media forensics: Generation, AI-based detection and challenges. Journal of Information Security and Applications, 88:103935. DOI: 10.1016/j.jisa.2024.103935. Online publication date: Feb-2025.
  • (2024) DBSF-Net: Infrared Image Colorization Based on the Generative Adversarial Model with Dual-Branch Feature Extraction and Spatial-Frequency-Domain Discrimination. Remote Sensing, 16(20):3766. DOI: 10.3390/rs16203766. Online publication date: 10-Oct-2024.
  • (2024) Dynamic Fashion Video Synthesis from Static Imagery. Future Internet, 16(8):287. DOI: 10.3390/fi16080287. Online publication date: 8-Aug-2024.
  • (2024) Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models. Proceedings of the 32nd ACM International Conference on Multimedia, 6870-6879. DOI: 10.1145/3664647.3681634. Online publication date: 28-Oct-2024.
  • (2024) Depth-Aware Unpaired Video Dehazing. IEEE Transactions on Image Processing, 33:2388-2403. DOI: 10.1109/TIP.2024.3378472. Online publication date: 2024.
  • (2024) Multinetwork Algorithm for Coastal Line Segmentation in Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, 62:1-12. DOI: 10.1109/TGRS.2024.3435963. Online publication date: 2024.
  • (2024) Exploring Spatiotemporal Consistency of Features for Video Translation in Consumer Internet of Things. IEEE Transactions on Consumer Electronics, 70(1):3077-3087. DOI: 10.1109/TCE.2023.3331009. Online publication date: Feb-2024.
  • (2024) Temporally Consistent Unpaired Multi-domain Video Translation by Contrastive Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650014. Online publication date: 30-Jun-2024.
  • (2024) VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4896-4905. DOI: 10.1109/CVPR52733.2024.00468. Online publication date: 16-Jun-2024.
