DOI: 10.1145/3664647.3680800

Harmony Everything! Masked Autoencoders for Video Harmonization

Published: 28 October 2024

Abstract

Video harmonization aims to address the discrepancy in color and lighting between foreground and background elements within video composites, thereby enhancing the innate coherence of composite video content. Nevertheless, existing methods struggle to handle composites with excessively large-scale foregrounds. In this paper, we propose Video Harmonization Masked Autoencoders (VHMAE), a simple yet powerful end-to-end video harmonization method designed to tackle this challenge once and for all. Unlike typical MAE-based methods that employ random or tube masking strategies, we innovatively treat all foregrounds requiring harmonization in each frame as prediction regions, which are designated as masked tokens and fed into our network to produce the final refined video. To this end, the network is optimized to prioritize the harmonization task, proficiently reconstructing the masked region despite the limited background information. Specifically, we introduce the Pattern Alignment Module (PAM) to extract content information from the extensive masked foreground region, aligning the latent semantic features of the masked foreground content with the background context while disregarding the influence of varying colors and illumination. Moreover, we propose the Patch Balancing Loss, which effectively mitigates the grid-like artifacts commonly observed in MAE-based approaches to image generation, thereby ensuring consistency between the predicted foreground and the visible background. Additionally, we introduce a real-composited video harmonization dataset named RCVH, which serves as a valuable benchmark for assessing video harmonization techniques across different real video sources. Comprehensive experiments demonstrate that our VHMAE outperforms state-of-the-art techniques on both the RCVH and HYouTube datasets.
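To make the masking strategy above concrete, the sketch below (not the authors' implementation; the patch size, function name, and return layout are assumptions for illustration) patchifies a composite frame and flags every patch that overlaps the pasted foreground as a masked token, i.e. a prediction region for the autoencoder:

```python
import numpy as np

def foreground_mask_tokens(frame, fg_mask, patch=16):
    """Split a frame into non-overlapping patches and mark every patch
    overlapping the composite foreground as a masked (prediction) token.

    frame:   (H, W, 3) float array, the composite video frame
    fg_mask: (H, W) bool array, True where the pasted foreground lies
    Returns (tokens, is_masked): flattened patches and a per-patch flag.
    """
    H, W, C = frame.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    # Rearrange into a (gh, gw) grid of (patch, patch, C) tiles,
    # then flatten each tile into one token vector.
    tiles = frame.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    tokens = tiles.reshape(gh * gw, patch * patch * C)
    # A patch is a prediction region if ANY of its pixels is foreground.
    mask_grid = fg_mask.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    is_masked = mask_grid.reshape(gh * gw, -1).any(axis=1)
    return tokens, is_masked
```

Unlike random or tube masking, the masked set here is exactly the foreground, so the effective masking ratio can become arbitrarily large when the pasted object dominates the frame, which is the regime the paper targets.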


Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. masked autoencoders
  2. video composite
  3. video harmonization
  4. video harmonization dataset

Qualifiers

  • Research-article

Conference

MM '24
Sponsor:
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%);
overall acceptance rate: 2,145 of 8,556 submissions (25%)
