Article
DOI: 10.1007/978-3-030-58517-4_31

Learning Joint Spatial-Temporal Transformations for Video Inpainting

Published: 23 August 2020

Abstract

High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching for missing content in reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations by using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN.
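The joint spatial-temporal attention summarized above can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch module, not the authors' implementation: the module name, the `channels=64` and `patch_size=8` hyperparameters, and the tensor shapes are all assumptions for illustration. It demonstrates the idea the abstract names: each frame's feature map is split into patches, and every patch attends to every patch from all frames at once, so a missing region can borrow coherent content from any spatial-temporal location. For the authors' actual model, see https://github.com/researchmm/STTN.

```python
# A minimal sketch of joint spatial-temporal self-attention over patches
# from all frames of a video, in the spirit of STTN. All names, shapes,
# and hyperparameters are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalAttention(nn.Module):
    """Every patch of every frame attends to all patches of all frames."""

    def __init__(self, channels=64, patch_size=8):
        super().__init__()
        self.patch_size = patch_size
        # 1x1 convolutions project features to queries, keys, and values.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):
        # feats: (B, T, C, H, W) feature maps for T frames of one video.
        b, t, c, h, w = feats.shape
        x = feats.view(b * t, c, h, w)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        p = self.patch_size

        def to_patches(y):
            # (B*T, C, H, W) -> (B, T*L, C*p*p), with L patches per frame.
            y = F.unfold(y, kernel_size=p, stride=p)      # (B*T, C*p*p, L)
            y = y.view(b, t, c * p * p, -1).permute(0, 1, 3, 2)
            return y.reshape(b, -1, c * p * p)

        q, k, v = to_patches(q), to_patches(k), to_patches(v)

        # Joint attention across all spatial-temporal patches: a hole in
        # one frame can copy consistent content from any other location.
        d = q.shape[-1]
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = attn @ v                                    # (B, T*L, C*p*p)

        # Fold the attended patches back into frame-shaped feature maps.
        out = out.view(b, t, -1, c * p * p).permute(0, 1, 3, 2)
        out = out.reshape(b * t, c * p * p, -1)
        out = F.fold(out, output_size=(h, w), kernel_size=p, stride=p)
        return out.view(b, t, c, h, w)

# Hypothetical usage: features for 5 frames at 64x64 resolution.
feats = torch.randn(1, 5, 64, 64, 64)
print(SpatialTemporalAttention()(feats).shape)  # torch.Size([1, 5, 64, 64, 64])
```

Because all frames are completed in one joint pass rather than frame by frame, the attention results stay consistent across time; in the full model, such a block would be trained together with the spatial-temporal adversarial loss mentioned above.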



Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI
Aug 2020
845 pages
ISBN: 978-3-030-58516-7
DOI: 10.1007/978-3-030-58517-4

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 August 2020

Author Tags

  1. Video inpainting
  2. Generative adversarial networks

Qualifiers

  • Article

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Jan 2025

Cited By

  • (2024) BONES: Near-Optimal Neural-Enhanced Video Streaming. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 8(2), 1–28. DOI: 10.1145/3656014. Online publication date: 29-May-2024.
  • (2023) Bitstream-corrupted video recovery. Proceedings of the 37th International Conference on Neural Information Processing Systems, 68420–68433. DOI: 10.5555/3666122.3669113. Online publication date: 10-Dec-2023.
  • (2023) Look ma, no hands! Agent-environment factorization of egocentric videos. Proceedings of the 37th International Conference on Neural Information Processing Systems, 21466–21486. DOI: 10.5555/3666122.3667060. Online publication date: 10-Dec-2023.
  • (2023) UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization. Proceedings of the 31st ACM International Conference on Multimedia, 8749–8759. DOI: 10.1145/3581783.3613767. Online publication date: 26-Oct-2023.
  • (2023) Mask-Guided Progressive Network for Joint Raindrop and Rain Streak Removal in Videos. Proceedings of the 31st ACM International Conference on Multimedia, 7216–7225. DOI: 10.1145/3581783.3612001. Online publication date: 26-Oct-2023.
  • (2023) Semantic-Guided Completion Network for Video Inpainting in Complex Urban Scene. Pattern Recognition and Computer Vision, 224–236. DOI: 10.1007/978-981-99-8552-4_18. Online publication date: 13-Oct-2023.
  • (2023) VIFST: Video Inpainting Localization Using Multi-view Spatial-Frequency Traces. PRICAI 2023: Trends in Artificial Intelligence, 434–446. DOI: 10.1007/978-981-99-7025-4_37. Online publication date: 15-Nov-2023.
  • (2023) Generalizable Deep Video Inpainting Detection Based on Constrained Convolutional Neural Networks. Digital Forensics and Watermarking, 125–138. DOI: 10.1007/978-981-97-2585-4_9. Online publication date: 25-Nov-2023.
  • (2023) CycleSTTN: A Learning-Based Temporal Model for Specular Augmentation in Endoscopy. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, 570–580. DOI: 10.1007/978-3-031-43999-5_54. Online publication date: 8-Oct-2023.
  • (2022) Class-aware adversarial transformers for medical image segmentation. Proceedings of the 36th International Conference on Neural Information Processing Systems, 29582–29596. DOI: 10.5555/3600270.3602415. Online publication date: 28-Nov-2022.