1 Introduction

Image inpainting is a fundamental task in computer vision and image processing that involves the restoration or completion of missing or damaged regions within an image. It provides a key building block for many applications, including photograph restoration, video editing, and medical imaging. Over the years, different techniques have been developed to address image or video inpainting, ranging from traditional patch-based or exemplar-based approaches to recent deep learning-based techniques (Wang et al. 2013). Patch-based image inpainting works by searching for patches (small regions) within the image to fill in the missing or damaged parts (Newson et al. 2017). The algorithm looks for the patches that best match the boundary conditions of the damaged area and uses them to reconstruct the missing parts; it is particularly effective for small damaged regions. The exemplar-based technique extends the patch-based approach by incorporating additional information, such as texture and structure, into the selection process (Abdulla and Ahmed 2021). These methods prioritize patches whose structural or textural information fits the surrounding region, so the inpainting results are more coherent with the overall image content.

Recently, the advent of convolutional neural networks (CNNs) has revolutionized the field of image inpainting by training inpainting models on large-scale datasets (Elharrouss et al. 2024). The learning capabilities of neural networks allow them to effectively capture contextual information to fill the missing parts of the image while preserving its coherence and quality. These approaches have demonstrated remarkable performance improvements over traditional methods, particularly in handling complex and large-scale inpainting tasks. In addition to CNN-based models, advanced techniques such as generative adversarial networks (GANs) have been employed to generate high-quality inpainted images and to maintain the structural and textural integrity of the original image; they are also powerful in handling complex textures and large missing regions.

Furthermore, transformer-based architectures, initially proposed for natural language processing tasks, have gained significant interest in computer vision, including image inpainting (Lin et al. 2022). Transformers with self-attention mechanisms excel at capturing long-range dependencies, making them well suited to tasks that require global context understanding, such as image inpainting (Han et al. 2022). Transformer-based inpainting methods can exploit contextual information from the entire image to produce accurate and coherent results. In this paper, we provide a comprehensive overview of transformer-based image and video inpainting methods, organized by category of approach, architecture, and performance characteristics.

Fig. 1
figure 1

Timeline of image/video inpainting reviews published in recent years, indicating the types of methods analyzed in each review

We present a selection of influential and essential algorithms from prestigious journals and conferences. The focus of this study is on contemporary transformer-based image or video inpainting methodologies, which can provide a deeper understanding of the advancements in image or video inpainting. Additionally, a discussion of recent advancements, challenges, and future research directions in the field of image or video inpainting using transformer-based methodologies is provided. The content of this review is presented as follows:

  • Summarization of the existing surveys.

  • A description of different types of damage.

  • Classification of transformer-based image or video inpainting methods.

  • Public datasets deployed to evaluate the image or video inpainting methods.

  • Evaluation metrics are described with various comparisons of the most significant works.

  • Current challenges and future directions.

The remainder of the paper is organized as follows. An overview of the related studies, the scope, and previous surveys is given in Sect. 2. A taxonomy of image and video inpainting is presented in Sect. 3. Video inpainting methods are discussed in Sect. 4. A complexity-based comparison of the models is provided in Sect. 5, and the loss functions used are presented in Sect. 6. Public datasets are briefly described in Sect. 7 before the evaluation metrics and various comparisons of the most significant image or video inpainting methods are presented in Sect. 8. Current challenges and future directions are then discussed, and a conclusion closes the paper.

2 Related previous reviews and surveys

In the literature, reviews and surveys on image or video inpainting with different techniques and for different purposes have been published (Patel et al. 2015; Pushpalwar and Bhandari 2016; Ahire and Deshpande 2018; Elharrouss et al. 2020; Wang et al. 2020b; Jam et al. 2021; Yap et al. 2021; Liu et al. 2022a; Weng et al. 2022; He et al. 2022; Xiang et al. 2023; Zhang et al. 2023b). These reviews can be categorized based on the data used in inpainting, such as scratched-picture inpainting (Patel et al. 2015), depth-image inpainting using 3D-to-2D representations (Pushpalwar and Bhandari 2016; Ahire and Deshpande 2018), RGB images (Elharrouss et al. 2020; Wang et al. 2020b), or forensic imaging (Liu et al. 2022a; Zhu et al. 2023). In addition, technique-based reviews can be divided into two categories: traditional methods and deep learning-based approaches. The first reviews of traditional techniques used in image or video inpainting, including texture- and patch-based techniques, were proposed in Patel et al. (2015), Pushpalwar and Bhandari (2016), and Ahire and Deshpande (2018). In some papers, the authors briefly discuss traditional methods before the deep learning methods, as in Weng et al. (2022). Among the deep learning-based reviews, which were mostly published during the last 5 years, several focused on CNNs (Elharrouss et al. 2020; Jam et al. 2021; He et al. 2022; Zhang et al. 2023b), while others focused on GANs (Elharrouss et al. 2020; Wang et al. 2020b; Weng et al. 2022; Xiang et al. 2023). A summary of the existing surveys is provided in Table 1, and Fig. 1 presents the timeline of these reviews, covering different types of data (RGB and depth images) and various techniques, including traditional methods, deep learning (CNN, GAN), and diffusion models.

Table 1 Summary of the existing surveys for image/video inpainting. Journal papers are abbreviated as J and conference papers as Conf

With the introduction of transformer techniques in the field of computer vision, studies have increasingly adopted transformer architectures and the improvements they bring, especially for image or video inpainting. Unlike previous surveys, our work focuses on transformer-based techniques for image or video inpainting. We summarize the existing image or video inpainting models from different aspects and list some of the most effective approaches in terms of qualitative results on several image or video inpainting datasets. Furthermore, we describe and analyze the most commonly proposed architectures in terms of the techniques used and the inpainting challenges they address. Finally, we list the open issues and challenges for image inpainting, in addition to future directions. Through this survey, we aim to make reasonable inferences and predictions about the future development of image or video inpainting, and to provide feasible solutions and guidance for related image processing problems in other domains.

3 Taxonomy of image or video inpainting

In this section, we review transformer-based image inpainting algorithms according to the following taxonomy. First, we discuss the different types of damage (masks) in images or videos. Then, we present the different types of transformer-based architectures for image or video inpainting in detail. The most important models are described in chronological order.

Fig. 2
figure 2

Types of masks added to the edited images

3.1 Mask types

Image inpainting was originally the operation of restoring old images by eradicating scratches and enhancing damaged portions. Presently, it is also employed to eliminate unwanted objects by substituting them with estimated values within the target area. In addition, it is used to repair various distortions, represented as masks such as text, blocks, noise, scratches, lines, and other irregular shapes. These masks indicate the areas in an image or video that need to be filled or reconstructed. Figure 2 illustrates the distortion or mask types used in different image inpainting methods. Several types of masks commonly used in image inpainting are described as follows (a minimal sketch of generating such masks programmatically is given after this list):

  • Blocks: a simple mask where a rectangular or a square region in the image is selected for inpainting. It is easy to create and use; however, it may not always be the most accurate representation of the damaged area.

  • Object: in some cases, only specific objects or regions within an image need to be inpainted. Object masks are used to specify these areas for reconstruction while leaving other parts of the image untouched.

  • Noise: random variations in brightness or color within an image, typically introduced by factors such as low light conditions, sensor limitations, or compression.

  • Scribble: involves marking the areas to be inpainted with simple strokes or scribbles. These masks provide a rough guideline for the inpainting algorithm and are often used in interactive inpainting systems. They are irregularly shaped and can be drawn by hand or generated using algorithms, such as segmentation techniques.

  • Text: unwanted text overlaid on an image, such as watermarks, captions, or annotations. Inpainting text in an image is the operation of removing or replacing the text with the original content. Text masks are often used in document restoration or editing applications.

  • Scratches: thin, elongated marks or lines on the surface of the image, often caused by physical damage or degradation of the image medium. Generally, this type of mask is used in old pictures.
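To make these mask types concrete, the sketch below generates two common synthetic masks used to train and evaluate inpainting models: a rectangular block mask and an irregular scribble-like mask. It is a minimal NumPy/OpenCV illustration; the sizes, stroke counts, and function names are illustrative choices rather than settings from any surveyed method.

```python
import numpy as np
import cv2  # OpenCV, used here only to draw scribble strokes


def block_mask(height, width, hole_h=64, hole_w=64, rng=None):
    """Binary mask with a single rectangular hole (1 = missing pixel)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((height, width), dtype=np.uint8)
    top = int(rng.integers(0, height - hole_h))
    left = int(rng.integers(0, width - hole_w))
    mask[top:top + hole_h, left:left + hole_w] = 1
    return mask


def scribble_mask(height, width, num_strokes=8, max_thickness=12, rng=None):
    """Irregular mask made of random free-form strokes (1 = missing pixel)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(num_strokes):
        x1, y1 = int(rng.integers(0, width)), int(rng.integers(0, height))
        x2, y2 = int(rng.integers(0, width)), int(rng.integers(0, height))
        thickness = int(rng.integers(3, max_thickness))
        cv2.line(mask, (x1, y1), (x2, y2), color=1, thickness=thickness)
    return mask


# Apply a mask to an RGB image: masked pixels are zeroed out for the network input.
image = np.random.rand(256, 256, 3).astype(np.float32)  # stand-in for a real image
mask = scribble_mask(256, 256)
corrupted = image * (1 - mask[..., None])
```

Note that the mask convention (1 for missing pixels vs. 1 for valid pixels) varies between implementations and should be checked before reusing such code.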

Fig. 3
figure 3

Techniques exploited in image inpainting and the mask used

3.2 Transformer network representations

To differentiate the various types of transformer-based network architectures, we divide the image inpainting models into three categories: blind image inpainting networks, mask-required networks, and GAN-based methods. These categories are illustrated in Fig. 3, and a description of each method is detailed in Table 2. The categories of transformer-based architectures are also presented in Fig. 4.

3.2.1 Blind image inpainting

The blind image (single-stream) inpainting network is a neural network architecture designed to fill in missing or corrupted regions within an image using only the corrupted image as input. Through a series of convolutional and/or transformer layers and dedicated inpainting modules, the network predicts the missing regions based on the available information in the input image. Despite their simplicity compared to mask-required (multi-stream) networks, single-stream architectures can still produce impressive inpainting results by learning to infer missing details solely from the provided input image.
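As an illustration of the single-stream idea, the following toy sketch encodes the corrupted image with convolutions, applies standard transformer encoder layers over the flattened feature tokens to capture global context, and decodes back to an image. It is a minimal PyTorch example with illustrative layer sizes and names, not the architecture of any specific method surveyed below.

```python
import torch
import torch.nn as nn


class BlindInpaintingNet(nn.Module):
    """Toy single-stream inpainting network: CNN encoder -> transformer -> CNN decoder."""

    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        # CNN encoder: downsample the corrupted image by 8x into a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer over flattened feature tokens captures long-range context.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # CNN decoder: upsample back to image resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, corrupted):
        feat = self.encoder(corrupted)                  # (B, C, H/8, W/8)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W/64, C)
        tokens = self.transformer(tokens)
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(feat)                       # predicted complete image


# The only input is the corrupted image itself (blind setting): no mask is provided.
net = BlindInpaintingNet()
out = net(torch.randn(1, 3, 128, 128))
```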

  • CTN (Deng et al. 2021) introduces the contextual transformer network (CTN) for image inpainting. CTN tackles the challenge of modeling relationships between corrupted and uncorrupted regions while considering their internal structures. Unlike traditional methods, CTN utilizes transformer blocks to capture long-range dependencies and a multi-scale attention module to model intricate connections within different image regions.

  • ICT (Wan et al. 2021) combines transformers and CNNs for superior image completion. Transformers capture global structures and various outputs, while CNNs refine textures. This method achieves high fidelity, diverse results.

  • MAT (Li et al. 2022a) introduces a transformer-based model named the mask-aware transformer (MAT) to efficiently inpaint large holes in high-resolution images. It combines the strengths of transformers for long-range interactions and convolutions for efficient processing. The model incorporates a customized transformer block with a dynamic mask to focus on relevant information and achieve high-fidelity, various image reconstructions.

  • BAT-Fill (Yu et al. 2021a) is a novel image inpainting method that addresses the limitations of existing CNN-based approaches. While CNNs struggle with capturing long-range features, BAT-Fill leverages a "bidirectional autoregressive transformer" (BAT) for diverse and realistic content generation. Unlike traditional autoregressive transformers, BAT incorporates masked language modeling to analyze context from any direction, enabling better handling of irregularly shaped missing regions.

  • T-former (Deng et al. 2022) proposes a new image inpainting method called T-former, aiming to overcome the limitations of CNNs. While CNNs struggle with complex and diverse image damage, T-former leverages a novel attention mechanism inspired by transformers, which offers efficient long-range modeling capabilities while maintaining computational efficiency compared to traditional transformers.

  • PUT (Liu et al. 2022b) is a novel transformer-based method for image inpainting that addresses the information loss issues in existing approaches. Existing methods down-sample images and quantize pixel values, losing information. PUT uses a patch-based autoencoder and an un-quantized transformer that processes the image in patches without down-sampling. The un-quantized transformer directly uses features from the autoencoder, avoiding information loss from quantization.

  • TransCNN-HAE (Wang et al. 2022) is used for blind image inpainting, which addresses the challenges of unknown and varied image damage. Unlike existing two-stage approaches, TransCNN-HAE operates in a single stage that combines transformers and CNNs; it leverages transformers for global context modeling to repair damaged regions and CNNs for local context modeling to reconstruct the repaired image. The cross-layer dissimilarity prompt (CDP) accelerates the identification and inpainting of damaged areas.

  • InstaFormer (Zhao et al. 2022) is a novel network architecture for image-to-image translation that effectively combines global and instance-level information: (1) Global context: utilizes transformers to analyze the overall content of an image, capturing relationships between different parts. (2) Instance awareness: incorporates bounding box information to understand individual objects within the image and their interactions with the background. (3) Style control: enables the application of different artistic styles to the translated image using adaptive instance normalization. (4) Improved instance translation: introduces a specific loss function to enhance the quality and faithfulness of translated object regions.

  • Campana et al. (2023) note that image inpainting has advanced with the integration of transformers in computer vision; however, their high computational cost poses challenges, especially with large damaged regions. To overcome this, a novel variable-hyperparameter vision transformer architecture is proposed, showing superior performance in reconstructing semantic content, such as human faces.

  • U2AFN (Ma et al. 2023) is an uncertainty-aware adaptive feedback network (U2AFN) used to enhance image inpainting for large holes. Unlike conventional methods, U2AFN predicts uncertainty alongside inpainting results and employs an adaptive feedback mechanism. This mechanism progressively refines inpainting regions by utilizing low-uncertainty pixels from previous iterations to guide subsequent learning.

  • CBNet (Jin et al. 2023): existing methods are effective at completing small or specifically masked corruptions but struggle with largely corrupted images due to limited consideration of semantic relevance. To address this, the authors propose CBNet, a novel image inpainting approach. CBNet combines an adjacent transfer attention (ATA) module in the decoder to preserve contour structure and blend structure-texture information. Additionally, a multi-scale contextual blend (MCB) block assembles multi-stage feature information, and extra deep supervision through a cascaded loss ensures high-quality feature representation.

  • CoordFill (Liu et al. 2023) is a novel method using continuous implicit representation to address limitations in image restoration. By utilizing an attentional fast Fourier convolution (FFC)-based parameter generation network, the degraded image is down-sampled and encoded to derive spatial-adaptive parameters. These parameters are then used in a series of multi-layer perceptrons (MLPs) to synthesize color values from encoded continuous coordinates. This approach captures larger receptive fields by encoding high-resolution images at lower resolutions, while continuous position encoding enhances the synthesis of high-frequency textures. Additionally, the framework enables efficient parallel querying of missing pixel coordinates.

  • CMT (Ko and Kim 2023) is a continuous mask-aware transformer for image inpainting. CMT utilizes a continuous mask to represent error amounts in tokens. It employs masked self-attention with overlapping tokens and updates the mask to model error propagation. Through multiple masked self-attention and mask update layers, CMT predicts initial inpainting results, which are further refined for improved image reconstruction.

  • TransInpaint (Shamsolmoali et al. 2023) is a model for image inpainting that generates realistic content for missing regions while ensuring consistency with the overall context of the image. It utilizes a context-adaptive transformer and a texture enhancement network to produce superior results compared to existing methods.

  • NDMA (Phutke and Murala 2023) is a lightweight architecture for image inpainting, leveraging nested deformable attention-based transformer layers. These layers efficiently extract contextual information, particularly for facial image inpainting tasks. Comparative evaluations on the CelebA-HQ and Places2 datasets demonstrate the superiority of the proposed approach.

  • Blind-Omni-Wav-Net (Phutke et al. 2023) restores corrupted regions without additional mask information. This is challenging due to difficulties in distinguishing between corrupted and valid areas. To address this, the authors proposed an end-to-end architecture combining a wavelet query multi-head attention transformer block with omni-dimensional gated attention. The wavelet query multi-head attention provides encoder features using processed wavelet coefficients, while the omni-dimensional gated attention facilitates effective feature transmission from the encoder to decoder.

  • DNNAM (Chen et al. 2024a) is a new image inpainting algorithm addressing issues like fuzzy images and semantic inaccuracies by using a partial multi-scale channel attention mechanism and deep neural networks. It utilizes a Res-U-Net module for image encoding and decoding, incorporates a residual network to improve feature extraction, and adds a channel attention module to enhance feature utilization.

  • ITrans (Miao et al. 2024) proposed the Inpainting Transformer (ITrans) network, which addresses the limitations of CNNs in handling global image dependencies by integrating self-attention and convolution operations. ITrans enhances a convolutional encoder–decoder with two key components: the Global Transformer, which captures high-level global context for the decoder, and the Local Transformer, which extracts detailed local information efficiently. This combination allows ITrans to model global relationships and encode local details, improving the realism of inpainted images.

  • SyFormer (Wu et al. 2024) introduced Structure-Guided Synergism Transformer (SyFormer), a new approach for large-portion image inpainting. SyFormer addresses issues in high-resolution images with large missing areas, like distorted dependencies and limited reference information. It features a dual-routing filtering module to remove noise and establish global texture correlations. Additionally, a structurally compact perception module aids in matching and filling patches in heavily damaged images using structural priors. These modules work together for complementary feature representation, and the decoding alignment scheme ensures effective texture integration.

  • MFMAM (Chen et al. 2024b) is an improved image inpainting network that addresses information loss and poor restoration of texture and semantic features in current deep learning methods. It introduces a multi-scale fusion module using dilated convolution to maintain information during convolution by integrating multi-scale features. An enhanced attention module improves semantic feature restoration, resulting in clearer texture details.

  • BIDS-Net (Liu et al. 2024) is a bidirectional interaction dual-stream network for image inpainting, designed to combine the strengths of CNNs and Transformers. BIDS-Net uses a CNN stream to capture local details for reconstruction and refinement and a Transformer stream to model long-range contextual correlations globally. The network features a bidirectional feature interaction (BFI) module for selective feature fusion, enhancing the Transformer’s locality and the CNN’s long-range awareness. Both streams utilize a hierarchical encoder–decoder structure for effective multi-scale context reasoning and improved efficiency.

The proposed methods in blind image inpainting have been significantly advanced by the integration of transformers with convolutional neural networks (CNNs). These innovations benefit from the ability of transformers to model long-range dependencies and to enhance relationship modeling between various image regions, tackling challenges such as large missing areas and complex image damage. By combining transformers with CNNs, these approaches achieve better global context capture and texture refinement, resulting in high-resolution inpainting results. Efforts to improve computational efficiency have also led to new architectures that reduce information loss and optimize performance. Additionally, uncertainty-aware mechanisms are being incorporated to enhance predictive accuracy by progressively refining inpainting regions. Overall, these advances considerably improve image inpainting in terms of adaptability, efficiency, and the quality of restored images.

Table 2 Summarization of image/video inpainting methods

3.2.2 Mask-required image inpainting

Mask-required networks for inpainting use a neural network architecture with multiple input streams to perform the inpainting task: the mask is fed into the network together with the distorted image. This architecture is designed to handle various types of input information, which can improve inpainting performance by leveraging different features and representations (Xiang et al. 2024). The network takes in multiple streams of input data, each representing a different type of information relevant to the inpainting task. For example, one stream can contain the corrupted image, another can contain additional contextual information, such as edge maps or semantic segmentation masks, and another can contain guidance from reference images.
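A common way to realize this multi-stream input is to concatenate the corrupted image, the binary mask, and any auxiliary guidance map along the channel dimension before encoding. The sketch below illustrates this convention; the module name, channel layout, and layer sizes are illustrative assumptions rather than the design of any particular method.

```python
import torch
import torch.nn as nn


class MaskGuidedEncoder(nn.Module):
    """Toy multi-stream front end: image + mask (+ optional edge map) -> feature map."""

    def __init__(self, dim=128, use_edges=True):
        super().__init__()
        # 3 RGB channels + 1 mask channel (+ 1 edge-map channel if provided).
        in_channels = 3 + 1 + (1 if use_edges else 0)
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, corrupted, mask, edges=None):
        # mask: 1 where pixels are missing, 0 where pixels are valid.
        streams = [corrupted, mask]
        if edges is not None:
            streams.append(edges)
        x = torch.cat(streams, dim=1)   # fuse the streams along the channel axis
        return self.stem(x)


encoder = MaskGuidedEncoder()
img = torch.randn(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[:, :, 96:160, 96:160] = 1.0        # a square hole
edges = torch.randn(1, 1, 256, 256)     # e.g., the output of an edge detector
features = encoder(img * (1 - mask), mask, edges)
```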

  • ZITS (Dong et al. 2022) tackles the challenge of restoring both textures and structures in corrupted images. While CNNs struggle with capturing holistic structures, attention-based models are computationally expensive for large images. For that, a structure restorer network uses a transformer model in a low-resolution space to efficiently recover the overall structure of the image. The recovered structure is then integrated with existing inpainting models to add details and textures.

  • APT (Huang and Zhang 2022) is a two-stage image inpainting framework using a novel "atrous pyramid transformer" (APT). APT captures long-range dependencies to reconstruct damaged areas, while a "dual spectral transform convolution" (DSTC) module refines textures.

  • SPN (Zhang et al. 2023a): restoring realistic content in images with missing regions is challenging, and existing image inpainting models often produce blurred textures or distorted structures in complex scenes due to contextual ambiguity. To address this, the authors proposed the semantic pyramid network (SPN), leveraging multi-scale semantic priors learned from pretext tasks. SPN comprises two components: a prior learner that distills semantic priors into a multi-scale feature pyramid, ensuring a coherent understanding of global context and local structures, and a fully context-aware image generator that progressively refines visual representations with the prior pyramid. Optionally, variational inference enables probabilistic inpainting.

  • SWMH (Chen et al. 2023a) is a model that combines a specialized transformer, named the stripe window multi-head (SWMH) transformer, with a traditional CNN. It has a novel loss function to enhance color details beyond RGB channels.

  • ZITS++ (Cao et al. 2023) is an improved model of the authors' previous work, ZITS. ZITS++ combines a specialized transformer with a traditional CNN. It introduces the transformer structure restorer (TSR) module for holistic structural priors at low resolution, upsampled by the simple structure upsampler (SSU). Texture details are restored using the Fourier CNN texture restoration (FTR) module, enhanced by Fourier and large-kernel attention convolutions. The upsampled structural priors from TSR are further processed by the structure feature encoder (SFE) and optimized incrementally with zero-initialized residual addition (ZeroRA). Additionally, a new masking positional encoding addresses large, irregular masks.

  • TransRef (Liao et al. 2023) is a transformer-based encoder–decoder network for reference-guided image inpainting. The guidance process involves progressively aligning and fusing referencing features with the features of the corrupted image. To precisely utilize reference features, they introduce the reference-patch alignment (Ref-PA) module, which aligns patch features from both reference and corrupted images while harmonizing style differences. Additionally, the reference-patch transformer (Ref-PT) module refines the embedded reference feature.

  • UFFC (Chu et al. 2023) examines the limitations of using the vanilla FFC module in image inpainting, including spectrum shifting and limited receptive fields. To address these issues, a novel unbiased fast Fourier convolution (UFFC) module was proposed, incorporating range transform, absolute position embedding, dynamic skip connection, and adaptive clipping. The experimental results demonstrate that the UFFC module outperforms existing methods in capturing texture and achieving faithful reconstruction in image inpainting tasks.

  • DF3Net (Huang et al. 2024): the authors presented DF3Net, a Dual Frequency Feature Fusion Network for image inpainting. It addresses transformers’ limitations in handling high-frequency image details by using a dual-frequency convolution (DFC) module to separate and process low and high-frequency components individually. The hierarchical atrous transformer (HAT) handles low-frequency data, while a gated convolution module deals with high-frequency data. These are fused to improve image reconstruction, enhancing both global structures and local textures.

  • SM-DCA (Xiang et al. 2024a): the authors introduced SM-DCA, a structure-aware multi-view image inpainting method using dual consistency attention. It addresses limitations in traditional single-view inpainting by utilizing additional structure views to enhance consistency and reduce artifacts. SM-DCA has two components: the first involves structure-aware inpainting using a structure inpainting network (SSC) for strong structure correction and an image inpainting network (IDCA) for maintaining content consistency. The second part is image refinement, achieved through an image local refinement network (ILR). This approach enhances inpainting results by making them more coherent and visually accurate.

Fig. 4
figure 4

Transformer-based architectures used in different image inpainting methods. The transformer blocks are combined with different convolutional neural network (CNN)-based parts, as in the Enc-TR-Dec, TR-Dec, and Enc-TR representations. Some architectures use a pure transformer-based network, such as TR and UNet-TR, which utilize a transformer-based encoder and decoder

3.2.3 GAN with transformer image inpainting

GAN-based image inpainting utilizes GANs to fill in missing or corrupted regions of an image. In this approach, two neural networks are trained simultaneously: a generator network and a discriminator network (Goodfellow et al. 2014). The generator generates realistic content for the missing regions, while the discriminator tries to distinguish between the inpainted images and real images. Through adversarial training, the generator learns to produce convincing inpainted images that fool the discriminator. This method often produces visually appealing results by leveraging the adversarial loss to capture high-level image structures and textures. However, GAN-based inpainting is prone to issues such as mode collapse or blurriness, requiring careful optimization and architectural choices to address these challenges.
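The adversarial scheme described above can be summarized by the following minimal training-step sketch, assuming a non-saturating GAN loss combined with an L1 reconstruction term; `generator`, `discriminator`, and the loss weight are placeholders for any inpainting model and are not taken from a specific surveyed method.

```python
import torch
import torch.nn.functional as F


def gan_inpainting_step(generator, discriminator, g_opt, d_opt, image, mask, l1_weight=10.0):
    """One adversarial training step for GAN-based inpainting (non-saturating GAN loss + L1)."""
    corrupted = image * (1 - mask)  # zero out the region to be inpainted

    # --- Discriminator update: real images vs. inpainted (fake) images ---
    with torch.no_grad():
        fake = generator(corrupted, mask)
    real_logits = discriminator(image)
    fake_logits = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator update: fool the discriminator while staying close to the ground truth ---
    fake = generator(corrupted, mask)
    fake_logits = discriminator(fake)
    g_adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_rec = F.l1_loss(fake, image)
    g_loss = g_adv + l1_weight * g_rec
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```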

  • ACCP-GAN (Wang et al. 2021) is a method for automatically repairing defects in serial section images used in histology studies. ACCP-GAN combines two stages: one to detect and roughly fix damaged areas, and another to precisely refine the repairs. The model leverages transformers and convolutions to analyze neighboring images and healthy regions within the defective image, achieving high accuracy in both segmentation and restoration tasks.

  • AOT-GAN (Zeng et al. 2022) tackles challenges in high-resolution image inpainting. It improves both context reasoning and texture synthesis, leading to more realistic reconstructions compared to existing approaches. This method is particularly effective for large, irregular missing regions.

  • HiMFR (Hosen and Islam 2022) is a system for recognizing masked faces. HiMFR first detects masked faces using a pretrained model and inpaints the occluded regions using a GAN-based method. Finally, it recognizes the face, whether masked or reconstructed, using a hybrid recognition module. Experiments show competitive performance on benchmark datasets.

  • Wang et al. (2022) (generative image inpainting with enhanced gated convolution and transformers) address limitations in reconstructing large damaged areas with image inpainting. The authors propose an enhanced gated convolution to extract detailed features from the masked region using a gating mechanism. A U-Net-like deep structure modeling module combines the long-range modeling of transformers with the texture learning of CNNs to capture global structures. Finally, a reconstruction module merges shallow and deep features to generate the final inpainted image.

  • Li et al. (2023a) observe that recent image inpainting advances perform well on simple backgrounds but struggle with complex images due to the lack of semantic understanding and distant context. To address this, a semantic prior-driven fused contextual transformation network was proposed. It utilizes a semantic prior generator to map features from the ground truth and damaged images, followed by a fusion strategy to enhance multi-scale texture features and an attention-aware module for structure restoration. Additionally, a mask-guided discriminator improves output quality. Results on various datasets show significant improvements over existing methods.

  • Swin-GAN (Zhou et al. 2023b) presents a transformer-based method for image inpainting, aiming to overcome limitations in capturing global and semantic information. The technique utilizes self-supervised attention and a hierarchical Swin transformer in the discriminator. Experimental results show superior performance compared to existing approaches, demonstrating the effectiveness of the proposed transformer-based approach.

  • SFI-Swin (Naderi et al. 2023): image inpainting involves filling in the holes or missing parts of an image. Inpainting face images with symmetric characteristics is even more challenging than inpainting natural scenes. Existing powerful models struggle to fill in missing parts while considering both symmetry and homogeneity, and standard metrics for assessing repaired face image quality fail to capture the preservation of symmetry between rebuilt and existing facial features. To address this, the authors propose a GAN-transformer-based solution: multiple discriminators that independently verify the realism of each facial organ, combined with a transformer-based network. They also introduce a novel metric called the "symmetry concentration score" to measure the symmetry of repaired face images.

  • IIN-GCMAM (Liu et al. 2023) is an image inpainting network using gated convolution and a multi-level attention mechanism to address deficiencies in existing methods. By weighing features with gated convolutions and employing multi-level attention, it enhances global structure consistency and repair result precision. Extensive experiments on datasets, such as Paris Street View (PSV) and CelebA, have validated its effectiveness.

  • WAT-GAN (Chen et al. 2023b) is a novel transformer network with cross-window aggregated attention, used to address limitations of convolutional networks such as over-smoothing and limited long-range dependencies. Integrated into a generative adversarial network, this approach embeds the window aggregation transformer (WAT) module to enhance information aggregation between windows without increasing computational complexity. Initially, the encoder extracts multi-scale features using convolution kernels of varying scales. These features are then input into the WAT module for inter-window aggregation, followed by reconstruction by the decoder. The resulting image is assessed by a global discriminator for authenticity. Experimental validation demonstrates that the window attention transformer network enhances the structured texture of restored images, particularly in scenarios involving large or complex structural restoration tasks.

  • PATMAT (Motamed et al. 2023) is a method for face inpainting that enhances the preservation of facial details and identity. By fine-tuning a MAT with reference images, it outperforms existing models in quality and identity preservation.

  • GCAM (Chen et al. 2023c) is a lightweight image inpainting method that emphasizes both restoration quality and efficiency on limited processing platforms. By combining group convolution and a rotating attention mechanism, the traditional convolution module is enhanced or replaced. Group convolution enables multi-level inpainting, while the rotating attention mechanism addresses information mobility issues between channels. A parallel discriminator structure ensures local and global consistency in the inpainting process. Experimental results show that the proposed method achieves high-quality inpainting while significantly reducing inference time and resource usage compared to other lightweight approaches.

  • SFI-Swin (Givkashi et al. 2024): addresses the challenge of face image inpainting, particularly focusing on maintaining symmetry, which is a limitation of current models. It introduces a new method using multiple discriminators to evaluate the realism of each facial feature separately and a transformer-based network. Additionally, a new metric, the "symmetry concentration score," is proposed to measure symmetry in repaired face images.

4 Video inpainting

  • FuseFormer (Liu et al. 2021a) is a transformer model tailored for video inpainting tasks to address issues with blurry edges. It utilizes soft split and soft composition operations to enhance fine-grained feature fusion. Soft split divides feature maps into overlapping patches, while soft composition stitches patches together, allowing for more effective interaction between neighboring patches. These operations are integrated into the tokenization and detokenization processes for better feature propagation. Additionally, FuseFormer enhances the capability of 1D linear layers to model 2D structures, improving sub-patch level feature fusion. The evaluation results demonstrate the superiority of FuseFormer over existing methods in both quantitative and qualitative assessments.

  • FAST (Yu et al. 2021b) is a frequency-aware spatiotemporal transformer used for video inpainting detection. It utilizes global self-attention mechanisms to capture long-range relations and employs a spatiotemporal transformer framework to detect spatial and temporal connections. Additionally, FAST exploits frequency domain information using a specially designed decoder. Experimental results show competitive performance and good generalization.

  • DSTT (Liu et al. 2021b) is a decoupled spatial-temporal transformer used for efficient video inpainting. It separates learning spatial-temporal attention into two tasks: one for temporal object movements and another for background textures. This allows precise inpainting. Additionally, a hierarchical encoder is used for robust feature learning.

  • E2FGVI (Li et al. 2022b) is an end-to-end framework for flow-guided video inpainting to improve the efficiency and effectiveness compared to existing methods. It replaces separate hand-crafted processes with three trainable modules: flow completion, feature propagation, and content hallucination. These modules correspond to previous stages but can be jointly optimized, leading to better results.

  • FGT (Zhang et al. 2022) is a flow-guided transformer used for high-fidelity video inpainting, which utilizes motion discrepancy from optical flows to guide attention retrieval in transformers. A flow completion network is introduced to restore corrupted flows by leveraging relevant flow features within a local temporal window. With completed flows, the content is propagated across frames and flow-guided transformers are employed to fill in corrupted regions. Transformers are decoupled along temporal and spatial dimensions to integrate the completed flows for spatial attention. Additionally, a flow-reweight module controls the impact of completed flows on each spatial transformer. For efficiency, a window partition strategy is employed in both the spatial and temporal transformers.

  • DeViT (Cai et al. 2022), the deformed vision transformer, presents three key innovations: DePtH for patch alignment, MPPA for enhanced feature matching, and STA for accurate attention assignment. DeViT outperforms previous methods both quantitatively and qualitatively, setting a new state of the art for video inpainting.

  • DLFormer (Ren et al. 2022) is a discrete latent transformer. Unlike previous methods operating in continuous feature spaces, DLFormer utilizes a discrete latent space, leveraging a compact codebook and autoencoder to represent the target video. By inferring proper codes for unknown areas via self-attention, DLFormer produces fine-grained content with long-term spatial-temporal consistency. Additionally, it enforces short-term consistency to reduce temporal visual jitters.

Fig. 5
figure 5

Loss functions used in each transformer-based image or video inpainting method. Five loss functions are combined for transformer-based learning

  • DMT (Yu et al. 2023) is a dual-modality-compatible inpainting framework used to address deficiencies in video inpainting. DMT_img, a pretrained image inpainting model, serves as a prior for distilling DMT_vid, enhancing performance in deficiency cases. The self-attention module selectively incorporates spatiotemporal tokens, accelerating inference and removing noise signals. Additionally, a receptive field contextualizer improves performance further.

  • FGT++ (Zhang et al. 2024) is an enhanced version of the flow-guided transformer (FGT), resulting in more effective and efficient video inpainting. FGT++ addresses query degradation using a lightweight flow completion network and introduces flow guidance feature integration and flow-guided feature propagation modules. The transformer is decoupled along the temporal and spatial dimensions, utilizing flows for token selection and employing a dual-perspective multi-head self-attention (MHSA) mechanism. Experimental results show that FGT++ outperforms existing video inpainting networks in both quality and efficiency.

  • Liao et al. (2020) propose an automatic video inpainting algorithm for clear street views in autonomous driving. Using depth/point cloud guidance, this method removes traffic agents from videos and fills missing regions. By creating a dense 3D map from point clouds, frames are geometrically correlated, allowing for straightforward pixel transformation. Multiple videos can be fused through 3D point cloud registration, addressing long-time occlusion challenges.

  • FITer (Li et al. 2023b) is a video inpainting method that enhances missing region representations using a feature pre-inpainting network (FPNet) before the transformer stage. This improves the accuracy of self-attention weights and dependency learning. FITer also employs an interleaving transformer with global and window-based local self-attention mechanisms for efficient aggregation of spatial-temporal features into missing regions.

  • ProPainter (Zhou et al. 2023a) is an enhanced framework for video inpainting that addresses the limitations in flow-based propagation and spatiotemporal transformers. ProPainter combines image and feature warping for more reliable global correspondence and employs a mask-guided sparse video transformer for increased efficiency.

  • SViT (Lee et al. 2023) is a new transformer-based video inpainting technique that leverages semantic information to enhance reconstruction quality. Using a mixture-of-experts scheme, multiple experts are trained to handle mixed scenes with various semantics. By producing different local network parameters at the token level, this method achieves semantic-aware inpainting results.

  • FSTT (Liu and Zhu 2023) uses a flow-guided spatial temporal transformer (FSTT) for video inpainting, which effectively utilizes optical flow to establish correspondence between missing and valid regions in spatial and temporal dimensions. FSTT incorporates a flow-guided fusion feed-forward module to enhance features with optical flow guidance, reducing inaccuracies during MHSA. Additionally, a decomposed spatiotemporal MHSA module captures dependencies effectively. To improve efficiency, a global-local temporal MHSA module was designed.

5 Complexity-based comparison

Each inpainting method presents unique approaches and innovations to tackle the complexity of reconstructing missing parts in an image or video. In this section, the different methods are compared in terms of their proposed architectures as well as their model complexity, represented here by the number of parameters and floating-point operations (FLOPS) of each model. For this comparison, we selected a set of image and video inpainting methods already discussed in the previous sections.

Fig. 6
figure 6

Flowchart of the one-stream TransCNN-HAE architecture (Wang et al. 2022), composed of a vision-transformer-based encoder and a CNN-based decoder. The cross-layer dissimilarity prompt (CDP) and transformer blocks are used for inpainting, and the CNN decoder then reconstructs the blind-inpainted image.

Fig. 7
figure 7

Overview of the two-stream TransRef architecture (Liao et al. 2023). The first stream takes the corrupted image and its mask as inputs to the patch embedding and Main-PT modules. The reference stream is processed by the reference embedding procedure at each scale through the Ref-PA and Ref-PT modules. Finally, the hierarchical features from the Main-PT module, along with the decoder features from the transformer decoder block, are fed into a convolutional tail to create the final image.

For image inpainting, five methods that reported FLOPS and parameter counts are compared here: TransCNN-HAE (Wang et al. 2022), Campana et al. (2023), CBNet (Jin et al. 2023), TransRef (Liao et al. 2023), and Blind-Omni-Wav-Net (Phutke et al. 2023). Campana et al. (2023), the smallest in terms of parameters, utilized a transformer-based architecture aimed at managing computational costs while effectively capturing global image information. With the highest FLOPS of 20.12G and a relatively low parameter count of 1.65M, this method demonstrates efficiency and computational effectiveness. In the same context, as presented in Fig. 6, TransCNN-HAE (Wang et al. 2022) introduced a one-stage transformer-CNN hybrid autoencoder designed to combine the strengths of global contextual modeling from transformers and local feature extraction from CNNs; it operates at 19.71G FLOPS and 2.75M parameters. In contrast, CBNet (Jin et al. 2023) enhanced feature representation for large-corruption scenarios with its adjacent transfer attention and multi-scale contextual blend mechanisms; it achieves 17.94G FLOPS but requires a higher parameter count of 21.03M. TransRef (Liao et al. 2023) is another image inpainting method that presents a two-stream architecture (Fig. 7), using reference images in its encoder–decoder setup and focusing on aligning and refining features for improved semantic coherence. TransRef has the lowest FLOPS at 7.55G but the highest parameter count at 41.97M, making it a comparatively heavy architecture. Finally, Blind-Omni-Wav-Net (Phutke et al. 2023) employed wavelet query multi-head attention and omni-dimensional gated attention, presenting a balanced approach with 16.61G FLOPS and 3.24M parameters without the need for mask predictions. These methods illustrate different image inpainting techniques that balance computational efficiency and model complexity.

Table 3 Comparison of image/video inpainting methods based on the key innovations, advantages, and complexity of the proposed architectures

For video inpainting, each method employs distinct strategies to enhance its effectiveness and efficiency, with differences in complexity and size. For example, ProPainter (Zhou et al. 2023a) addressed propagation and memory issues by combining dual-domain propagation with a mask-guided sparse transformer, achieving significant PSNR improvements but with a higher FLOPS value of 808G. With 523G FLOPS, FSTT (Liu and Zhu 2023) enhanced spatiotemporal integration using optical flows, incorporating modules such as flow-guided fusion and decomposed spatiotemporal MHSA for refined attention accuracy and efficiency. FGT, FGT++, and FITer require fewer FLOPS than ProPainter and FSTT and are close to one another in terms of FLOPS and number of parameters. FGT (Zhang et al. 2022) utilized optical flow to enhance spatial attention and optimize high-fidelity video inpainting with a flow completion network and dual-perspective spatial MHSA, achieving high fidelity with 455.91G FLOPS and 42.31M parameters. Building on FGT, FGT++ (Zhang et al. 2024) (Fig. 8) introduced a lightweight flow completion network and temporally deformable MHSA to tackle query degradation, furthering its efficiency and effectiveness with 488.59G FLOPS and 53.30M parameters. In the same context, with 266G FLOPS and 26.8M parameters, DeViT (Cai et al. 2022) improved patch alignment and saliency-based feature matching, with mask pruning-based patch attention and a novel spatial-temporal weighting adaptor, as presented in Fig. 9. With the lowest FLOPS value of 128G, DSTT (Liu et al. 2021b), the decoupled spatial-temporal transformer, focused on separating spatial and temporal attention for efficient computation.

Each method contributes unique architectural advancements that address specific challenges in video inpainting, using different features and techniques, such as transformers and optical flows, to optimize the integration of temporal and spatial information. The proposed methods also attempt to optimize video inpainting performance and efficiency with varying complexity, represented by FLOPS and the number of parameters, which makes selecting the best model a challenging task. This comparison, also presented in Table 3, can therefore be used to select a model suited to the resources available for training or testing an image/video inpainting method.
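For readers who wish to reproduce such complexity figures for their own models, the parameter count can be computed directly in PyTorch, while FLOPS are usually estimated with a profiling tool. The sketch below is a minimal example; the stand-in model and the use of the third-party `thop` profiler are assumptions made for illustration, not part of any surveyed method.

```python
import torch
import torch.nn as nn


def count_parameters(model: nn.Module) -> float:
    """Total number of trainable parameters, reported in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


# Stand-in model; in practice this would be the inpainting network under study.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
print(f"Params: {count_parameters(model):.3f}M")

# Computational cost is usually estimated with a profiler such as `thop`
# (assumed installed via `pip install thop`); it reports multiply-accumulate operations (MACs).
try:
    from thop import profile
    macs, _ = profile(model, inputs=(torch.randn(1, 3, 256, 256),))
    print(f"MACs: {macs / 1e9:.2f}G")
except ImportError:
    pass  # profiler unavailable; the parameter count alone is still informative
```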

Fig. 8
figure 8

The FGT++ architecture (Zhang et al. 2024) comprises two steps. First, the local aggregation flow completion network (LAFC) restores the corrupted target flows. Second, the damaged regions are synthesized using an enhanced flow-guided transformer, guided by the completed optical flows. The "flow-guided content propagation" module is optional. PEG stands for position embedding generator.

Fig. 9
figure 9

Overview of the DeViT framework (Cai et al. 2022), which includes a frame-level encoder and decoder, deformed patch-based homography (DePtH), which warps and aligns each patch-wise key (K) and value (V) to the query (Q), and a spatial-temporal transformer with a weighting adaptor (STA), where the basic attention operator is multi-projection path aggregation (MPPA).

6 Loss functions

Across the cited transformer-based methods for image or video inpainting, various loss functions have been utilized to guide the generation of realistic results. Generally, authors combine more than one loss function when training these methods, because each function serves a different objective. The most used loss functions in image inpainting include the mean absolute error (L1) loss (Wang et al. 2018a), adversarial loss, perceptual loss (Gatys et al. 2016), reconstruction loss (Wang et al. 2018c), style loss (Johnson et al. 2016), and feature map loss. Some other loss functions are used in a smaller number of papers, such as the mask and SSIM losses used in ACCP-GAN (Wang et al. 2021), binary cross-entropy loss (Ruby and Yendapalli 2020), cross-entropy loss (Zhou et al. 2021), and diversified Markov random field loss (He and Yin 2021).

For image inpainting and image translation tasks, including image generation and image segmentation, incorporating multiple loss functions can produce results that are visually better and semantically more effective. These loss functions can be categorized into three classes: contextual-based, style-based, and structure-based losses. Contextual-based loss functions focus on preserving the content or semantic information of the image, ensuring that the inpainted regions are coherent and homogeneous with the neighboring regions. They can also measure the similarity between the inpainted image and the ground truth in terms of both low-level details and high-level structures, preserving realistic content; the L1 and reconstruction losses fall into this category (Wang et al. 2018a; Wang et al. 2018c). The style-based category focuses on capturing high-level semantic information rather than pixel-level details. It specifically targets the texture and artistic style of the original image, which is achieved by comparing the statistics of feature maps across different layers of the network; this category includes perceptual loss, style loss, and adversarial loss (Johnson et al. 2016). Structural loss, which can be viewed as a type of contextual loss, emphasizes maintaining the contextual coherence and structural integrity of the inpainted image while preserving the surrounding content. Figure 5 summarizes the loss functions used in the reviewed methods. A description of each of these loss functions follows, and a minimal sketch of a typical weighted combination is given after the individual descriptions.

Mean absolute error (L1) loss measures the absolute pixel-wise differences between the inpainted image and the ground truth. Minimizing it drives the generated image to be close to the original image in terms of pixel values.

Adversarial loss, introduced with GANs, involves a generator and a discriminator. The generator aims to produce realistic images, while the discriminator learns to distinguish between real and generated images. The adversarial loss encourages the generator to produce visually convincing content.

Perceptual loss focuses on capturing high-level semantic information by measuring the differences between feature representations of the original and generated images. It typically uses deep learning models to extract features at various layers. By minimizing perceptual loss, the inpainted image is encouraged to share the structural similarities and perceptual characteristics of the original image.

High Receptive Field (HRF) refers to the capability of a neural network to consider a larger context from the input image, which can enhance its understanding of overall structure and content.

Reconstruction loss compares the inpainted image with the original image after each transformation, assessing the overall differences between generated and real images.

Style loss captures the texture and artistic style of the original image. It is computed by comparing the statistics of feature maps across different layers. By minimizing style loss, the inpainted regions are encouraged to mimic the artistic style of the surrounding image content.

Feature Map loss measures the similarity between feature maps extracted from the inpainted image and those from the ground-truth image. It encourages the inpainted regions to preserve important visual structures and textures present in the original image. Feature map loss is often used in conjunction with perceptual loss to guide the inpainting process effectively.

Hinge loss for video inpainting is used in adversarial settings, where it helps train discriminators to distinguish between real and inpainted video frames, encouraging inpainted frames to be indistinguishable from the original content. It is not commonly used for the inpainting task directly but can improve the quality of inpainted videos in GAN-based approaches.

Cross-entropy loss, in the context of video inpainting, is mainly suited to classification tasks within the inpainting pipeline, such as segmenting the regions to be inpainted. It measures the difference between the predicted and actual class distributions (e.g., inpainted vs. not-inpainted pixels).
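To make the combination of these terms concrete, the following is a minimal sketch of a typical weighted objective mixing the L1, perceptual, style, and adversarial losses described above. It assumes a frozen VGG-16 feature extractor for the perceptual and style terms; the layer choice, loss weights, and function names are illustrative assumptions, not values taken from any surveyed method.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG-16 feature extractor for the perceptual/style terms (pretrained weights assumed;
# ImageNet normalization of the inputs is omitted here for brevity).
vgg = vgg16(weights="DEFAULT").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)


def gram(feat):
    """Gram matrix of a feature map, used by the style loss."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def inpainting_loss(pred, target, d_logits_fake,
                    w_l1=1.0, w_perc=0.1, w_style=250.0, w_adv=0.01):
    """Weighted combination of L1, perceptual, style, and adversarial losses."""
    feat_pred, feat_gt = vgg(pred), vgg(target)
    l1 = F.l1_loss(pred, target)                         # pixel-level term
    perceptual = F.l1_loss(feat_pred, feat_gt)           # feature-level term
    style = F.l1_loss(gram(feat_pred), gram(feat_gt))    # texture/style term
    adv = F.binary_cross_entropy_with_logits(            # encourage fooling the discriminator
        d_logits_fake, torch.ones_like(d_logits_fake))
    return w_l1 * l1 + w_perc * perceptual + w_style * style + w_adv * adv
```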

7 Image and video inpainting datasets

To evaluate image inpainting methods, various datasets have been used. The Paris Street View dataset (Pathak et al. 2016) was created for image inpainting, while others originate from other tasks, such as Places2 (Zhou et al. 2017) for scene recognition, CelebA-HQ (Karras et al. 2017) and FFHQ (Karras et al. 2019) for face recognition, and YouTube-VOS (Xu et al. 2018) for video object segmentation. In this section, the datasets most frequently used in transformer-based image inpainting methods are reviewed.

Paris Street View Dataset: The Paris Street View dataset (Pathak et al. 2016) consists of 14,900 training images and 100 test images captured from street views in Paris. These images primarily focus on the city’s buildings, making the dataset valuable for tasks related to urban scenes and architectural elements.

CelebA-HQ Dataset: CelebA-HQ (Karras et al. 2017) is an extension of the CelebA dataset, providing high-quality images of celebrities with diverse attributes. It contains over 30,000 high-resolution images (1024 \(\times\) 1024 pixels) of celebrities in various poses, lighting conditions, and backgrounds. CelebA-HQ is commonly used for tasks such as facial recognition, attribute classification, and image generation, including image inpainting.

Places2 Dataset: Places2 (Zhou et al. 2017) is a large-scale dataset focusing on scene understanding, containing images of various indoor and outdoor scenes from around the world. It includes over 10 million images covering 365 scene categories, ranging from natural landscapes to urban environments. Places2 is used for several tasks such as scene classification, semantic segmentation, and image inpainting.

FFHQ Dataset: The Flickr-Faces-HQ (FFHQ) dataset (Karras et al. 2019) is a high-quality collection of 70,000 human face images at a resolution of 1024 \(\times\) 1024 pixels. The dataset contains images with wide variation in age, ethnicity, and image background, as well as a diverse range of attributes such as eyeglasses, sunglasses, and hats. FFHQ is used for different tasks, such as image generation, super-resolution, denoising, and inpainting.

YouTube-VOS Dataset: The YouTube-VOS (Video Object Segmentation) (Xu et al. 2018) dataset is designed for the task of semi-supervised video object segmentation, where the goal is to segment objects of interest in videos. It contains high-resolution video sequences with pixel-level annotations for foreground objects across multiple frames. The YouTube-VOS dataset is used for video object segmentation and video inpainting tasks.

DAVIS Dataset: The Densely Annotated Video Segmentation (DAVIS) (Perazzi et al. 2016) dataset is a comprehensive resource designed specifically for the task of video object segmentation, offering high-quality annotations across consecutive frames for precise delineation of object boundaries. It contains 50 high-quality video sequences. Furthermore, DAVIS provides a benchmark for evaluating algorithms in the field of video segmentation, in addition to video inpainting.

8 Results and discussion

8.1 Evaluation metrics

To evaluate the performance of image or video inpainting methods, a set of metrics is used to compare the generated image with the ground truth. In this section, we selected the metrics most used for image and video inpainting, including the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), learned perceptual image patch similarity (LPIPS), and Fréchet inception distance (FID). The metrics can be divided into two categories: pixel-based metrics, which evaluate the quality of the generated images, and patch-based metrics, which compute the perceptual similarity between two images. FID and LPIPS are patch-based metrics, and PSNR and SSIM are pixel-based metrics.

8.2 Pixel-based metrics

Pixel-based metrics evaluate images at the level of individual pixels. PSNR and SSIM are the pixel-based metrics used to evaluate image inpainting methods.

Peak Signal-to-Noise Ratio (PSNR): This metric evaluates the quality of the generated image by comparing it with the ground truth. A higher PSNR indicates less noise and better quality. The PSNR is defined as follows:

$$\begin{aligned} PSNR = 10 \log _{10} \left( \frac{MAX_I^2}{MSE} \right) \end{aligned}$$
(1)

where \(MAX_I\) is the maximum pixel value of the image, and MSE is the Mean Squared Error between the ground-truth and generated image.
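As a quick illustration of Eq. (1), a NumPy implementation of PSNR might look as follows; the function name and the 8-bit default for \(MAX_I\) are our own choices.

```python
import numpy as np

def psnr(ground_truth, generated, max_value=255.0):
    """PSNR following Eq. (1); max_value is MAX_I (255 for 8-bit images)."""
    gt = ground_truth.astype(np.float64)
    gen = generated.astype(np.float64)
    mse = np.mean((gt - gen) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```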

Structural Similarity Index (SSIM): This metric compares how similar two images are in terms of luminance, contrast, and structure, mimicking human perception. A higher SSIM indicates that the images are more alike. The SSIM is defined as follows:

$$\begin{aligned} SSIM(x, y) = \frac{(2 \mu _x \mu _y + C_1)(2 \sigma _{xy} + C_2)}{(\mu _x^2 + \mu _y^2 + C_1)(\sigma _x^2 + \sigma _y^2 + C_2)} \end{aligned}$$
(2)

where \(\mu _x\) and \(\mu _y\) denote the mean luminance values of images x and y, respectively; \(\sigma _x\) and \(\sigma _y\) denote the standard deviations of x and y, respectively; \(\sigma _{xy}\) is the covariance between x and y; and \(C_1\) and \(C_2\) are constants.
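The sketch below evaluates the global-statistics form of Eq. (2) directly; note that standard SSIM is usually computed over local (often Gaussian-weighted) windows and then averaged, so this single-window version is only illustrative, and the stabilizing constants follow the commonly used choices \(C_1=(0.01L)^2\) and \(C_2=(0.03L)^2\).

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Global-statistics SSIM following Eq. (2); illustrative only."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2   # commonly used stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```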

8.3 Patch-based metrics

Patch-based metrics evaluate images by comparing patches or local regions instead of individual pixels. These metrics typically use deep learning techniques to extract features from the patches. LPIPS and FID are patch-based metrics used to evaluate image inpainting methods.

Learned Perceptual Image Patch Similarity (LPIPS): This metric is used to evaluate the perceptual similarity between two images. Unlike traditional metrics, such as the MSE or PSNR, which measure pixel-wise differences, LPIPS is designed to capture perceptual differences that align more closely with human perception. LPIPS calculates the average Euclidean distance between the feature representations of corresponding patches or layers extracted from the images. This distance reflects the perceptual difference between the two images. The LPIPS metric is expressed in Wang et al. (2024) as follows:

$$\begin{aligned} \text {LPIPS}(I_1, I_2) = \sum _{i=1}^{L} \frac{1}{H_i W_i C_i} \sum _{h,w,c} \left( \phi _i(I_1)_{h,w,c} - \phi _i(I_2)_{h,w,c}\right) ^2 \end{aligned}$$
(3)

where \(I_1\) and \(I_2\) are the two images, L is the number of layers of the deep network, \(\phi _i\) is the feature extraction function of the i-th layer, and \(H_i\), \(W_i\), and \(C_i\) are the height, width, and number of channels of the i-th layer, respectively.
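The sketch below approximates Eq. (3) using raw VGG16 features from torchvision. The official LPIPS metric additionally unit-normalizes the activations and applies learned per-channel weights, and the chosen layer indices are an assumption, so the resulting values are only indicative of the perceptual distance.

```python
import torch
import torchvision.models as models

# Raw-feature approximation of Eq. (3); layer indices are assumed to be the
# ReLU outputs of the five VGG16 convolutional blocks.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
LAYER_IDS = {3, 8, 15, 22, 29}

@torch.no_grad()
def perceptual_distance(img1, img2):
    # img1, img2: (1, 3, H, W) tensors preprocessed with ImageNet statistics
    dist, x, y = 0.0, img1, img2
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in LAYER_IDS:
            _, c, h, w = x.shape
            dist = dist + ((x - y) ** 2).sum() / (c * h * w)
    return float(dist)
```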

Fréchet Inception Distance (FID): This metric is used to evaluate the quality of generated images in generative models such as GANs. It is considered a patch-based metric in some papers (Naderi et al. 2023), while in others it is considered a feature-based metric. A lower FID value signifies higher consistency between the two image sets. For video inpainting, researchers use VFID. The FID is formulated as follows:

$$\begin{aligned} FID = \Vert \mu _{\text {real}} - \mu _{\text {gen}} \Vert ^2 + \text {Tr}(\Sigma _{\text {real}} + \Sigma _{\text {gen}} - 2(\Sigma _{\text {real}}\Sigma _{\text {gen}})^{0.5}) \end{aligned}$$
(4)

where \(\Vert \cdot \Vert\) denotes the Euclidean distance, \(\mu _{\text {real}}\) and \(\mu _{\text {gen}}\) are the means of the feature representations of real and generated images, respectively, \(\Sigma _{\text {real}}\) and \(\Sigma _{\text {gen}}\) are the corresponding covariance matrices, and Tr() denotes the trace of a matrix.
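Given pre-extracted Inception features for the real and generated sets, Eq. (4) can be computed directly. The sketch below uses NumPy and SciPy and assumes the feature extraction (commonly the 2048-dimensional Inception-v3 pooling layer) has already been performed.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """FID following Eq. (4), computed from pre-extracted features.

    feats_real, feats_gen: arrays of shape (N, D), e.g. 2048-d activations
    of the Inception-v3 pooling layer for real and generated images.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```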

Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS) is used to assess the quality of inpainted images by evaluating how distinguishable they are from genuine images using a deep model such as an Inception network.

Unpaired Inception Discriminative Score (U-IDS) (Richard et al. 2018) evaluates a set of inpainted images against the real image distribution without direct one-to-one pairing, typically through statistical similarity.

Paired Inception Discriminative Score (P-IDS) (Shengyu et al. 2020) measures how well paired inpainted images (those compared one-to-one with their originals) are indistinguishable from real images.

Table 4 The performance of each method on PSV and Places2 image inpainting datasets using 40–50% as mask ratio

For video inpainting, PSNR and SSIM are also used to evaluate the proposed methods. In addition, VFID and the flow warping error (FWE), also denoted \(E_{warp}\), are used (Zhao et al. 2021; Lai et al. 2018; Wang et al. 2018b).

8.4 Results discussion

In this section, we compare the proposed methods in terms of the results obtained with the evaluation metrics on various image and video inpainting datasets. For image inpainting, the most commonly used datasets are PSV, Places2, CelebA-HQ, and FFHQ. For video inpainting, the proposed methods most commonly use the YouTube-VOS and DAVIS datasets. A comparison of the number of parameters of each model is also performed. This allows researchers to assess lightweight models with respect to computational-resource constraints, especially for image and video inpainting on high-resolution images/videos.

8.4.1 Evaluation on PSV and Places2 dataset

Table 4 shows the results obtained by transformer-based image inpainting methods in terms of the PSNR, SSIM, LPIPS, and FID metrics on the PSV and Places2 datasets. In this comparison, we present the results for the 40–50% mask ratio, which is the ratio used by the majority of papers. In addition, the table shows the image input size used in the experiments.

For the PSV dataset, the obtained results show that Blind-Omni-Wav-Net outperforms the other methods, achieving the highest PSNR value, which demonstrates its efficiency in reconstructing high-fidelity images. The methods proposed by Campana et al. and Wang et al. obtained the second-best PSNR results, about two points below Blind-Omni-Wav-Net, while the majority of the remaining methods exceed 24. In terms of SSIM, Blind-Omni-Wav-Net also achieved the highest value (0.905), ahead of the TransInpaint and BAT-Fill methods, indicating that it preserves the structural and textural integrity of the inpainted images. On the other hand, the TransInpaint and Campana et al. methods achieved the lowest LPIPS values, reflecting superior texture and detail accuracy, although the LPIPS values are close for all methods. For the FID metric, TransInpaint obtained the lowest score, indicating its effectiveness in generating images close to real images, which is consistent with its SSIM and LPIPS results.

In the same context, the comparison was performed on the Places2 dataset using the same evaluation metrics; almost all methods used this dataset in their experiments. The Blind-Omni-Wav-Net method obtained the highest PSNR and SSIM values, which demonstrates its effectiveness against the other proposed methods. In second place, CoordFill reached 26.365 for PSNR and 0.912 for SSIM. This is explained by its use of attentional FFC and by processing only the missing regions during the forward pass, while the remaining regions keep their original pixel values; for this reason, CoordFill can operate on high-resolution images. For the other methods, including AMT, ZITS, ZITS++, and TransCNN-HAE, the PSNR values were close. The CoordFill and ZITS++ methods obtained the lowest LPIPS scores, highlighting their proficiency in capturing and reproducing the complex textures and details of the varied scenes in the Places2 dataset. In terms of the FID metric, Campana et al. and SPN reached the lowest values. Comparing the metric values obtained by each method, we found that some methods perform well on one metric yet are not the best on others. This can be explained by the effectiveness of each method at a specific aspect, such as preserving high-resolution quality, preserving semantic similarity, or generating effective texture.

The results presented in Table 4 are for the 40–50% mask ratio; however, some methods report results for different mask ratios. These methods were collected and compared in terms of PSNR for each mask ratio, as illustrated in Fig. 10. We observed that the PSNR of these methods decreases as the mask ratio increases. On the PSV dataset, the method by Li et al. performed best at the 10–20% ratio, while at the 50–60% ratio, the APT method achieved the best PSNR value. On the Places2 dataset, the method by Li et al. was again the best at the 10–20% and 20–30% ratios, while SwMH was the best at the 40–50% and 50–60% ratios. Some methods, such as Campana et al., APT, and GCMAM, performed their experiments with only two or three of these ratios.

Fig. 10 Performance in terms of PSNR of image inpainting methods based on mask ratio

Table 5 The performance of each method on the CelebA-HQ and FFHQ image inpainting datasets using 40–50% as the mask ratio

8.4.2 Evaluation on the CelebA-HQ and FFHQ datasets

In addition to PSV and Places2, two other datasets are used by several transformer-based image inpainting methods: CelebA-HQ and FFHQ, both of which consist of human faces. To compare the proposed methods on these two datasets, we present the results obtained with the various metrics in Table 5. Most of the proposed methods reached convincing results in terms of the quality of the generated images and the precision of the filled region, as reflected by the SSIM metric. For example, using the PSNR metric on the CelebA-HQ dataset, 15 of the 20 methods reached a PSNR value greater than 24, while all SSIM values were above 80%. Blind-Omni-Wav-Net reached the best PSNR result of 28.21, followed by CoordFill, while for the SSIM metric, Blind-Omni-Wav-Net and TransInpaint generated the best results. For the LPIPS metric, the Campana et al. and CoordFill methods were the best. Each of these methods addresses a specific challenge and is therefore best on some metrics; for example, CoordFill uses a technique that preserves the pixels of the non-missing regions, making it better in terms of the PSNR and SSIM metrics.

Using the different mask ratios represented in Fig. 10, we can see the differences between the methods in terms of PSNR. Some methods, such as Blind-Omni-Wav-Net and CoordFill, are not represented in this figure because they did not report results for different mask ratios. Among the methods that use various mask ratios, the method of Li et al. was the best for almost all ratios except 10–20%. Furthermore, the SwMH method was not the best for ratios below 50%, but its results were close to the best for the 50–60% mask ratio.

Table 6 The performance of each method on the Places2 and CelebA-HQ image datasets using P-IDS and U-IDS metrics with a mask ratio > 50%

In the same context, the methods evaluated on the FFHQ dataset obtained lower values than on the other datasets. The Blind-Omni-Wav-Net method achieved the best PSNR and SSIM values, while ZITS++ was the best in terms of the LPIPS and FID metrics. The number of parameters of a model can be significant for its robustness, while it can also be a challenge in terms of computational resources. NDMAL and SWMH have the lowest numbers of parameters; however, their results were also lower than those of the other methods.

In conclusion, the results obtained by these inpainting methods across the datasets indicate not only the improvements made in the field, but also the impact of transformer-based techniques on the inpainting task. Furthermore, the diversity of the methods makes it possible to work on different aspects, such as image quality, structural similarity, or computational efficiency, for the purpose of generating realistic images.


8.4.3 Evaluation on Places2 and CelebA-HQ using P-IDS and U-IDS

Table 6 compares various image inpainting methods on two datasets, Places2 and CelebA-HQ, using the Paired and Unpaired Inception Discriminative Scores (P-IDS and U-IDS). These scores evaluate the capability of the methods to fill missing parts of images. The methods used a mask ratio greater than 50%, and higher scores indicate more effective inpainting.

On the Places2 dataset, the Co-Mod method achieves the highest scores at \(256\times 256\) resolution for both the P-IDS and U-IDS metrics, which demonstrates its capability in reconstructing scenes. At \(512\times 512\) resolution, MAT demonstrates the best performance on both scores, while the ZITS++ method also performs well in terms of U-IDS, indicating competitive feature-reconstruction capabilities at medium resolutions. For the CelebA-HQ dataset, MAT is the only method with both paired and unpaired scores available, which are lower than its scores on Places2, and UFFC provides only a U-IDS score.

Overall, the table shows how the different methods perform across resolutions and datasets, with distinct scores in each category. It illustrates the impact of resolution on inpainting quality and the effectiveness of techniques such as Co-Mod and MAT in dealing with the image inpainting task.

Comparing the different ViT-based methods on the various datasets, some methods succeed in filling the missing parts more efficiently than the other image inpainting methods. For example, on the PSV dataset, the best methods are Blind-Omni-Wav-Net (Phutke et al. 2023), Campana et al. (2023), BAT-Fill (Yu et al. 2021a), TransInpaint (Shamsolmoali et al. 2023), and CBNet (Jin et al. 2023), while Blind-Omni-Wav-Net (Phutke et al. 2023), CoordFill (Liu et al. 2023), and Campana et al. (2023) are the best on the Places2 dataset. On the CelebA-HQ dataset, CoordFill (Liu et al. 2023), Blind-Omni-Wav-Net (Phutke et al. 2023), TransInpaint (Shamsolmoali et al. 2023), and Campana et al. (2023) are the best.

Each of these image inpainting methods offers unique solutions to specific challenges in the field. For example, BAT-Fill distinguishes itself through the use of a bidirectional autoregressive transformer, focusing on long-range dependencies to provide diverse, high-quality content generation without directional constraints. In contrast, TransInpaint (Shamsolmoali et al. 2023) employs a context-adaptive transformer paired with a texture enhancement network, excelling in visual coherence and texture integration to ensure homogeneity between the inpainted regions and the surrounding context. Blind-Omni-Wav-Net (Phutke et al. 2023) addresses blind image inpainting without requiring mask information, using wavelet query multi-head attention to produce reasonable inpainted results without predetermined masks. CoordFill (Liu et al. 2023) tackles high-resolution images efficiently, using a continuous implicit representation and attentional Fast Fourier Convolution to manage large receptive fields, thus producing high-quality detail. CBNet (Jin et al. 2023) focuses on enhancing semantic relevance and structural coherence through its cascading network and adjacent transfer attention, dealing with large corrupted areas while maintaining textural integrity. Lastly, Campana et al. (2023) propose a variable-hyperparameter vision transformer that optimizes computational efficiency, making it highly suitable for high-resolution images.

Table 7 The performance of each method on the YouTube-VOS and DAVIS video datasets
Fig. 11 Results obtained on three videos from the DAVIS dataset. First column: video frame with mask. Second column: DSTT. Third column: FuseFormer. Fourth column: E2FGVI. Fifth column: ProPainter

8.4.4 Evaluation on YouTube-VOS and DAVIS video datasets

To evaluate the proposed methods for video inpainting, we performed a comparison on the most used datasets, YouTube-VOS and DAVIS. The obtained results are presented in Table 7 using different metrics, including PSNR, SSIM, VFID, LPIPS, and \(E_{warp}\). On both datasets, all methods used the PSNR, SSIM, and VFID metrics for evaluation, while only some of the methods used the other two metrics.

For the YouTube-VOS dataset, all the obtained PSNR values exceed 33; FGT++ was the best with a value of 35.02, followed by FGT, DMT, ProPainter, and FSTT, which reached 34. The same methods obtained close results in terms of the SSIM and VFID metrics. The best SSIM value was 97%, which reflects the high quality now reached in video inpainting. In addition, compared with the image inpainting results, the results obtained on videos are better.

Using the DAVIS dataset for video inpainting, the ProPainter and DLFormer methods reached the best PSNR values, with a difference of 0.2 between them. The other methods also achieved close PSNR values, within about 1 point for most of them, and the same observation holds for the SSIM metric. These results demonstrate the capability of these methods to inpaint videos with convincing performance. This remains true when comparing the results on the YouTube-VOS and DAVIS datasets; the results are similar even though the scenes of the two datasets are different.

To illustrate the results obtained with the metrics, we tested the methods with available source code, namely DSTT, FuseFormer, E2FGVI, and ProPainter, on three videos from the DAVIS dataset. The obtained results are illustrated in Fig. 11. The masks used are shown in the first column and the remaining columns show the obtained results. None of the methods fully succeeds in inpainting the bus, although ProPainter inpaints it better than the others. For the second video, the results are good, while for the third video, the methods succeed in inpainting most parts of the object; however, the shadow of the player remains with E2FGVI and DSTT. Overall, the transformer-based methods improve the quality of the video inpainting task.

9 Image/video inpainting challenges

Deep learning, an aspect of artificial intelligence (AI) built on artificial neural networks, has become the dominant approach for computer vision and robotics tasks, learning classification and recognition objectives from task-specific data. For image inpainting, the process of filling in missing or damaged areas of an image poses several challenges. These challenges can be divided into those related to computer vision and network architectures and those related to image inpainting itself, such as the quality of the images. Furthermore, some challenges are specific to transformer-based methods. We discuss the following set of challenges in detail.

Preservation of Semantics: Inpainting algorithms must preserve the semantic content of the image while filling in missing regions. The filled-in areas should blend seamlessly with the surrounding context and maintain the overall meaning of the image. Transformer-based models may struggle with preserving spatial coherence in inpainted regions, especially when dealing with complex textures or intricate structures. Ensuring smooth transitions and consistent patterns across the inpainted areas remains a challenge. In addition, inpainting requires synthesizing textures and structures to replace missing regions. Generating realistic textures that match the surrounding areas and maintaining structural coherence is essential for producing convincing inpainted results.

Context Understanding: Transformer models are effective at capturing long-range dependencies in sequential data, but understanding contextual information in images can be challenging. For image or video inpainting, understanding the global context of the image, including scene semantics and object relationships, is crucial for generating realistic and coherent inpainted results. In addition, inpainting algorithms need to accurately reconstruct missing edges and ensure smooth transitions between filled-in and original regions, which can be a challenge for deep learning models, including transformer-based models. Furthermore, handling missing regions of different scales, from small scratches to large objects, and removing noise or artifacts can complicate the inpainting process.

Complexity of Architecture: In the literature, the architectures used for feature extraction are generally complex, making them challenging to train, interpret, and optimize (Gatys et al. 2016). Balancing model complexity with performance requirements is crucial but difficult to achieve, given constraints such as computational resources, especially for large-scale datasets, and the number of parameters and operations (GFLOPs) of each model. In addition, training a complex architecture on some specific tasks can be more time-consuming than on others. For transformer-based models, the self-attention mechanism adds further complexity; thus, efficient implementation strategies and optimization techniques are required to make transformer-based inpainting methods practical for real-world use cases.

Overfitting: Deeper feature extraction architectures are sensitive to overfitting, where the model memorizes the training data rather than learning generalizable features (Rice et al. 2020). Tuning regularization parameters, such as dropout and weight decay, can minimize the impact of this problem; however, finding the right combination requires many experiments and can change from one task to another.

Data quality requirements: Training a CNN model requires large-scale annotated datasets, which can be expensive, time-consuming, or even unavailable for certain domains or applications. Data augmentation techniques can help in some cases but may not provide representative training data for all scenarios. The quality of the data also represents a challenge for deep learning architectures; for example, high-resolution images usually yield good results, but training on them requires substantial computational resources, which is another challenge. For transformer models applied to high-resolution images, training requires dividing the image into smaller patches or applying hierarchical approaches, which can affect quality.

Computational Resources: Deep-learning-based models require significant computational resources, including powerful GPUs or TPUs, for training and inference. Scaling CNNs to handle larger datasets or more complex architectures increases the demand for computational resources and limits accessibility for researchers. The same observation holds for transformer-based models, which are computationally intensive; achieving real-time or interactive inpainting is challenging and requires large-scale training data to obtain meaningful representations.

Domain Adaptation: CNNs trained on a specific dataset or task may not be suitable for different datasets or real-world environments due to domain shifts or biases. Adapting pre-trained CNNs to new domains or tasks with limited annotated data represents a challenge, especially when the target domain differs significantly from the source domain (Farahani et al. 2021), as shown by the feature extraction models used for specific tasks in the previous section. For transformer-based models, generating diverse and high-quality training data for image inpainting tasks, particularly for specific image types, can be challenging, and it is equally challenging to ensure that the model generalizes well to unseen data and various inpainting scenarios.

Temporal Consistency: Maintaining temporal consistency across frames in a video is difficult. Each frame must be inpainted in a way that ensures smooth transitions and coherence over time, to avoid noticeable discontinuities. Also, video inpainting requires understanding both spatial (within a frame) and temporal (across frames) information. Designing a model that effectively integrates these two aspects is challenging.

Complex Motion in Complex Backgrounds: Videos often contain complex and dynamic movements, which transformers need to understand and predict accurately for consistent inpainting. Filling in missing areas where objects or complex backgrounds move or change requires an in-depth understanding of the scene, which can be hard to achieve.

Diffusion models and their impact on inpainting: Diffusion models represent a new technique in the computer vision domain, including for image and video inpainting tasks. They are generative frameworks that progressively add noise to data in a forward process and then learn to reverse this process to denoise and generate new samples. This strategy improves computer vision tasks, but diffusion models are often complex and require a large number of parameters, leading to high computational and memory demands, which is especially challenging when processing high-resolution images or long video sequences. Moreover, the iterative nature of diffusion models means that generating an image or video frame can be slow, as it typically involves many forward and backward passes through the network. Reducing the computational load while maintaining quality is a key challenge.

The initial step involves meticulous data preparation. A diverse dataset of images, including untouched originals and their corresponding versions with missing parts (masked), is required. Proper preprocessing ensures the model can generalize effectively across various inpainting scenarios. The architectural backbone for diffusion models typically includes a U-Net or similar encoder-decoder architecture. U-Nets, known for their efficiency in generating high-quality outputs in tasks like segmentation, facilitate handling the noised images and predicting their denoised, inpainted counterparts.

Training the model requires a strategic application of noise schedules where noise is incrementally added to the images in the forward process. The model is then trained to denoise these images, learning the reverse process by minimizing the difference between predicted and original images. Incorporating loss functions like Mean Squared Error alongside perceptual or adversarial losses ensures both the accuracy and realism of the inpainted results.
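As a rough sketch of this training procedure, the following PyTorch fragment applies a linear noise schedule, noises the clean image at a random timestep, and trains a hypothetical U-Net `model(noisy, mask, t)` to predict the added noise with an MSE objective. The schedule values, the mask conditioning, and the model signature are all illustrative assumptions rather than a definitive implementation.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumption)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, optimizer, x0, mask):
    # x0: (B, 3, H, W) clean images; mask: (B, 1, H, W), 1 = known pixels
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                                   # random timesteps
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise            # forward noising
    pred_noise = model(x_t, mask, t)                                # hypothetical U-Net
    loss = F.mse_loss(pred_noise, noise)                            # MSE denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```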

During inference, the trained diffusion model takes masked images (images with missing parts) as input, iteratively denoising them to fill in the gaps. Ensuring consistency with the noise schedule and reverse process learned during training is crucial for this stage. After initial training, fine-tuning the model on specific domains can enhance performance significantly. Rigorous testing on a variety of inpainting tasks guarantees the model’s robustness and capability to produce high-quality results.
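A minimal sampling sketch in the spirit of RePaint-style inpainting, reusing the schedule and model signature assumed above: at each reverse step the known pixels are re-imposed from a noised copy of the input so that only the masked region is synthesized. The deterministic DDIM-style update is a simplification of the full reverse process.

```python
import torch

@torch.no_grad()
def inpaint(model, x_known, mask, T=1000):
    # Reuses `alphas_cumprod` and the model signature from the training sketch.
    # mask: 1 for known pixels, 0 for the region to synthesize.
    x = torch.randn_like(x_known)
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = model(x, mask, torch.tensor([t]))
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps      # deterministic reverse step
        known = a_prev.sqrt() * x_known + (1 - a_prev).sqrt() * torch.randn_like(x)
        x = mask * known + (1 - mask) * x                           # re-impose known pixels
    return x
```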

For continuous improvements, experimenting with different noise schedules, network architectures, and training strategies is advisable. Optimizations like using ensembles or multi-scale approaches can further enhance the inpainting quality and robustness. With these steps, the potency of diffusion models can be effectively harnessed to tackle image inpainting challenges, producing visually compelling and accurate results.

10 Concluding remarks and future directions

In this paper, we reviewed several research papers on image and video inpainting techniques based on vision transformers, including their ability to capture long-range dependencies and model complex relationships within images. The proposed methods attempt to improve the task in terms of efficiency and information preservation, achieving both realistic textures and structures.

Image and video inpainting have advanced significantly with the rise of deep learning, notably CNNs and GANs, which excel at filling missing or damaged regions while preserving context. Recently, transformer-based architectures have emerged as promising alternatives, leveraging self-attention mechanisms to understand global context effectively. This paper undertakes a comprehensive review, focusing on transformer-based techniques for image and video inpainting. Through a systematic categorization based on architectural configurations, types of damages, and performance metrics, we aim to demonstrate the significant progress and offer guidance to aspiring researchers in the field.

In the domain of transformer-based image and video inpainting, a notable challenge lies in refining the model’s ability to effectively handle complex and dynamic visual contexts. This requires developing mechanisms that can seamlessly integrate temporal information in video sequences while preserving spatial coherence, thus ensuring the faithful reconstruction of missing regions. Additionally, addressing the computational cost associated with large-scale transformer architectures demands innovative strategies for optimizing efficiency without compromising performance, thereby enabling real-time inpainting for practical applications. Furthermore, enhancing the model’s robustness to diverse and challenging inpainting scenarios, such as occlusions, irregular shapes, and varying textures, remains a critical frontier in advancing the capabilities of this transformative technology.

In terms of future research directions, several open questions remain in the realm of image and video inpainting, particularly concerning transformer-based techniques. Key avenues for further exploration include enhancing the handling of long-range dependencies to improve inpainting accuracy, investigating the performance of transformer-based approaches on diverse datasets beyond standard image and video formats to uncover new challenges and opportunities, and refining the realism and consistency of inpainted regions, especially in scenarios involving intricate textures or complex structures. Additionally, addressing temporal consistency across frames in video inpainting and ensuring robustness to various damage types, such as occlusions, corruptions, and missing data, are crucial areas for future research. Furthermore, optimizing the efficiency and scalability of transformer-based architectures for large-scale datasets or real-time applications remains an ongoing challenge. By addressing these open questions, the field can advance towards more versatile, robust, and efficient inpainting solutions.