Zero-Shot Image Denoising for High-Resolution Electron Microscopy

Xuanyu Tian, Zhuoya Dong, Xiyue Lin, Yue Gao, Hongjiang Wei, Yanhang Ma, Jingyi Yu, and Yuyao Zhang

Corresponding authors: Jingyi Yu; Yuyao Zhang. Xuanyu Tian is with the School of Information Science and Technology, ShanghaiTech University, 201210, Shanghai, China and Lingang Laboratory, 20031, Shanghai, China (e-mail: tianxy@shanghaitech.edu.cn). Xiyue Lin, Jingyi Yu and Yuyao Zhang are with the School of Information Science and Technology, ShanghaiTech University, 201210, Shanghai, China (e-mail: linxy2022@shanghaitech.edu.cn; yujingyi@shanghaitech.edu.cn; zhangyy8@shanghaitech.edu.cn). Zhuoya Dong and Yanhang Ma are with the School of Physical Science and Technology, ShanghaiTech University, 201210, Shanghai, China (e-mail: dongzhy@shanghaitech.edu.cn; mayh2@shanghaitech.edu.cn). Yue Gao is with the BNRist, THUIBCS, KLISS, BLBCI, School of Software, Tsinghua University, Beijing 100084, China (e-mail: gaoyue@tsinghua.edu.cn). Hongjiang Wei is with the School of Biomedical Engineering and Institute of Medical Robotics, Shanghai Jiao Tong University, 200127 Shanghai, China (e-mail: hongjiang.wei@sjtu.edu.cn). The code for this work is available at https://github.com/MeijiTian/ZS-Denoisier-HREM
Abstract

High-resolution electron microscopy (HREM) is a powerful technique for directly visualizing a broad range of materials in real space. However, denoising HREM images is challenging because of their ultra-low signal-to-noise ratio (SNR) and the scarcity of available data. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within this framework, we propose a super-resolution (SR) based self-supervised training strategy that incorporates a Random Sub-sampler module. The Random Sub-sampler generates a practically unlimited number of noisy pairs from a single noisy image, serving as an effective data augmentation for zero-shot denoising. Noise2SR trains the network on paired noisy images of different resolutions via an SR strategy. The SR-based training lets the network use more pixels for supervision, and the random sub-sampling compels the network to learn continuous signals, which enhances robustness. Meanwhile, we mitigate the uncertainty caused by random sub-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With this combination of training strategy and proposed designs, Noise2SR achieves superior denoising performance using a single noisy HREM image. We evaluate Noise2SR on both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves denoising performance comparable to that of supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.

Index Terms:
Zero-shot, Electron Microscopy, Denoising, Self-supervised

I Introduction

High-resolution electron microscopy (HREM) imaging [1, 2, 3] is an indispensable tool in the fields of materials science and nanotechnology. HREM enables direct visualization of structures at the atomic level through interactions between the sample and high-energy electrons.

However, HREM is inevitably susceptible to noise owing to the inherent properties of electron beams, the detection process, and other factors. For example, low-dose conditions are often used when imaging electron-beam-sensitive materials to minimize damage from the electrons [4, 5]. When dynamic events are captured at kilohertz frame rates with direct electronic detection systems [6, 7], the image is affected by severe shot noise due to the shortened exposure time. Improving the signal-to-noise ratio (SNR) is critical for HREM to enhance image quality and facilitate accurate information extraction.

Recently, data-driven methods based on deep learning (DL) [8, 9, 10] have performed favorably compared with conventional methods in image denoising. However, applying supervised DL denoising methods to electron microscopy (EM) images is challenging due to the lack of paired noisy-clean image datasets. In the absence of ground-truth images, several self-supervised image denoising methods [11, 12, 13, 14, 15, 16, 17, 18, 19] have been proposed. Some works [12, 11, 13, 17] utilize blind-spot networks (BSNs) to prevent the network from learning an identity mapping in self-supervised learning. A BSN aims to eliminate the influence of each input pixel on the corresponding output pixel so as to satisfy the $\mathcal{J}$-invariance property [11]. Noise2Void [12] and Noise2Self [11] employ a mask-based strategy, while subsequent works [14, 13, 20, 17] design tailored networks to build BSNs. The scarcity of HREM data poses challenges for BSN-based methods, whose performance tends to degrade when trained with limited data [15]. Huang et al. propose Neighbor2Neighbor (NB2NB) [19], which generates paired noisy images by sub-sampling paired neighbor pixels for self-supervised training. However, NB2NB and its variants [21, 22] fail to address the intensity gap between neighbor pixels, resulting in relatively poor performance in low-SNR scenarios. Some works [18, 23] introduce explicit noise modeling and generate paired noisy training images by adding synthetic noise. However, the noise distribution of real HREM images is unknown and challenging to estimate because of their extremely low SNR.

In this paper, we propose an efficient zero-shot self-supervised denoising framework for HREM images named Noise2SR. We propose to train a denoising network with paired noisy images of different resolutions via a super-resolution (SR) strategy. Inspired by NB2NB [19], we introduce the Random Sub-sampler module to generate sub-sampled noisy images that form a noisy pair with the original noisy image. Unlike NB2NB, we utilize paired noisy images with different resolutions for training. We show that the sub-sampled noisy image and the original noisy image share a consistent underlying clean image. Meanwhile, we provide a theoretical proof that the Noise2SR training scheme is statistically equivalent to using a clean image for supervision. Combined with the SR-based training strategy, we address the coordinate mismatch and intensity gap issues present in the paired neighbor pixels of NB2NB. The proposed Random Sub-sampler helps to break the spatial correlation of real-world noise and serves as an effective data augmentation for enhancing zero-shot image denoising performance. We adopt minimum mean squared error (MMSE) estimation to mitigate the uncertainty caused by random sub-sampling and further enhance the denoising performance. By integrating the SR training strategy, random sub-sampling, and MMSE estimation, Noise2SR performs strongly in single-image denoising, particularly in scenarios with extremely low SNR such as HREM.

We conduct a series of experiments on both simulated and real HREM images to demonstrate the effectiveness and superiority of the proposed Noise2SR framework for HREM image denoising. The main contributions of this work can be summarized as follows:

  1. We propose Noise2SR, which efficiently improves the signal-to-noise ratio (SNR) of single HREM images without involving any external dataset. To the best of our knowledge, Noise2SR could be one of the first zero-shot self-supervised denoising methods for HREM images.

  2. We propose a novel self-supervised training scheme that incorporates SR strategies, makes no noise model assumptions, and can be combined with any network or framework.

  3. We propose the Random Sub-sampler, which serves as an effective data augmentation in network training, and incorporate MMSE estimation to produce reliable denoised results in ultra-low SNR scenarios.

  4. By incorporating the training scheme and these designs, our method performs very favorably against state-of-the-art self-supervised denoising methods in HREM image denoising.

The quantitative and qualitative results demonstrate the effectiveness of our methods in enhancing the SNR of simulated HREM images and low-dose HREM images of two electron beam sensitive zeolites.

This work is built upon our previous work [24] and introduces several notable improvements. Firstly, we extend the previous work to address the denoising of single low-dose HREM data. Secondly, we propose an approximate MMSE estimation in the inference stage, which enhances the denoising performance and provides stability in the context of single HREM image denoising. Furthermore, we conduct a comprehensive discussion on the proposed Random Sub-sampler module and evaluate its effectiveness within the overall framework.

II Related Work

II-A Image Denoising Without Clean Signal Prior

In recent years, supervised image denoising methods based on deep neural networks, e.g., DnCNN [8], have achieved great success and outperform conventional image denoisers. However, acquiring aligned noisy-clean image pairs is infeasible or impractical in many scientific imaging applications, such as electron microscopy, which limits the use of supervised deep learning approaches.

Noise2Noise (N2N) [25] is the first work that proposes training a deep denoiser using paired noisy images of the same scene and demonstrates that it is statistically equivalent to supervised learning. Subsequently, self-supervised denoising methods have been proposed that enable the denoiser to be trained from individual noisy images without paired noisy images. Noise2Void (N2V) [12] and Noise2Self (N2S) [11] design mask-based blind-spot networks (BSNs) for self-supervised learning. Specifically, a mask-based BSN replaces certain pixels of the input noisy image and predicts their values from neighboring pixels. However, mask-based BSNs suffer from the limited number of pixels available for supervision and from imperfect replacement strategies, leading to inefficient training and artifacts in the denoised results [13, 16]. To address these issues, later works [13, 20, 14, 17, 26] design novel BSN architectures that incorporate centrally masked convolutions and dilated convolutions. However, these methods are limited by their high computational cost and inflexible network structure. Different from BSN methods, another category of self-supervised learning approaches generates approximate paired noisy images from individual noisy images. For example, Noisier2Noise [18] generates paired noisy images by adding additional noise to noisy images, and similar ideas are explored in [27, 23]. However, these methods require a known noise model of the original noisy images to guarantee denoising performance. Additionally, Neighbor2Neighbor (NB2NB) [19] sub-samples the neighbor pixels of the noisy image to generate paired noisy images. Since a gap exists between the underlying ground truths of the paired neighbor sub-sampled images, NB2NB has limited performance in low-SNR scenarios.

II-B Advances in Unsupervised and Zero-shot Image Restoration

Deep image prior (DIP) [28] first demonstrated that convolutional neural networks (CNNs) inherently capture image priors through their inductive bias. DIP proposed using early-stopping strategies for recovering corrupted images with CNNs. Subsequently, several works leverage various approaches such as denoising [29], GANs [30] or diffusion models [31] to learn the prior image distribution for zero-shot image restoration. However, these approaches require a substantial amount of data for pre-training to learn the data distribution, which is impractical in the context of HREM. Therefore, we focus on zero-shot restoration methods that train on only a single corrupted image without additional external data. Compared with dataset-based training, zero-shot learning makes it significantly harder to avoid overfitting. In image dehazing, several works [32, 33] have achieved remarkable performance by training on a single hazy image. For zero-shot image denoising, Self2Self [15] incorporates dropout regularization into a blind-spot network framework to mitigate overfitting. FBI-Denoiser [17] showed superior performance by carefully designing the BSN. ZS-Noise2Noise [22] introduced a lightweight network to achieve comparable performance in zero-shot denoising. Chen et al. [34] proposed a zero-shot method for medical image artifact reduction.

Figure 1: Pipeline of the proposed Noise2SR framework. A. Training Phase: First, the Random Sub-sampler takes a noisy image $\mathbf{y}$ as input and generates a sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ along with the corresponding unsampled mask $\mathbf{m}_{J}$. Then, the network $f_{\theta}$ takes the sub-sampled image $\mathbf{y}_{J^{\complement}}$ as input and generates a full-resolution denoised image $f_{\theta}(\mathbf{y}_{J^{\complement}})$. The network is optimized by computing the loss on the difference between the unsampled noisy pixels $\mathbf{y}_{J}$ and the output of the network. B. Inference Phase: A sub-sampled noisy set $\mathcal{Y}$ is obtained by sub-sampling the noisy image $\mathbf{y}$ $M$ times with the Random Sub-sampler. Given the sub-sampled noisy set $\mathcal{Y}$, the well-trained network $f_{\hat{\theta}}$ generates a set of plausible denoised images $\hat{\mathcal{X}}$. Finally, the clean image is estimated by averaging the images in $\hat{\mathcal{X}}$, i.e., by MMSE estimation.

II-C Electron Microscopy (EM) Denoising

Conventional spatial-filter-based methods, such as the bilateral filter, Non-local Means, and BM3D, have been applied to EM. Recently, deep learning based methods have been applied in EM imaging [35, 36]. Refs. [37, 38] proposed using Cycle-GAN for STEM image denoising without paired training images. Chong et al. [39] utilized paired noisy images to denoise real-world optical and electron microscopy data. Refs. [40, 41] proposed the simulation-based denoising (SBD) framework, which creates large simulated datasets to train CNNs for denoising in a supervised manner. Mohan et al. [41] introduced simulated paired clean/noisy HREM images of the same substance for various imaging parameters. However, creating large simulated datasets can be time-consuming, and the performance of such methods degrades sharply due to the domain gap between simulated and real noisy images. Thus, a zero-shot image denoising method that is robust to ultra-low SNR is needed to enhance HREM data quality effectively.

III Methodology

Figure 2: Illustration of the random sub-sampling operations in the Random Sub-sampler with sampling stride $s$. The input image is first pixel-unshuffled with a stride of $s$. At each location in the pixel-unshuffled image, the Random Sub-sampler randomly selects $1$ element along the channel dimension to compose the sub-sampled image $\mathbf{y}_{J^{\complement}}$. Meanwhile, the sampler sets the unsampled pixel mask $\mathbf{m}_{J}$ to 0 at the location of each selected element and to 1 elsewhere (black represents 0, and white represents 1). The example in the figure shows the Random Sub-sampler with sampling stride $s=2$ applied to an input image of size $4\times 4$.

We introduce a novel self-supervised training framework called Noise2SR, which enables the training of a denoiser using paired noisy images with different resolutions obtained from an individual noisy observation. The effectiveness of our framework is supported by the theoretical proof of the $\mathcal{J}$-invariant property [11]. Specifically, we introduce a novel random sub-sampling strategy to generate training image pairs of different resolutions. Subsequently, these pairs are utilized in conjunction with a super-resolution (SR) neural network. Such an approach effectively leverages the inherent relationship in signal content while minimizing noise correlation, thus significantly enhancing the efficiency of the denoiser training process. We provide a comprehensive overview of the training and inference stages within the Noise2SR framework. The overall architecture of our proposed Noise2SR framework is illustrated in Fig. 1.
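The inference stage in Fig. 1 (sub-sample $M$ times, run the network on each sub-sampled image, average the outputs) can be sketched as follows. This is only an illustrative sketch: `mmse_estimate`, `subsample_stride2`, and `upsample_nearest` are hypothetical names, the stride is fixed to 2, and nearest-neighbour upsampling stands in for the trained network $f_{\hat{\theta}}$.

```python
import numpy as np

def subsample_stride2(y, rng):
    """Hypothetical stand-in: keep one random pixel from each 2x2 cell."""
    h, w = y.shape[0] // 2, y.shape[1] // 2
    cells = y.reshape(h, 2, w, 2).transpose(0, 2, 1, 3).reshape(h, w, 4)
    idx = rng.integers(0, 4, size=(h, w, 1))
    return np.take_along_axis(cells, idx, axis=2)[..., 0]

def upsample_nearest(sub):
    """Hypothetical stand-in for the trained network: 2x nearest-neighbour upsampling."""
    return np.repeat(np.repeat(sub, 2, axis=0), 2, axis=1)

def mmse_estimate(y, denoise_fn, subsample_fn, M, rng):
    """Average M full-resolution outputs obtained from M random sub-samplings of y."""
    outputs = [denoise_fn(subsample_fn(y, rng)) for _ in range(M)]
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
x = np.ones((32, 32))                               # toy clean image
y = x + rng.normal(0.0, 0.5, size=x.shape)          # toy noisy observation

single = upsample_nearest(subsample_stride2(y, rng))
averaged = mmse_estimate(y, upsample_nearest, subsample_stride2, M=50, rng=rng)

mse_single = np.mean((single - x) ** 2)
mse_avg = np.mean((averaged - x) ** 2)
```

Even with this crude stand-in network, averaging over many random sub-samplings reduces the error relative to a single pass, which is the motivation for the averaging step at inference.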

III-A Related Theory Revisit

III-A1 Noise2Noise

In the absence of clean images for supervised learning, Noise2Noise (N2N) [25] proposed training a denoising network with paired noisy measurements of the same scene. Given paired noisy measurements $\{\mathbf{y}_{1},\mathbf{y}_{2}\}$, where $\mathbf{y}_{1}=\mathbf{x}+\mathbf{n}_{1}$ and $\mathbf{y}_{2}=\mathbf{x}+\mathbf{n}_{2}$, N2N demonstrates that learning the mapping between paired noisy measurements statistically yields the same solutions as supervised learning with clean images:

$$
\mathbb{E}_{\mathbf{x},\mathbf{y}_{1},\mathbf{y}_{2}}\left\|f_{\theta}(\mathbf{y}_{1})-\mathbf{y}_{2}\right\|_{2}^{2}=\mathbb{E}_{\mathbf{x},\mathbf{y}_{1},\mathbf{y}_{2}}\left\|f_{\theta}(\mathbf{y}_{1})-\mathbf{x}\right\|_{2}^{2}+\sigma^{2},
\tag{1}
$$

where $\sigma^{2}$, the variance of the noise $\mathbf{n}$, is a constant.

III-A2 $\mathcal{J}$-invariant Denoiser

Under the assumption that the noise is zero-mean and pixel-wise independent, Noise2Self [11] proved that training a denoising network with individual noisy measurements is possible if the network is $\mathcal{J}$-invariant.

Definition 1 [11] Let $\mathcal{J}$ be a partition of the dimensions $\{1,\dots,n\}$ and let $J\in\mathcal{J}$. A function $f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is $J$-invariant if $f(\mathbf{x})_{J}$ does not depend on the value of $\mathbf{x}_{J}$. It is $\mathcal{J}$-invariant if it is $J$-invariant for each $J\in\mathcal{J}$.
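As a small illustration of Definition 1, the sketch below checks $J$-invariance numerically for a toy function. The choices here are assumptions made for illustration only: `neighbor_mean` (a 4-neighbour average with periodic boundaries) is not a denoiser from the paper, and $J$ is taken to be one colour of a checkerboard, so that every neighbour of a pixel in $J$ lies outside $J$.

```python
import numpy as np

def neighbor_mean(x):
    """Replace each pixel by the mean of its 4 neighbours (periodic boundary)."""
    return 0.25 * (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                   + np.roll(x, 1, 1) + np.roll(x, -1, 1))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))                 # even side lengths keep the checkerboard periodic
ii, jj = np.indices(x.shape)
J = (ii + jj) % 2 == 0                      # one block of the checkerboard partition

x_perturbed = x.copy()
x_perturbed[J] += rng.normal(size=J.sum())  # change only the values x_J

out = neighbor_mean(x)
out_perturbed = neighbor_mean(x_perturbed)

# f(x)_J is unaffected by x_J, while the outputs outside J do change.
unchanged_on_J = np.allclose(out[J], out_perturbed[J])
changed_off_J = not np.allclose(out[~J], out_perturbed[~J])
```

The same argument applies with the roles of the two checkerboard colours swapped, so this toy function is $\mathcal{J}$-invariant for the two-block checkerboard partition.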

III-B Super-resolved Based Denoising Methods

We propose training a denoiser using pairs of noisy images with different resolutions, generated from an individual image. These paired images consist of a sub-sampled noisy image $S(\mathbf{y})$ and its corresponding full-resolution noisy image $\mathbf{y}$. Following the N2N strategy, the self-supervised loss can be generally stated as

$$
\min_{\theta}\mathbb{E}\left\|f_{\theta}(S(\mathbf{y}))-\mathbf{y}\right\|^{2}.
\tag{2}
$$

However, directly minimizing the loss function above may cause the network to learn an identity mapping for the noisy input pixels. Thus, we propose using only the unsampled noisy pixels for network supervision. Under this supervision, we can theoretically prove that training the network with noisy images is equivalent to using clean signals for supervision.

Based on the $\mathcal{J}$-invariant theory, the sub-sampling operation partitions the image with $n$ pixels into two disjoint sets denoted as $J$ and $J^{\complement}$, where $|J|+|J^{\complement}|=n$. Consequently, the sub-sampled noisy image $S(\mathbf{y})$ can be represented as $\mathbf{y}_{J^{\complement}}$, while the unsampled noisy pixels $\mathbf{y}\setminus S(\mathbf{y})$ are denoted as $\mathbf{y}_{J}$. When training a network using paired noisy images with different resolutions, the resulting network naturally becomes a $\mathcal{J}$-invariant function. This is because the output $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}$ does not rely on the specific values of $\mathbf{y}_{J}$.
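A minimal sketch of supervising only on the unsampled pixels $\mathbf{y}_{J}$ is given below. It assumes the mask convention of Fig. 1 (1 on unsampled pixels, 0 on sampled ones); the function name `unsampled_pixel_loss` and the normalisation by $|J|$ are illustrative assumptions rather than details stated in the text.

```python
import numpy as np

def unsampled_pixel_loss(pred, y, mask_J):
    """Mean squared error restricted to the unsampled pixels J.

    pred   : full-resolution network output f_theta(y_{J^c}), shape (H, W)
    y      : original noisy image, shape (H, W)
    mask_J : 1 on unsampled pixels (set J), 0 on sampled pixels (J^c)
    """
    diff = (pred - y) * mask_J
    return np.sum(diff ** 2) / np.sum(mask_J)

y = np.array([[1.0, 2.0], [3.0, 4.0]])
mask_J = np.array([[0.0, 1.0], [1.0, 1.0]])   # pixel (0, 0) was sampled
pred = np.array([[100.0, 2.0], [3.0, 5.0]])
loss = unsampled_pixel_loss(pred, y, mask_J)

# Changing the prediction at the sampled pixel leaves the loss untouched,
# so copying the noisy input there earns the network nothing.
pred_copy = pred.copy()
pred_copy[0, 0] = y[0, 0]
loss_copy = unsampled_pixel_loss(pred_copy, y, mask_J)
```

Because sampled pixels never enter the loss, reproducing the noisy input at those locations is not rewarded, which is how the identity mapping is avoided.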

Theorem 1. Let $\mathbf{y}=\mathbf{x}+\mathbf{n}$ be an image corrupted by zero-mean noise $\mathbf{n}$ with variance $\sigma^{2}$. Let $\mathbf{y}_{J^{\complement}}$ be the sub-sampled noisy image of $\mathbf{y}$ and $\mathbf{y}_{J}$ be the unsampled noisy pixels of the original image $\mathbf{y}$. Suppose the noise $\mathbf{n}$ is independent of the clean image $\mathbf{x}$ and is pixel-wise independent. Then it holds that:

$$
\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}.
\tag{3}
$$

The proof is given below:

Proof:

$$
\begin{aligned}
\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}
&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}-\mathbf{n}_{J}\right\|_{2}^{2}\\
&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}\\
&\quad-2\,\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\langle f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J},\mathbf{n}_{J}\right\rangle.
\end{aligned}
\tag{4}
$$

Because $\mathbf{n}_{J}$ is independent of both $\mathbf{y}_{J^{\complement}}$ and $\mathbf{x}_{J}$, it holds that:

$$
\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\langle f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J},\mathbf{n}_{J}\right\rangle=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left[f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right]\mathbb{E}_{\mathbf{x},\mathbf{y}}\left[\mathbf{n}_{J}\right].
\tag{5}
$$

Since the noise is zero-mean, i.e., $\mathbb{E}_{\mathbf{x},\mathbf{y}}(\mathbf{n}_{J})=0$, we have:

$$
\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}.
\tag{6}
$$

Since $\sigma^{2}$ is constant, Theorem 1 states that optimizing the self-supervised loss function over the proposed training scheme yields the same solutions as the supervised loss function.
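The relation in Eq. (6) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original derivation: the clean values, the noise level `sigma`, and the stand-in predictor `pred` are all hypothetical choices, and the loss is averaged per pixel so that the constant offset equals $\sigma^{2}$. The only property that matters is that the predictor never sees the noise $\mathbf{n}_{J}$ of the target pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                   # assumed noise standard deviation
x = rng.uniform(0.0, 1.0, size=200_000)       # stand-in clean pixel values x_J
n = rng.normal(0.0, sigma, size=x.shape)      # zero-mean noise n_J, independent of x
y = x + n                                     # noisy targets y_J

# Hypothetical stand-in for f_theta(y_{J^c})_J: any predictor independent of n_J.
pred = 0.8 * x + 0.1

self_supervised = np.mean((pred - y) ** 2)    # left-hand side of Eq. (6), per pixel
supervised = np.mean((pred - x) ** 2)         # first term on the right-hand side
gap = self_supervised - supervised            # should be close to sigma**2

print(f"gap = {gap:.4f}, sigma^2 = {sigma ** 2:.4f}")
```

Because the gap does not depend on the predictor, ranking predictors by the self-supervised loss gives the same ordering as ranking them by the supervised loss.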

III-C Generating Sub-sampled Image Randomly

Self-supervised image denoising methods typically assume that noise is signal-independent and spatially uncorrelated. Therefore, how the noisy image is sub-sampled is crucial for maintaining denoising performance when training with paired noisy images of different resolutions. Recently, pixel-shuffling downsampling (PD) [42, 26] has emerged as an effective technique for breaking the spatial correlation of real-world noise. Motivated by the success of PD, we introduce a Random Sub-sampler for generating sub-sampled noisy images. The Random Sub-sampler follows a sub-sampling strategy similar to that of PD but introduces additional randomness into the sub-sampling operation. This randomness serves as a data augmentation strategy, which helps to prevent overfitting during training.

The process of using the Random Sub-sampler with sampling stride $s$ to generate a sub-sampled noisy image $\mathbf{y}_{J^{\complement}}\in\mathbb{R}^{H/s\times W/s}$ from an image $\mathbf{y}\in\mathbb{R}^{H\times W}$ is summarized as follows:

  1. Perform an inverse pixel shuffling (PS) [43] operation on the image $\mathbf{y}\in\mathbb{R}^{H\times W}$ with a stride of $s$, resulting in $\text{PS}_{s}^{-1}(\mathbf{y})\in\mathbb{R}^{H/s\times W/s\times s^{2}}$;

  2. For the $(i,j)$-th location of the sub-sampled image $\mathbf{y}_{J^{\complement}}$, randomly select one element from $\text{PS}_{s}^{-1}(\mathbf{y})_{ij}$;

  3. Generate a binary mask $\mathbf{m}_{J}\in\mathbb{R}^{H\times W}$ to select the unsampled pixels in the original image $\mathbf{y}$. The mask $\mathbf{m}_{J}$ is set to 0 at locations where an element from $\mathbf{y}$ is selected, and 1 otherwise.

Following this process, the sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ with dimensions of $H/s\times W/s$ is obtained. The binary mask $\mathbf{m}_{J}$ is also generated to identify the unsampled pixels in $\mathbf{y}$. Fig. 2 illustrates the workflow of the Random Sub-sampler, demonstrating the generation of a sub-sampled image from an input image of size $4\times 4$ with a stride of $s=2$.
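The three steps above can be sketched in NumPy as follows. This is an illustrative reading of the procedure rather than the authors' code: the function name `random_subsample` is hypothetical, the image is assumed to be single-channel, and $H$ and $W$ are assumed to be divisible by $s$.

```python
import numpy as np

def random_subsample(y, s, rng):
    """Return (y_sub, mask) for a 2-D image y whose sides are divisible by s."""
    H, W = y.shape
    h, w = H // s, W // s
    # Step 1: inverse pixel shuffle, grouping each s x s cell into s^2 values.
    cells = y.reshape(h, s, w, s).transpose(0, 2, 1, 3).reshape(h, w, s * s)
    # Step 2: pick one element per cell uniformly at random.
    idx = rng.integers(0, s * s, size=(h, w))
    y_sub = np.take_along_axis(cells, idx[..., None], axis=2)[..., 0]
    # Step 3: mask is 0 at sampled locations and 1 at unsampled ones.
    rows = np.arange(h)[:, None] * s + idx // s
    cols = np.arange(w)[None, :] * s + idx % s
    mask = np.ones((H, W), dtype=y.dtype)
    mask[rows, cols] = 0
    return y_sub, mask

# Example matching the 4 x 4, s = 2 setting of Fig. 2.
y = np.arange(16.0).reshape(4, 4)
y_sub, mask = random_subsample(y, 2, np.random.default_rng(0))
print(y_sub.shape, int(mask.sum()))  # one pixel sampled per 2 x 2 cell
```

With $s=2$, one quarter of the pixels form the network input and the remaining three quarters are available as supervision targets.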

Algorithm 1 Zero-shot learning for Noise2SR
Input:  An individual noisy image $\mathbf{y}$; denoising network $f_{\theta}$; Random Sub-sampler.
Output:  Well-trained denoising network $f_{\hat{\theta}}$.
1:  while not converged do
2:     Generate a random sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ and a binary mask $\mathbf{m}_{J}$ from the Random Sub-sampler;
3:     For the sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$, derive the denoised image $f_{\theta}(\mathbf{y}_{J^{\complement}})$;
4:     Select the unsampled pixels in the noisy image $\mathbf{y}$ and the corresponding pixels in $f_{\theta}(\mathbf{y}_{J^{\complement}})$:   $\mathbf{y}_{J}=\mathbf{y}\odot\mathbf{m}_{J}$,   $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}=f_{\theta}(\mathbf{y}_{J^{\complement}})\odot\mathbf{m}_{J}$;
5:     Calculate $\mathcal{L}=\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\|^{2}$;
6:     Update the denoising network $f_{\theta}$ by minimizing the objective $\mathcal{L}$.
7:  end while
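The loop in Algorithm 1 can be illustrated with a deliberately small NumPy toy. It is a sketch under strong assumptions and not the authors' implementation: the CNN $f_{\theta}$ is replaced by a hypothetical single learnable $3\times 3$ filter applied to a nearest-neighbour upsampling of the sub-sampled image, Adam is replaced by plain gradient descent, and the loss is averaged over the unsampled pixels instead of summed. The names `random_subsample`, `shifted_stack`, and `train` are illustrative.

```python
import numpy as np

def random_subsample(y, s, rng):
    """Random Sub-sampler: returns the sub-sampled image and the 0/1 mask."""
    H, W = y.shape
    h, w = H // s, W // s
    cells = y.reshape(h, s, w, s).transpose(0, 2, 1, 3).reshape(h, w, s * s)
    idx = rng.integers(0, s * s, size=(h, w))
    y_sub = np.take_along_axis(cells, idx[..., None], axis=2)[..., 0]
    mask = np.ones((H, W))
    mask[np.arange(h)[:, None] * s + idx // s,
         np.arange(w)[None, :] * s + idx % s] = 0
    return y_sub, mask

def shifted_stack(img):
    """Stack the nine 3x3-neighbourhood shifts of img (edge-padded): (9, H, W)."""
    H, W = img.shape
    p = np.pad(img, 1, mode="edge")
    return np.stack([p[u:u + H, v:v + W] for u in range(3) for v in range(3)])

def train(y, s=2, steps=300, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(9)                       # toy f_theta: one 3x3 filter
    losses = []
    for _ in range(steps):
        y_sub, mask = random_subsample(y, s, rng)                 # step 2
        feats = shifted_stack(np.kron(y_sub, np.ones((s, s))))    # upsample to H x W
        pred = np.tensordot(theta, feats, axes=1)                 # step 3
        resid = (pred - y) * mask                                 # step 4
        loss = np.sum(resid ** 2) / mask.sum()                    # step 5 (mean form)
        grad = 2.0 * np.tensordot(feats, resid, axes=([1, 2], [0, 1])) / mask.sum()
        theta -= lr * grad                                        # step 6
        losses.append(loss)
    return theta, losses

# Hypothetical smooth test image with additive Gaussian noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0 * np.pi, 32)
clean = 0.5 + 0.4 * np.outer(np.cos(t), np.sin(t))
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)
theta, losses = train(noisy)
print(f"first loss {losses[0]:.4f}, mean of last 20 losses {np.mean(losses[-20:]):.4f}")
```

Note that the prediction at an unsampled pixel depends only on sampled pixels, so the target noise never leaks into the input, which is the property the derivation in Section III-B relies on.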

The stride $s$ determines the relative sub-sampling interval and the sampling ratio of noisy pixels, both of which affect the training of Noise2SR. In Section IV-F, we conduct a comprehensive study to evaluate the influence of the sampling stride $s$ of the Random Sub-sampler. We also compare the performance of fixed-location and random-location sub-sampling strategies.

III-D Optimizing Network

Given a noisy image $\mathbf{y}\in\mathbb{R}^{H\times W}$, we parameterize Noise2SR as a CNN-based denoising and SR function $f_{\theta}:\mathbb{R}^{|J^{\complement}|}\to\mathbb{R}^{H\times W}$, where the parameters $\theta$ are the weights of the network and $|J^{\complement}|$ is the number of pixels in the sub-sampled noisy image. Noise2SR takes the sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ as input and outputs a prediction of the denoised image $\hat{\mathbf{x}}=f_{\theta}(\mathbf{y}_{J^{\complement}})\in\mathbb{R}^{H\times W}$.

Based on the proof provided in Section III-B, we optimize the network by minimizing a loss function that compares the unsampled pixels of the original noisy image, denoted as $\mathbf{y}_{J}$, with the corresponding pixels in the network output, denoted as $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}$. To facilitate the selection of these pixels, we utilize the binary mask $\mathbf{m}_{J}$. Specifically, the unsampled pixels of the original image are obtained by element-wise multiplication with the mask, i.e., $\mathbf{y}_{J}=\mathbf{y}\odot\mathbf{m}_{J}$. Similarly, the corresponding pixels in the network output are obtained as $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}=f_{\theta}(\mathbf{y}_{J^{\complement}})\odot\mathbf{m}_{J}$. Here, the symbol $\odot$ represents the Hadamard product.
Thus, the Noise2SR network can learn a denoising and SR function by minimizing the self-supervised loss

$$\underset{\theta}{\arg\min}\sum_{n=1}^{N}\sum_{J\in\mathcal{J}}\mathcal{L}\left(f_{\theta}(\mathbf{y}_{J^{\complement}})_{J},\mathbf{y}_{J}\right), \tag{7}$$

where $\mathcal{J}$ represents the set of partitions used during the training stage, and $N$ denotes the number of images in the dataset.
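The masked loss for a single partition can be written in a few lines. This is a minimal sketch assuming the squared $\ell_2$ form used in Algorithm 1; the function name `masked_l2_loss` and the toy arrays are hypothetical.

```python
import numpy as np

def masked_l2_loss(pred, y, mask):
    """Squared L2 distance restricted to the unsampled pixels (mask == 1)."""
    diff = (pred - y) * mask          # Hadamard product keeps only pixels in J
    return float(np.sum(diff ** 2))

# Toy check: a prediction that is off by 1 everywhere, with 12 unsampled pixels.
y = np.arange(16.0).reshape(4, 4)
pred = y + 1.0
mask = np.ones((4, 4))
mask[::2, ::2] = 0                    # pretend the top-left pixel of each cell was sampled
loss = masked_l2_loss(pred, y, mask)
print(loss)
```

Multiplying both the prediction and the target by the mask, as in the text, is equivalent to masking their difference, which is what the function does.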

III-E Clean Image Restoration with MMSE Estimation

The pipeline of clean image reconstruction using the well-trained Noise2SR model is shown in Fig. 1.B. For a randomly sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$, the well-trained Noise2SR can generate a plausible denoised result $\hat{\mathbf{x}}=f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})$. Since the sub-sampled image is generated from the original noisy image randomly, the clean image can be expressed as:

$$\mathbf{x}=\mathbb{E}\left[f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})\right]=\sum_{J\in\mathcal{J}}f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})\,p(\mathbf{x}|\mathbf{y}_{J^{\complement}}), \tag{8}$$

where $\mathcal{J}$ is the set of possible Random Sub-sampler partitions. Owing to the randomness of the sub-sampling operation, $|\mathcal{J}|$ is very large and $p(\mathbf{x}|\mathbf{y}_{J^{\complement}})$ is undetermined, making it challenging to compute $\mathbb{E}(\hat{\mathbf{x}})$ precisely.

Thus, we propose to use MMSE estimation to approximate $\mathbb{E}(\hat{\mathbf{x}})$ of the clean signal. Given a set of sub-sampled noisy images $\mathbf{y}_{J^{\complement}}$, we can approximate the MMSE estimate of the clean image by averaging all the plausible denoised results $f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})$. Specifically, we first randomly sub-sample the noisy image $M$ times to obtain a set of sub-sampled noisy images $\mathcal{Y}=\left\{\mathbf{y}_{J_{1}^{\complement}},\dots,\mathbf{y}_{J_{M}^{\complement}}\right\}$.
For each sub-sampled noisy image $\mathbf{y}_{J_{m}^{\complement}}$, the well-trained Noise2SR takes it as input and generates the corresponding denoising and SR result $f_{\hat{\theta}}\left(\mathbf{y}_{J_{m}^{\complement}}\right)$. Finally, the approximate MMSE estimate of the clean image is computed as:

$$\mathbf{x}\approx\frac{1}{M}\sum_{m=1}^{M}f_{\hat{\theta}}\left(\mathbf{y}_{J_{m}^{\complement}}\right), \tag{9}$$

where $J_{m}^{\complement}$ denotes the sub-sampling partition drawn at the $m$-th time.
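The averaging in Eq. (9) can be sketched as follows. The trained network is not available here, so a hypothetical stand-in `nearest_upsample_x2` (nearest-neighbour upsampling) plays the role of $f_{\hat{\theta}}$; the function names and the default `M=50` (the value reported in the implementation details) are the only settings used, and everything else is an assumption for illustration.

```python
import numpy as np

def random_subsample(y, s, rng):
    """Pick one random pixel from every s x s cell of a 2-D image."""
    H, W = y.shape
    h, w = H // s, W // s
    cells = y.reshape(h, s, w, s).transpose(0, 2, 1, 3).reshape(h, w, s * s)
    idx = rng.integers(0, s * s, size=(h, w))
    return np.take_along_axis(cells, idx[..., None], axis=2)[..., 0]

def mmse_estimate(model, y, s=2, M=50, seed=0):
    """Average model outputs over M random sub-samplings of y, as in Eq. (9)."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(y.shape, dtype=np.float64)
    for _ in range(M):
        acc += model(random_subsample(y, s, rng))
    return acc / M

def nearest_upsample_x2(y_sub):
    """Hypothetical stand-in for the trained network: 2x nearest-neighbour upsampling."""
    return np.kron(y_sub, np.ones((2, 2)))

noisy = np.random.default_rng(0).normal(0.5, 0.1, size=(8, 8))
estimate = mmse_estimate(nearest_upsample_x2, noisy, s=2, M=50)
print(estimate.shape)
```

Each forward pass sees only a quarter of the pixels when $s=2$, so averaging over many partitions lets every noisy pixel contribute to the final estimate.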

Figure 3: The architecture of Noise2SR used for parameterizing the super-resolved denoising function $f_{\theta}$, which consists of the U-Net based Encoder and the Super-resolved Decoder.

III-F Network Architecture

As shown in Fig. 3, the network of Noise2SR $f_{\theta}$ consists of a U-Net based Encoder and a Super-resolved Decoder.

III-F1 U-Net based Encoder

We adopt the U-Net architecture of [25] as the encoder and modify its last convolution block to generate a feature map with 128 channels for the Super-resolved Decoder.

III-F2 Super-resolved Decoder

The Super-resolved Decoder takes the feature map from the encoder as input and generates the denoised image prediction. It is comprised of an Upsampling layer followed by three $1\times 1$ convolution layers. In this work, we employ pixel-shuffling [43] as the super-resolution strategy for the Upsampling layer. In Section IV-F, we conduct an ablation study to evaluate the impact of different super-resolution strategies on denoising performance.
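The pixel-shuffling upsampling step can be illustrated in NumPy. This is a sketch under assumptions rather than the authors' decoder: it follows the common channel-to-space layout in which a $(C s^{2})\times h\times w$ feature map becomes $C\times hs\times ws$, and it shows only one hypothetical $1\times 1$ convolution, since the channel widths of the three layers are not stated in the text. The names `pixel_shuffle` and `conv1x1` are illustrative.

```python
import numpy as np

def pixel_shuffle(feat, s):
    """Rearrange (C*s*s, h, w) features into (C, h*s, w*s)."""
    c_in, h, w = feat.shape
    c = c_in // (s * s)
    out = feat.reshape(c, s, s, h, w)         # axes: [c, di, dj, i, j]
    out = out.transpose(0, 3, 1, 4, 2)        # axes: [c, i, di, j, dj]
    return out.reshape(c, h * s, w * s)

def conv1x1(feat, weight, bias):
    """1x1 convolution: a per-pixel linear map over the channel axis."""
    return np.einsum("oc,chw->ohw", weight, feat) + bias[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((128, 8, 8))       # 128-channel encoder output (toy spatial size)
up = pixel_shuffle(feat, 2)                   # 32 channels at twice the resolution
weight = rng.standard_normal((1, 32)) * 0.1   # hypothetical 32 -> 1 projection
out = conv1x1(up, weight, np.zeros(1))
print(up.shape, out.shape)
```

With a stride of $s=2$, the 128-channel feature map yields 32 channels at the original image resolution, after which $1\times 1$ convolutions only mix channels and leave the spatial layout untouched.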

IV Experiments and Results

Figure 4: Comparison of Noise2SR and other image denoising methods on simulated Pt/CeO$_2$ catalyst corrupted with Poisson-Gaussian noise $(a=0.05, b=0.02)$. The second and fourth rows display the corresponding error maps of the denoised results. * indicates that the dataset-based self-supervised denoising method was performed in a zero-shot learning manner.
Figure 5: Comparison of intensity profiles on the surface atomic columns for the denoised results obtained using FBI-Denoiser (FBI-D) and Noise2SR (N2SR) on simulated Pt/CeO$_2$ catalyst corrupted with Poisson-Gaussian noise $(a=0.05, b=0.02)$, alongside the corresponding ground truth data.

In this section, we aim to demonstrate the feasibility and effectiveness of the proposed Noise2SR framework. First, we conduct an experiment to evaluate the influence of sub-sampling on clean/noisy natural and HREM images. Then, we conduct two noise removal experiments comparing Noise2SR with nine other denoising methods: 1) synthetic Poisson-Gaussian noise removal on simulated TEM datasets, and 2) real noise removal on scanning transmission electron microscopy (STEM) imaging data. Moreover, we perform a comprehensive ablation study to analyze the impact of various factors, such as the sub-sampling pattern, the number of samples used for MMSE estimation of the clean signal, and the up-sampling strategy adopted in the Super-resolved Decoder.

IV-A Compared Methods and Metrics

IV-A1 Compared Methods

For comprehensive comparisons, we compare the proposed Noise2SR with nine methods, which are divided into four groups: 1) non-learning methods: adaptive Wiener filtering [44] and BM3D [45]; 2) zero-shot self-supervised learning (ZS-SSL) methods: Self2Self (S2S) [15], FBI-Denoiser (FBI-D) [17], and ZS-Noise2Noise (ZS-N2N) [22]; 3) dataset-based self-supervised learning (DS-SSL) methods: Noise2Self (N2S) [11], Neighbor2Neighbor (NB2NB) [19], and Noise2Noise (N2N) [25]; 4) supervised deep learning methods: simulation-based denoising (SBD) [41]. Note that N2S and NB2NB can also be trained in a zero-shot manner; we denote these variants as N2S* and NB2NB*, respectively.

IV-A2 Implementation Details

For the training of the Noise2SR model, each iteration begins by randomly cropping a patch of size $256\times 256$ from the original image. Subsequently, we utilize the Random Sub-sampler with sampling stride $s=2$ to generate a sub-sampled noisy image, which serves as the input to the network. To optimize the model, we employ the Adam optimizer [46] with the following hyper-parameters: $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-8}$. The initial learning rate is set to $10^{-4}$. The batch size is set to 12, and the training process consists of 1500 epochs. During the inference stage, we sub-sample the noisy image 50 times to generate an approximate MMSE denoised result.

For the compared DL-based methods, we adhere to the same network architecture as described in their respective original papers. Specifically, for the SBD method, we employ the U-Net architecture [25], also referred to as small-Unet in the original paper. All experiments were conducted on a server equipped with Python 3.7.3, PyTorch 1.3, and NVIDIA TITAN GPUs.

TABLE I: Quantitative results (PSNR/SSIM) of compared denoising methods and Noise2SR on simulated Pt/CeO$_2$ TEM data corrupted with Poisson-Gaussian noise. * indicates that the dataset-based self-supervised denoising methods are performed in a zero-shot learning manner. The best performance among non-learning and zero-shot learning methods is highlighted in bold, while the second-best performance is underlined.

| Category | Method | PSNR ($a=0.1$, $b=0.02$) | SSIM ($a=0.1$, $b=0.02$) | PSNR ($a=0.05$, $b=0.02$) | SSIM ($a=0.05$, $b=0.02$) | PSNR ($a=0.02$, $b=0.02$) | SSIM ($a=0.02$, $b=0.02$) |
|---|---|---|---|---|---|---|---|
| | Noisy | 4.57±2.67 | 0.0120±0.01 | 7.54±2.67 | 0.0232±0.01 | 11.40±2.67 | 0.0526±0.02 |
| Non-learning | Adaptive Wiener [44] | 20.70±1.90 | 0.6093±0.08 | 22.96±1.97 | 0.7398±0.07 | 25.66±1.95 | 0.8656±0.03 |
| | VST+BM3D [45] | 22.30±2.34 | 0.5429±0.10 | 25.33±2.33 | 0.7266±0.08 | 29.97±2.33 | 0.8659±0.05 |
| ZS-SSL | Neighbour2Neighbour* [19] | 24.56±2.48 | 0.6983±0.10 | 26.32±2.61 | 0.7451±0.12 | 28.44±2.87 | 0.7915±0.09 |
| | Noise2Self* [11] | 28.51±2.99 | 0.8817±0.06 | 29.76±2.36 | 0.9056±0.04 | 31.68±2.69 | 0.9207±0.06 |
| | Self2Self [15] | 25.36±2.79 | 0.7423±0.13 | 27.77±2.93 | 0.8353±0.07 | 31.05±2.50 | 0.9076±0.04 |
| | FBI-D [17] | 28.17±2.73 | 0.7916±0.09 | 31.23±2.74 | 0.9231±0.04 | 34.23±2.74 | 0.9541±0.04 |
| | ZS-N2N [22] | 21.95±2.00 | 0.5776±0.07 | 24.34±2.31 | 0.6736±0.08 | 27.84±2.62 | 0.7910±0.07 |
| | Noise2SR (w/o MMSE) (Ours) | 28.84±2.95 | 0.9296±0.05 | 30.16±3.08 | 0.9532±0.04 | 34.57±2.46 | 0.9788±0.01 |
| | Noise2SR (w/ MMSE) (Ours) | 31.68±2.68 | 0.9656±0.02 | 33.66±2.68 | 0.9772±0.01 | 36.35±2.46 | 0.9873±0.01 |
| DS-SSL | Noise2Self [11] | 31.62±2.77 | 0.9642±0.02 | 33.50±2.58 | 0.9734±0.01 | 35.68±2.91 | 0.9848±0.01 |
| | Neighbour2Neighbour [19] | 28.80±3.10 | 0.9305±0.03 | 30.75±2.37 | 0.9530±0.02 | 33.21±2.72 | 0.9674±0.01 |
| | Noise2Noise [25] | 32.73±2.97 | 0.9790±0.01 | 34.11±2.88 | 0.9837±0.01 | 36.30±2.78 | 0.9888±0.01 |
| Supervised | SBD [41] | 33.12±2.79 | 0.9799±0.01 | 34.25±3.09 | 0.9838±0.01 | 36.52±2.90 | 0.9891±0.01 |

IV-A3 Evaluation Metrics

We calculate the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [47] to measure the performance of the compared methods quantitatively. PSNR is defined based on pixel-by-pixel distance, whereas SSIM measures structural similarity using the means, variances, and covariance of the images.
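For reference, PSNR follows the standard definition $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$. The sketch below is a minimal NumPy version; the function name `psnr` and the default `data_range=1.0` (images scaled to $[0,1]$) are assumptions, since the text does not state the intensity range used in the evaluation.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)           # pixel-by-pixel squared distance
    if mse == 0:
        return float("inf")                   # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
value = psnr(np.zeros((16, 16)), np.full((16, 16), 0.1))
print(round(value, 2))
```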

IV-B Datasets

Simulated TEM Datasets

IV-B1 Simulated TEM Datasets

Based on the simulated TEM dataset (https://github.com/sreyas-mohan/electron-microscopy-denoising) released by Mohan et al. [41], which consists of approximately 18000 simulated images of Pt/CeO$_2$ catalyst with various combinations of imaging parameters, we select 1000 images as the training set and 5 images as the test set. The training set is used to train the dataset-based methods, whose performance is then evaluated on the test set. In contrast, zero-shot methods train a separate model for each image in the test set.

IV-B2 Noise simulation

HREM images are affected by mixed Poisson-Gaussian noise [48, 49, 50], which combines the effects of dark noise and photon noise (Poisson noise) with readout noise (Gaussian noise). The Poisson noise $\mathbf{n}_{p}$ depends on the signal intensity, while the Gaussian noise $\mathbf{n}_{g}$ originates from imperfections in the output amplifier during charge-to-voltage conversion. We adopt the Poisson-Gaussian model formulation described in [51]:

$$\begin{aligned}\mathbf{y}&=\mathbf{x}+\mathbf{n}_{p}(\mathbf{x})+\mathbf{n}_{g},\\ \left(\mathbf{x}+\mathbf{n}_{p}(\mathbf{x})\right)&\sim a\,\mathcal{P}\!\left(\tfrac{1}{a}\mathbf{x}\right),\\ \mathbf{n}_{g}&\sim\mathcal{N}(0,b^{2}).\end{aligned} \tag{10}$$

The expected value and the variance of the noisy measurement $\mathbf{y}$ are:

$$\mathbb{E}[\mathbf{y}]=\mathbf{x},\quad\text{Var}[\mathbf{y}]=a\mathbf{x}+b^{2}. \tag{11}$$

Thus, the noisy image corrupted with Poisson-Gaussian noise can be modeled, using the characteristic parameters $(a,b)$, as

$$\mathbf{y}=\mathbf{x}+\mathbf{n},\quad\mathbf{n}\sim\mathcal{N}(0,a\mathbf{x}+b^{2}). \tag{12}$$

Here, $a$ represents the noise level that is dependent on the signal, while $b$ represents the noise level that is independent of the signal.

To comprehensively evaluate the denoising performance of the compared methods on different levels of noise, we apply three different levels of Poisson-Gaussian noise to the TEM dataset. The noise parameters used are summarized below: $(a=0.1, b=0.02)$, $(a=0.05, b=0.02)$, and $(a=0.01, b=0.02)$.
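The noise model of Eq. (10) can be simulated directly. The sketch below is a minimal NumPy version, not the authors' data-generation script: the function name `add_poisson_gaussian_noise` is hypothetical, and the clean image is assumed to be non-negative (here a constant 0.5 purely for the sanity check against Eq. (11)).

```python
import numpy as np

def add_poisson_gaussian_noise(x, a, b, rng):
    """Sample y = a * Poisson(x / a) + N(0, b^2), following Eq. (10)."""
    x = np.asarray(x, dtype=np.float64)
    poisson_part = a * rng.poisson(x / a)                 # x + n_p(x), signal dependent
    gaussian_part = rng.normal(0.0, b, size=x.shape)      # n_g, signal independent
    return poisson_part + gaussian_part

rng = np.random.default_rng(0)
a, b = 0.05, 0.02
x = np.full(200_000, 0.5)                                 # assumed clean intensity
y = add_poisson_gaussian_noise(x, a, b, rng)

# Empirical moments should agree with Eq. (11): E[y] = x, Var[y] = a * x + b^2.
print(f"mean {y.mean():.4f} (expected 0.5)")
print(f"var  {y.var():.4f} (expected {a * 0.5 + b ** 2:.4f})")
```

Larger values of $a$ give fewer effective counts per pixel and hence a lower SNR, which is why $a=0.1$ is the hardest of the settings considered.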

Figure 6: Comparison of Noise2SR with other image denoising methods on real low-dose Te STEM images. The last column displays the corresponding Te imaging under high-dose conditions for reference. * indicates that the dataset-based self-supervised denoising method was performed in a zero-shot learning manner.

Real STEM Data

Samples of Te crystals, ZSM-5 (MFI framework), and MOR zeolite (http://asia.iza-structure.org/IZA-SC/ftc_table.php) were all prepared in the same way. The crystals were first crushed with a mortar and pestle, and the powders were then dispersed in ethanol under ultrasonication. A few drops of the suspension were placed onto holey carbon copper grids to be further checked by TEM. The real annular dark field scanning transmission electron microscopy (ADF-STEM) images were all collected with a GrandARM 300F (JEOL Ltd.) transmission electron microscope operated at 300 kV. The crystals were first tilted to a specific zone axis under TEM mode using the selected area electron diffraction (SAED) patterns, and the microscope was then switched to STEM mode to record the experimental ADF-STEM images.

Figure 7: Comparison of other image denoising methods with Noise2SR on real STEM data. The top two rows display the denoising results of ZSM-5 zeolite, while the last two rows showcase the denoising results of MOR zeolite. The red and yellow arrows point to Silicon and Oxygen atoms, respectively. * indicates that the dataset-based self-supervised denoising method was performed in a zero-shot learning manner.

IV-C Comparisons on Simulated TEM Datasets

Table I shows the quantitative results. Compared with other zero-shot self-supervised learning denoising methods, Noise2SR achieves the best denoising performance. Compared with dataset-based self-supervised denoising methods and supervised denoising, Noise2SR achieves comparable performance.

Fig. 4 shows the qualitative results on two test samples of the simulated noisy Pt/CeO$_2$ catalyst TEM images. Compared with zero-shot denoising methods, Noise2SR precisely restores the nanoparticle structures and achieves a cleaner background. Moreover, Noise2SR exhibits fewer additional structural patterns in the error maps compared to dataset-based self-supervised denoising methods and supervised denoising methods. In Fig. 5, we compare the intensity profiles on the surface atomic columns for the denoised data obtained from FBI-D and Noise2SR, along with the ground truth. The intensity profiles of both denoised data show a similar overall trend to the ground truth. Compared to FBI-D, Noise2SR exhibits a closer resemblance to the ground truth in terms of both peaks and troughs, with fewer fluctuations.

IV-D Comparisons on Real STEM Data

IV-D1 Low-Dose Te Crystal STEM Data

In Fig. 6, we present the qualitative results obtained from denoising real low-dose Te crystal STEM data using different methods. The corresponding high-dose Te crystal STEM data is also included for reference. Fig. 6 shows two different areas of the low-dose and high-dose STEM images of the same Te crystal. The bright contrast corresponding to each Te atom can be observed, but it is less clear in the low-dose image than in the high-dose image because of the low signal-to-noise ratio (SNR). After denoising the low-dose STEM images with different methods, it can be observed that conventional non-learning methods (e.g., BM3D, Wiener filter) tend to produce over-smoothed results, making it difficult to distinguish the contours of the atoms. In contrast, DL-based denoising methods exhibit good atom contrast, so that each Te atom can be clearly resolved. Notably, compared to other zero-shot denoising methods, the denoised results of Noise2SR exhibit reduced noise and clearer atom contours.

Figure 8: Comparison of fixed-location and random-location sub-sampling in the Sub-sampler with sampling stride $s=2$.

IV-D2 ZSM-5 & MOR Zeolites STEM Data

To further test the method, two kinds of zeolites, ZSM-5 and MOR, were imaged under very low-dose conditions and then denoised. Zeolites are an important class of nanoporous materials widely used in catalysis, separation, etc., and they are extremely sensitive to electron beams. Low electron dose conditions were therefore applied, resulting in STEM images with a low SNR, which hinders the visualization of the atomic structure.

In Fig. 7, we present the qualitative results obtained from denoising ZSM-5 and MOR zeolite STEM data using different methods. Conventional non-learning methods yield blurred denoising results that fail to capture the atomic structure clearly. Among the zero-shot denoising methods, S2S introduces artifacts resembling salt-and-pepper noise, while NB2NB yields an output blurred by artifacts. In contrast, both FBI-D and Noise2SR clearly represent the atomic structure. However, in the zoomed-in region, FBI-D exhibits some structural artifacts, whereas Noise2SR demonstrates higher contrast with fewer artifacts. In the Noise2SR result for ZSM-5 zeolite, not only is the bright and sharp contrast of the silicon (Si) atoms in the 5-, 6- and 10-membered rings (MRs) well resolved, but the weak contrast of the light oxygen (O) atoms between two Si atoms is also clearly visible. Similarly, in the Noise2SR result for MOR zeolite, the contrast of the Si atoms is clearly resolved, whereas in the raw image it is hard to identify because of the noise and low SNR. These results show the potential of Noise2SR for denoising HREM images, which helps to visualize and study atomic structures in materials science.

IV-E Effectiveness of Randomness in Sub-sampler

In this paper, we propose the Random Sub-sampler, a method for generating sub-sampled noisy images from full-size noisy images. To evaluate the effect of randomness in the sub-sampler on the denoising performance of Noise2SR, we compare fixed-location and random-location sampling strategies within the sub-sampler. Fig. 8 illustrates the difference between the fixed-location and random-location sub-sampling strategies for a sampling stride of $s=2$. Table II and Fig. 9 show the quantitative and qualitative results, respectively. The results demonstrate that random sub-sampling improves the denoising performance compared to fixed-location sub-sampling. Random sub-sampling outperforms fixed sub-sampling both for the denoised results from an individual sub-sampled image ($M=1$) and for the MMSE denoised results. In the qualitative results, the denoised images obtained from random sub-sampling exhibit clearer backgrounds and sharper atom contours than those obtained from fixed-location sub-sampling. This observation is further supported by the error maps of the denoising results.
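
The two strategies compared above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: it assumes that the sub-sampler splits the image into non-overlapping $s \times s$ cells and keeps one pixel per cell, taken either from the same position in every cell (fixed location) or from an independently drawn position in each cell (random location). The function name `sub_sample` and its parameters are hypothetical.

```python
import numpy as np

def sub_sample(img, s=2, random_location=True, rng=None):
    """Keep one pixel per s x s cell of a 2-D image (assumed sub-sampler).

    Returns the (H//s, W//s) sub-sampled image and, for each cell, the
    flat index (row * s + col) of the pixel that was kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    hs, ws = h // s, w // s
    # Rearrange the image into an (hs, ws) grid of flattened s*s cells.
    cells = (img[:hs * s, :ws * s]
             .reshape(hs, s, ws, s)
             .transpose(0, 2, 1, 3)
             .reshape(hs, ws, s * s))
    if random_location:
        # Independent random position inside every cell.
        idx = rng.integers(0, s * s, size=(hs, ws))
    else:
        # Same position (top-left pixel) in every cell.
        idx = np.zeros((hs, ws), dtype=np.int64)
    sub = np.take_along_axis(cells, idx[..., None], axis=-1)[..., 0]
    return sub, idx
```

With a fixed location, every call returns the same low-resolution image, so the network only ever sees one input. With random locations, each call yields a different low-resolution view of the same noisy image, which is the data-augmentation effect evaluated in this subsection.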

TABLE II: Comparison of fixed and random location sub-sampling strategies in the sub-sampler. Quantitative results (PSNR/SSIM) are evaluated on simulated TEM datasets with three different noise levels.
| Noise Parameters | $a=0.1$ | $a=0.05$ | $a=0.02$ |
| --- | --- | --- | --- |
| Fix ($M=1$) | 25.33/0.8732 | 27.45/0.9131 | 29.68/0.9401 |
| Random ($M=1$) | 28.44/0.9235 | 30.16/0.9532 | 32.70/0.9729 |
| Fix MMSE | 29.36/0.9443 | 31.92/0.9655 | 34.57/0.9788 |
| Random MMSE | 31.92/0.9645 | 33.44/0.9793 | 36.41/0.9880 |
Figure 9: Qualitative comparison of fixed and random location sub-sampling for denoising the simulated TEM dataset.

IV-F Ablation Study

IV-F1 Influence of Sampling stride in Random Sub-sampler

To evaluate the influence of the sampling stride $s$ on the denoising performance, we conducted experiments using the Noise2SR model with three different values of $s$. The experiments were performed on simulated TEM datasets corrupted with Poisson-Gaussian noise ($a=0.05$, $b=0.02$).
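
For readers who want to reproduce a comparable corruption, the following sketch generates Poisson-Gaussian noise. It is an illustration under stated assumptions, not the authors' simulation code: it assumes the common parameterization in which $a$ scales the signal-dependent Poisson component and $b$ is the standard deviation of the additive Gaussian component, and it assumes non-negative clean intensities. The function name `add_poisson_gaussian_noise` is hypothetical.

```python
import numpy as np

def add_poisson_gaussian_noise(clean, a=0.05, b=0.02, rng=None):
    """Corrupt a non-negative image with Poisson-Gaussian noise.

    Assumed model: noisy = a * Poisson(clean / a) + Normal(0, b^2),
    so that E[noisy] = clean and Var[noisy] = a * clean + b^2.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=np.float64)
    # Signal-dependent component: variance grows linearly with intensity.
    poisson_part = a * rng.poisson(clean / a)
    # Signal-independent component with standard deviation b.
    gaussian_part = rng.normal(0.0, b, size=clean.shape)
    return poisson_part + gaussian_part
```

Under this parameterization a larger $a$ means stronger signal-dependent noise, which is consistent with the ordering of noise levels in Table II.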

Fig. 10 presents the performance curves of Noise2SR with different sampling strides $s$. The results show that Noise2SR with a sampling stride of $s=2$ achieves the best denoising performance in terms of PSNR and SSIM, and that increasing the stride $s$ decreases denoising performance. Fig. 11 shows qualitative comparisons of the denoised results. All the denoised results closely resemble the ground truth, but the error maps reveal a decrease in denoising performance as the stride $s$ increases.

Figure 10: Performance curves of Noise2SR with different Random Sub-sampler parameters over training epochs on the simulated TEM dataset corrupted with Poisson-Gaussian noise ($a=0.05$, $b=0.02$).
Figure 11: Qualitative results of three versions of the Noise2SR model with different Random Sub-sampler strides ($s=2,3,4$) on a test sample.

IV-F2 Influence of the Sampling Times in Inference

To assess the influence of the number of sub-sampling operations $M$ on the MMSE estimation of the clean signal during the inference stage, we compared the quality of the clean image estimate for values of $M$ ranging from 1 to 50. Fig. 12 shows the variation in PSNR of the restored images as $M$ increases, together with quantitative and qualitative results of the MMSE estimate for $M=1, 5, 20,$ and $50$. As the number of samples increases, the PSNR of the estimated clean images increases while the variance decreases. Notably, the PSNR improves markedly when the number of samples increases from 1 to 5, with a gain of approximately 2 dB (31.34 vs. 33.61). From 20 to 50 samples, however, the improvement becomes marginal. Consistently, the error maps of the qualitative results show only a small difference between the estimates for $M=20$ and $M=50$.
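
The inference procedure studied here can be sketched as follows. This is a simplified illustration under two assumptions: the approximate MMSE estimate is taken to be the plain average of $M$ denoised outputs, each computed from an independent random sub-sampling of the same noisy image, and the trained network is replaced by a hypothetical placeholder (`standin_denoiser`, nearest-neighbour upsampling) so that the example runs on its own. All function names are illustrative.

```python
import numpy as np

def random_sub_sample(img, s, rng):
    """Keep one randomly located pixel per s x s cell (assumed sub-sampler)."""
    h, w = img.shape
    hs, ws = h // s, w // s
    cells = (img[:hs * s, :ws * s]
             .reshape(hs, s, ws, s)
             .transpose(0, 2, 1, 3)
             .reshape(hs, ws, s * s))
    idx = rng.integers(0, s * s, size=(hs, ws, 1))
    return np.take_along_axis(cells, idx, axis=-1)[..., 0]

def standin_denoiser(low_res, s):
    """Hypothetical placeholder for the trained network.

    It only upsamples by nearest-neighbour repetition; a real model would
    map the low-resolution noisy input to a full-size denoised image.
    """
    return np.repeat(np.repeat(low_res, s, axis=0), s, axis=1)

def mmse_estimate(noisy, denoise_fn, s=2, M=20, rng=None):
    """Average M full-size outputs from M independent random sub-samplings."""
    rng = np.random.default_rng() if rng is None else rng
    outputs = [denoise_fn(random_sub_sample(noisy, s, rng), s) for _ in range(M)]
    return np.mean(outputs, axis=0)

def psnr(ref, est, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images with the given range."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Because each pass uses a different random sub-sampling, the individual outputs differ, and averaging them reduces the sampling-induced variance. This matches the trend reported above: the gain is largest for the first few samples and flattens as $M$ grows.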

IV-F3 Influence of Encoder Networks

To further explore the influence of the encoder network on the performance of Noise2SR, we compare the U-net based encoder with two well-known transformer-based backbones Uformer [52] and Restormer [53]. Additionally, to better demonstrate the superiority of Noise2SR, we also compare the performance of Noise2Self with the three encoders and a SOTA BSN, FBI-Net [17].

In the experiments, we implement Uformer and Restormer with the same parameters as in the original papers. We evaluate their performance on two noisy samples with different PSNRs (3 dB and 10 dB) and compare their computational complexity. The quantitative results are shown in Table III. In Noise2SR, the Uformer shows a slight performance degradation compared to the U-net-based encoder, while the Restormer shows a significant improvement of approximately 1.5 dB and 2 dB on the two samples, respectively. In contrast, the transformer-based encoders in Noise2Self do not yield a clear performance gain over the U-net encoder and instead show a decrease in SSIM. For HREM images with highly structured features, CNN-based networks can achieve superior performance due to their inductive bias. One possible explanation is that the mask-based strategy in Noise2Self is not suitable for training $\mathcal{J}$-invariant denoisers with transformer encoders. Since self-attention is not $\mathcal{J}$-invariant, the transformer may overfit to mapping replacements to noisy pixels, resulting in performance degradation. Noise2SR therefore offers more flexibility in the choice of network structure for enhancing performance. It is worth noting that the sub-sampling operation significantly reduces the computational complexity when a transformer-based encoder is used.

TABLE III: Quantitative comparisons (PSNR/SSIM) of Noise2Self and Noise2SR with three distinct encoder networks and FBI-Net on two noisy samples. The computational complexity of the different networks is also compared. MACs are computed on a patch size of $128\times 128$.
| SSL Strategy | Encoder | 3 dB | 10 dB | MACs (G) |
| --- | --- | --- | --- | --- |
| Noise2Self | Unet | 25.20/0.7455 | 33.44/0.9306 | 4.61 |
| Noise2Self | Uformer | 25.56/0.6754 | 31.78/0.8899 | 10.24 |
| Noise2Self | Restormer | 26.17/0.6374 | 33.16/0.8665 | 35.21 |
| Noise2SR | Unet | 29.75/0.9595 | 35.89/0.9881 | 6.82 |
| Noise2SR | Uformer | 29.53/0.9563 | 34.98/0.9838 | 4.15 |
| Noise2SR | Restormer | 31.25/0.9630 | 37.84/0.9851 | 11.59 |
| FBI-Net | FBI-Net | 25.75/0.7434 | 34.33/0.9379 | 6.56 |

IV-F4 Influence of Different Upsampling Strategies

To evaluate the impact of different upsampling strategies on the denoising performance of Noise2SR, three versions of the Noise2SR model with different upsampling techniques are compared: 1) Transposed convolution (TransConv.); 2) Pixel shuffling (PixelShuff.); 3) Bilinear interpolation (BiInter.). The performance of these models was evaluated by denoising two noisy samples with PSNR of 3 dB and 10 dB, respectively.

Table IV shows the quantitative results. For the test sample with a PSNR of 3 dB, the denoising performance of the three upsampling methods is comparable, with Noise2SR using pixel shuffling showing a slight advantage. For the test sample with a PSNR of 10 dB, Noise2SR using transposed convolution achieves a slight improvement of at least 0.3 dB over the other two methods. It is worth noting that bilinear interpolation, which has no trainable parameters, achieves comparable performance with the added benefit of a shorter inference time. This implies that Noise2SR remains effective even when simpler upsampling techniques, such as bilinear interpolation, are employed.
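
Of the three upsampling strategies, pixel shuffling [43] is the least self-explanatory, so a short NumPy sketch of the rearrangement is given here. It follows the standard sub-pixel convention, in which a feature map with $C r^2$ channels is rearranged into $C$ channels at $r$ times the spatial resolution. It is an illustration only and is not taken from the Noise2SR code; the function name `pixel_shuffle` is chosen for this example.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).

    Output pixel (c, h*r + i, w*r + j) is read from input channel
    c*r*r + i*r + j at spatial location (h, w).
    """
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    # Split channels into (c, i, j), then interleave i with h and j with w.
    return (x.reshape(c, r, r, h, w)
             .transpose(0, 3, 1, 4, 2)
             .reshape(c, h * r, w * r))
```

The rearrangement itself has no trainable parameters; in a pixel-shuffle decoder the learning happens in the convolution that produces the $C r^2$ channels. Bilinear interpolation, by contrast, involves no learned layer at all.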

Figure 12: PSNR versus the number of sub-sampling times for MMSE estimation of the clean signal.
TABLE IV: Ablation on different upsampling strategies in the SR decoder. Quantitative results (PSNR/SSIM) are evaluated on two noisy samples with different SNRs in the simulated TEM dataset. The best performance is highlighted in bold.
| Upsampling | 3 dB | 10 dB | Time |
| --- | --- | --- | --- |
| TransConv. | 27.36/0.9292 | **36.22/0.9894** | 5.84 s |
| PixelShuff. | **27.51/0.9349** | 35.89/0.9881 | 5.48 s |
| BiInter. | 27.31/0.9325 | 35.59/0.9876 | 4.43 s |

V Conclusion

In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) method for denoising HREM images. We introduce the Random Sub-sampler to generate paired noisy images with different resolutions, preserving structural information while altering high-frequency signal-dependent noise. This enables effective noise removal through supervised training using paired noisy images. By employing a super-resolution strategy, the network generates denoised results of the original image size from sub-sampled noisy inputs. The generation of paired noisy images serves as data augmentation to mitigate overfitting in single-image denoising. Furthermore, we present an approximate MMSE estimation of the clean signal by ensembling multiple denoised results from randomly sub-sampled noisy images of a single input, which further enhances the denoising performance. The training strategy and MMSE estimation enable Noise2SR to achieve superior denoising performance using only a single ultra-low SNR HREM image. Experimental results on synthetic and real HREM data demonstrate that Noise2SR outperforms state-of-the-art ZS-SSL methods and achieves performance comparable to supervised methods.

Despite achieving state-of-the-art performance in zero-shot self-supervised denoising for HREM images, Noise2SR still has certain limitations. One limitation is that random sub-sampling causes a loss of high-frequency information in the input noisy image. Consequently, when denoising images rich in details, Noise2SR with MMSE estimation tends to produce relatively smooth results. Another limitation is that the zero-shot DL denoiser requires minutes of training time, which prevents real-time denoising. Despite these limitations, we believe that this work has the potential to improve the signal-to-noise ratio of images in materials imaging domains.

References

  • [1] M. O’Keefe, P. Buseck, and S. Iijima, “Computed crystal structure images for high resolution electron microscopy,” Nature, vol. 274, no. 5669, pp. 322–324, 1978.
  • [2] J. C. Spence and A. V. Crewe, “Experimental high-resolution electron microscopy,” 1981.
  • [3] C. Kisielowski, B. Freitag, M. Bischoff, H. Van Lin, S. Lazar, G. Knippels, P. Tiemeijer, M. van der Stam, S. von Harrach, M. Stekelenburg et al., “Detection of single atoms and buried defects in three dimensions by aberration-corrected electron microscope with 0.5-Å information limit,” Microscopy and Microanalysis, vol. 14, no. 5, pp. 469–477, 2008.
  • [4] N. I. Kato, “Reducing focused ion beam damage to transmission electron microscopy samples,” Journal of electron microscopy, vol. 53, no. 5, pp. 451–458, 2004.
  • [5] N. Jiang, “Electron beam damage in oxides: a review,” Reports on Progress in Physics, vol. 79, no. 1, p. 016501, 2015.
  • [6] A. Faruqi and G. McMullan, “Direct imaging detectors for electron microscopy,” Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 878, pp. 180–190, 2018.
  • [7] P. Ercius, I. Johnson, H. Brown, P. Pelz, S.-L. Hsu, B. Draney, E. Fong, A. Goldschmidt, J. Joseph, J. Lee et al., “The 4D camera–an 87 kHz frame-rate detector for counted 4D-STEM experiments,” Microscopy and Microanalysis, vol. 26, no. S2, pp. 1896–1897, 2020.
  • [8] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [9] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4539–4547.
  • [10] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
  • [11] J. Batson and L. Royer, “Noise2self: Blind denoising by self-supervision,” in International Conference on Machine Learning.   PMLR, 2019, pp. 524–533.
  • [12] A. Krull, T.-O. Buchholz, and F. Jug, “Noise2void-learning denoising from single noisy images,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2129–2137.
  • [13] S. Laine, T. Karras, J. Lehtinen, and T. Aila, “High-quality self-supervised deep image denoising,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [14] S. Cha and T. Moon, “Fully convolutional pixel adaptive image denoiser,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4160–4169.
  • [15] Y. Quan, M. Chen, T. Pang, and H. Ji, “Self2self with dropout: Learning self-supervised denoising from single image,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1890–1898.
  • [16] Y. Xie, Z. Wang, and S. Ji, “Noise2same: Optimizing a self-supervised bound for image denoising,” Advances in neural information processing systems, vol. 33, pp. 20 320–20 330, 2020.
  • [17] J. Byun, S. Cha, and T. Moon, “FBI-Denoiser: Fast blind image denoiser for Poisson-Gaussian noise,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5768–5777.
  • [18] N. Moran, D. Schmidt, Y. Zhong, and P. Coady, “Noisier2noise: Learning to denoise from unpaired noisy data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 064–12 072.
  • [19] T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2neighbor: A self-supervised framework for deep image denoising,” IEEE Transactions on Image Processing, vol. 31, pp. 4023–4038, 2022.
  • [20] X. Wu, M. Liu, Y. Cao, D. Ren, and W. Zuo, “Unpaired learning of deep image denoising,” in European conference on computer vision.   Springer, 2020, pp. 352–368.
  • [21] J. Lequyer, R. Philip, A. Sharma, W.-H. Hsu, and L. Pelletier, “A fast blind zero-shot denoiser,” Nature Machine Intelligence, vol. 4, no. 11, pp. 953–963, 2022.
  • [22] Y. Mansour and R. Heckel, “Zero-shot noise2noise: Efficient image denoising without any data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 018–14 027.
  • [23] T. Pang, H. Zheng, Y. Quan, and H. Ji, “Recorrupted-to-recorrupted: Unsupervised deep learning for image denoising,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2043–2052.
  • [24] X. Tian, Q. Wu, H. Wei, and Y. Zhang, “Noise2sr: Learning to denoise from super-resolved single noisy fluorescence image,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VI.   Springer, 2022, pp. 334–343.
  • [25] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in International Conference on Machine Learning.   PMLR, 2018, pp. 2965–2974.
  • [26] W. Lee, S. Son, and K. M. Lee, “Ap-bsn: Self-supervised denoising for real-world images via asymmetric pd and blind-spot network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 725–17 734.
  • [27] J. Xu, Y. Huang, M. Cheng, L. Liu, F. Zhu, Z. Xu, and L. Shao, “Noisy-as-clean: Learning self-supervised denoising from corrupted image.” IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society, 2020.
  • [28] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9446–9454.
  • [29] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6360–6376, 2021.
  • [30] X. Pan, X. Zhan, B. Dai, D. Lin, C. C. Loy, and P. Luo, “Exploiting deep generative prior for versatile image restoration and manipulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7474–7489, 2021.
  • [31] Y. Wang, J. Yu, and J. Zhang, “Zero-shot image restoration using denoising diffusion null-space model,” in The Eleventh International Conference on Learning Representations, 2022.
  • [32] B. Li, Y. Gou, J. Z. Liu, H. Zhu, J. T. Zhou, and X. Peng, “Zero-Shot Image Dehazing,” IEEE Transactions on Image Processing, vol. 29, pp. 8457–8466, Aug. 2020.
  • [33] B. Li, Y. Gou, S. Gu, J. Z. Liu, J. T. Zhou, and X. Peng, “You Only Look Yourself: Unsupervised and Untrained Single Image Dehazing Neural Network,” International Journal of Computer Vision, pp. 1–14, Mar. 2021.
  • [34] Y.-J. Chen, Y.-J. Chang, S.-C. Wen, Y. Shi, X. Xu, T.-Y. Ho, Q. Jia, M. Huang, and J. Zhuang, “Zero-shot medical image artifact reduction,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI).   IEEE, 2020, pp. 862–866.
  • [35] J. M. Ede, “Deep learning in electron microscopy,” Machine Learning: Science and Technology, vol. 2, no. 1, p. 011004, 2021.
  • [36] S. V. Kalinin, C. Ophus, P. M. Voyles, R. Erni, D. Kepaptsoglou, V. Grillo, A. R. Lupini, M. P. Oxley, E. Schwenker, M. K. Chan et al., “Machine learning in scanning transmission electron microscopy,” Nature Reviews Methods Primers, vol. 2, no. 1, p. 11, 2022.
  • [37] T. M. Quan, D. G. C. Hildebrand, K. Lee, L. A. Thomas, A. T. Kuan, W.-C. A. Lee, and W.-K. Jeong, “Removing imaging artifacts in electron microscopy using an asymmetrically cyclic adversarial network without paired training data,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).   IEEE, 2019, pp. 3804–3813.
  • [38] F. Wang, T. R. Henninen, D. Keller, and R. Erni, “Noise2atom: unsupervised denoising for scanning transmission electron microscopy images,” Applied Microscopy, vol. 50, no. 1, pp. 1–9, 2020.
  • [39] X. Chong, M. Cheng, W. Fan, Q. Li, and H. Leung, “M-denoiser: Unsupervised image denoising for real-world optical and electron microscopy data,” Computers in Biology and Medicine, vol. 164, p. 107308, 2023.
  • [40] R. Lin, R. Zhang, C. Wang, X.-Q. Yang, and H. L. Xin, “Temimagenet training library and atomsegnet deep-learning models for high-precision atom segmentation, localization, denoising, and deblurring of atomic-resolution images,” Scientific reports, vol. 11, no. 1, p. 5386, 2021.
  • [41] S. Mohan, R. Manzorro, J. L. Vincent, B. Tang, D. Y. Sheth, E. P. Simoncelli, D. S. Matteson, P. A. Crozier, and C. Fernandez-Granda, “Deep denoising for scientific discovery: A case study in electron microscopy,” IEEE Transactions on Computational Imaging, vol. 8, pp. 585–597, 2022.
  • [42] Y. Zhou, J. Jiao, H. Huang, Y. Wang, J. Wang, H. Shi, and T. Huang, “When AWGN-based denoiser meets real noises,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13 074–13 081.
  • [43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.
  • [44] J. S. Lim, “Two-dimensional signal and image processing,” Englewood Cliffs, 1990.
  • [45] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
  • [46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [47] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [48] A. K. Boyat and B. K. Joshi, “A review paper: Noise models in digital image processing,” Signal & Image Processing, vol. 6, no. 2, p. 63, 2015.
  • [49] P. Sarder and A. Nehorai, “Deconvolution methods for 3-d fluorescence microscopy images,” IEEE signal processing magazine, vol. 23, no. 3, pp. 32–45, 2006.
  • [50] W. Meiniel, J.-C. Olivo-Marin, and E. D. Angelini, “Denoising of microscopy images: a review of the state-of-the-art, and a new sparsity-based method,” IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3842–3856, 2018.
  • [51] N. Bähler, M. El Helou, É. Objois, K. Okumuş, and S. Süsstrunk, “PoGaIN: Poisson-Gaussian image noise modeling from paired samples,” IEEE Signal Processing Letters, vol. 29, pp. 2602–2606, 2022.
  • [52] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17 683–17 693.
  • [53] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5728–5739.