Zero-Shot Image Denoising for High-Resolution Electron Microscopy

Xuanyu Tian, Zhuoya Dong, Xiyue Lin, Yue Gao, Hongjiang Wei, Yanhang Ma, Jingyi Yu, and Yuyao Zhang

Corresponding authors: Jingyi Yu; Yuyao Zhang. Xuanyu Tian is with the School of Information Science and Technology, ShanghaiTech University, 201210, Shanghai, China and Lingang Laboratory, 20031, Shanghai, China (e-mail: tianxy@shanghaitech.edu.cn). Xiyue Lin, Jingyi Yu and Yuyao Zhang are with the School of Information Science and Technology, ShanghaiTech University, 201210, Shanghai, China (e-mail: linxy2022@shanghaitech.edu.cn; yujingyi@shanghaitech.edu.cn; zhangyy8@shanghaitech.edu.cn). Zhuoya Dong and Yanhang Ma are with the School of Physical Science and Technology, ShanghaiTech University, 201210, Shanghai, China (e-mail: dongzhy@shanghaitech.edu.cn; mayh2@shanghaitech.edu.cn). Yue Gao is with the BNRist, THUIBCS, KLISS, BLBCI, School of Software, Tsinghua University, Beijing 100084, China (e-mail: gaoyue@tsinghua.edu.cn). Hongjiang Wei is with the School of Biomedical Engineering and Institute of Medical Robotics, Shanghai Jiao Tong University, 200127 Shanghai, China (e-mail: hongjiang.wei@sjtu.edu.cn). The code for this work is available at https://github.com/MeijiTian/ZS-Denoisier-HREM

Abstract

High-resolution electron microscopy (HREM) is a powerful tool for directly visualizing a broad range of materials in real space. However, denoising HREM images is challenging because of their ultra-low signal-to-noise ratio (SNR) and the scarcity of available data. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy that incorporates a Random Sub-sampler module. The Random Sub-sampler is designed to generate a nearly unlimited number of noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions through an SR strategy. The SR-based training allows the network to use more pixels for supervision, and the random sub-sampling compels the network to learn continuous signals, enhancing robustness. Meanwhile, we mitigate the uncertainty caused by random sub-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With this combination of training strategy and proposed designs, Noise2SR achieves superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves denoising performance comparable to supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.

Index Terms:

Zero-shot, Electron Microscopy, Denoising, Self-supervised

I Introduction

High-resolution electron microscopy (HREM) imaging [1, 2, 3] is an indispensable tool in the fields of materials science and nanotechnology. HREM enables direct visualization of structures at the atomic level through interactions between the sample and high-energy electrons.

However, HREM is inevitably susceptible to noise owing to the inherent properties of electron beams and the detection process. For example, low-dose conditions are often used when imaging electron-beam-sensitive materials to minimize damage from the electrons [4, 5]. When capturing dynamic events at kilohertz frame rates with direct electronic detection systems [6, 7], the image is affected by severe shot noise due to the shortened exposure time. Improving the signal-to-noise ratio (SNR) is critical for HREM to enhance image quality and facilitate accurate information extraction.

Recently, data-driven methods based on deep learning (DL) [8, 9, 10] have performed favorably compared with conventional methods in image denoising. However, applying supervised DL denoising methods to electron microscopy (EM) images is challenging due to the lack of paired noisy-clean image datasets. In the absence of ground-truth images, several self-supervised image denoising methods [11, 12, 13, 14, 15, 16, 17, 18, 19] have been proposed. Some works [12, 11, 13, 17] utilize blind-spot networks (BSNs) to prevent the network from learning an identity mapping in self-supervised learning. A BSN aims to eliminate the influence of each input pixel on the corresponding output pixel to satisfy $\mathcal{J}$-invariance [11]. Noise2Void [12] and Noise2Self [11] employ a mask-based strategy, while subsequent works [14, 13, 20, 17] design tailored networks to build BSNs. The scarcity of HREM data poses challenges for BSN-based methods, whose performance tends to degrade when trained with limited data [15]. Huang et al. propose Neighbor2Neighbor (NB2NB) [19], which generates paired noisy images by sub-sampling paired neighboring pixels for self-supervised training. However, NB2NB and its variants [21, 22] fail to address the intensity gap between neighboring pixels, resulting in relatively poor performance in low-SNR scenarios. Some works [18, 23] introduce explicit noise modeling to generate paired noisy training images by adding synthetic noise. However, the noise distribution of real HREM is unknown and challenging to estimate due to the extremely low SNR of HREM images.

In this paper, we propose an efficient zero-shot self-supervised denoising framework for HREM images named Noise2SR. We train a denoising network with paired noisy images of different resolutions through a super-resolution (SR) strategy. Inspired by NB2NB [19], we introduce the Random Sub-sampler module to generate sub-sampled noisy images that form a noisy pair with the original noisy image. Unlike NB2NB, we utilize paired noisy images with different resolutions for training. We show that the sub-sampled noisy image and the original noisy image share a consistent underlying clean image. Meanwhile, we provide a theoretical proof that the Noise2SR training scheme is statistically equivalent to using a clean image for supervision. Combined with the SR-based training strategy, this addresses the coordinate mismatch and intensity gap issues present in the paired neighbor pixels of NB2NB. The proposed Random Sub-sampler helps to break the spatial correlation of real-world noise and serves as an effective data augmentation for enhancing zero-shot image denoising performance. We adopt minimum mean squared error (MMSE) estimation to mitigate the uncertainty caused by random sub-sampling and further enhance the denoising performance. With the integration of the SR training strategy, random sub-sampling and MMSE estimation, Noise2SR performs strongly in single-image denoising, particularly in scenarios with extremely low SNR such as HREM.

We conduct a series of experiments on both simulated and real HREM images to demonstrate the effectiveness and superiority of the proposed Noise2SR framework for HREM image denoising. The main contributions of this work can be summarized as follows:

1. We propose Noise2SR, which efficiently improves the signal-to-noise ratio (SNR) of single HREM images without involving any external dataset. To the best of our knowledge, Noise2SR could be one of the first zero-shot self-supervised denoising methods for HREM images.

2. We propose a novel self-supervised training scheme that incorporates SR strategies, requires no noise model assumptions, and can be combined with any network or framework.

3. We propose the Random Sub-sampler, which serves as an effective data augmentation in network training, and incorporate MMSE estimation to produce reliable denoised results in ultra-low SNR scenarios.

4. Incorporating the training scheme and designs, our method performs very favorably against state-of-the-art self-supervised denoising methods in HREM image denoising.

The quantitative and qualitative results demonstrate the effectiveness of our methods in enhancing the SNR of simulated HREM images and low-dose HREM images of two electron beam sensitive zeolites.

This work is built upon our previous work [24] and introduces several notable improvements. Firstly, we extend the previous work to address the denoising of single low-dose HREM data. Secondly, we propose an approximate MMSE estimation in the inference stage, which enhances the denoising performance and provides stability in the context of single HREM image denoising. Furthermore, we conduct a comprehensive discussion on the proposed Random Sub-sampler module and evaluate its effectiveness within the overall framework.

II Related Work

II-A Image Denoising Without Clean Signal Prior

In recent years, supervised image denoising methods based on deep neural networks, e.g., DnCNN [8], have achieved great success and outperformed conventional image denoisers. However, acquiring aligned noisy-clean images is impractical in many scientific imaging applications, such as electron microscopy, which limits the use of supervised deep learning approaches.

Noise2Noise (N2N) [25] is the first work that proposes training a deep denoiser using paired noisy images of the same scene and demonstrates that this is statistically equivalent to supervised learning. Subsequently, self-supervised denoising methods have been proposed that enable the denoiser to be trained from individual noisy images without paired noisy images. Noise2Void (N2V) [12] and Noise2Self (N2S) [11] design mask-based blind-spot networks (BSNs) for self-supervised learning. Specifically, a mask-based BSN replaces certain pixels of the input noisy image and predicts their values from neighboring pixels. However, mask-based BSNs suffer from the limited number of pixels available for supervision and from imperfect replacement strategies, leading to inefficient training and artifacts in the denoised results [13, 16]. To address these issues, subsequent works [13, 20, 14, 17, 26] design novel BSN architectures that incorporate centrally masked convolutions and dilated convolutions. However, these methods are limited by their high computational cost and inflexible network structure. Different from BSN methods, another category of self-supervised learning approaches generates approximate paired noisy images from individual noisy images. For example, Noisier2Noise [18] generates paired noisy images by adding additional noise to noisy images, and similar ideas are explored in [27, 23]. However, these methods require a known noise model for the original noisy images to guarantee denoising performance. Additionally, Neighbor2Neighbor (NB2NB) [19] sub-samples the neighboring pixels of the noisy image to generate paired noisy images. Since a gap exists between the underlying ground truths of the paired neighbor sub-sampled images, NB2NB has limited performance in low-SNR scenarios.

II-B Advances in Unsupervised and Zero-shot Image Restoration

Deep image prior (DIP) [28] first demonstrated that convolutional neural networks (CNNs) inherently learn image priors through their inductive bias. DIP proposed using early-stopping strategies for recovering corrupted images with CNNs. Subsequently, several works leverage various approaches such as denoising [29], GANs [30] or diffusion models [31] to learn the prior image distribution for zero-shot image restoration. However, these approaches require a substantial amount of data for pre-training to learn the data distribution, which is impractical in the context of HREM. Therefore, we focus on zero-shot restoration methods that train on only a single corrupted image without additional external data. Compared with dataset-based training, avoiding overfitting is significantly more challenging in zero-shot learning. In image dehazing, several works [32, 33] have achieved remarkable performance by training on a single hazy image. For zero-shot image denoising, Self2Self [15] incorporates dropout regularization into a blind-spot network framework to mitigate overfitting. FBI-Denoiser [17] showed superior performance by carefully designing the BSN. ZS-Noise2Noise [22] introduced a lightweight network to achieve comparable performance in zero-shot denoising. Chen et al. [34] proposed a zero-shot method for medical image artifact reduction.

Figure 1: Pipeline of the proposed Noise2SR framework. A. Training Phase: First, the Random Sub-sampler takes a noisy image $\mathbf{y}$ as input and generates a sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ along with the corresponding unsampled mask $\mathbf{m}_{J}$. Then, the network $f_{\theta}$ takes the sub-sampled image $\mathbf{y}_{J^{\complement}}$ as input and generates a full-resolution denoised image $f_{\theta}(\mathbf{y}_{J^{\complement}})$. The network is optimized by computing the loss between the unsampled noisy pixels $\mathbf{y}_{J}$ and the corresponding output of the network. B. Inference Phase: A sub-sampled noisy set $\mathcal{Y}$ is obtained by sub-sampling the noisy image $\mathbf{y}$ $M$ times using the Random Sub-sampler. Given the sub-sampled noisy set $\mathcal{Y}$, the well-trained network $f_{\hat{\theta}}$ generates a set of plausible denoised images $\hat{\mathcal{X}}$. Finally, the clean image is estimated by averaging the images in $\hat{\mathcal{X}}$, which approximates the MMSE estimate.

II-C Electron Microscopy (EM) Denoising

Conventional spatial-filter-based methods, such as the bilateral filter, Non-local Means and BM3D, have been applied to EM. Recently, deep-learning-based methods have been applied in EM imaging [35, 36]. Refs. [37, 38] proposed using Cycle-GAN for STEM image denoising without paired training images. Chong et al. [39] utilized paired noisy images to denoise real-world optical and electron microscopy data. Refs. [40, 41] proposed the simulation-based denoising (SBD) framework, which creates large simulated datasets to train CNNs for denoising in a supervised manner. Mohan et al. [41] introduced simulated paired clean/noisy HREM images of the same substance for various imaging parameters. However, creating large simulated datasets can be time-consuming, and the performance of such methods degrades sharply due to the domain gap between simulated and real noisy images. Thus, a zero-shot image denoising method that is robust to ultra-low SNR is needed to enhance HREM data quality effectively.

III Methodology

We introduce a novel self-supervised training framework called Noise2SR, which enables the training of a denoiser using paired noisy images with different resolutions obtained from an individual noisy observation. The effectiveness of our framework is supported by the theoretical proof of the $\mathcal{J}$ -invariant property [11]. Specifically, we introduce a novel random sub-sampling strategy to generate training image pairs of different resolutions. Subsequently, these pairs are utilized in conjunction with a super-resolution (SR) neural network. Such an approach effectively leverages the inherent relationship in signal content while minimizing noise correlation, thus significantly enhancing the efficiency of the denoiser training process. We provide a comprehensive overview of the training and inference stages within the Noise2SR framework. The overall architecture of our proposed Noise2SR framework is illustrated in Fig. 1.

III-A Related Theory Revisit

III-A1 Noise2Noise

In the absence of clean images for supervised learning, Noise2Noise (N2N) [25] proposed training a denoising network with paired noisy measurements of the same scene. Given paired noisy measurements $\{\mathbf{y}_{1},\mathbf{y}_{2}\}$, where $\mathbf{y}_{1}=\mathbf{x}+\mathbf{n}_{1}$ and $\mathbf{y}_{2}=\mathbf{x}+\mathbf{n}_{2}$, N2N demonstrates that learning the mapping between paired noisy measurements is statistically equivalent to supervised learning with clean images:

\mathbb{E}_{\mathbf{x},\mathbf{y}_{1},\mathbf{y}_{2}}\left\|f_{\theta}(\mathbf{y}_{1})-\mathbf{y}_{2}\right\|_{2}^{2}=\mathbb{E}_{\mathbf{x},\mathbf{y}_{1},\mathbf{y}_{2}}\left\|f_{\theta}(\mathbf{y}_{1})-\mathbf{x}\right\|_{2}^{2}+\sigma^{2},

(1)

where $\sigma^{2}$ is a constant equal to the variance of the noise $\mathbf{n}_{2}$ in the target measurement.

III-A2 $\mathcal{J}$ -invariant Denoiser

Under the assumption that the noise is zero-mean and pixel-wise independent, Noise2Self [11] proved that a denoising network can be trained with individual noisy measurements if the network is $\mathcal{J}$-invariant.

Definition 1 [11]. Let $\mathcal{J}$ be a partition of the dimensions $\{1,\dots,n\}$ and let $J\in\mathcal{J}$. A function $f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is $J$-invariant if $f(\mathbf{x})_{J}$ does not depend on the value of $\mathbf{x}_{J}$. It is $\mathcal{J}$-invariant if it is $J$-invariant for each $J\in\mathcal{J}$.
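Definition 1 can be illustrated with a toy function. The sketch below is only an illustration under stated assumptions: `neighbor_mean_denoiser`, the index set `J` and the test vector are hypothetical and are not part of Noise2Self or Noise2SR. The function fills the coordinates in $J$ using only values outside $J$, so perturbing $\mathbf{x}_{J}$ should leave $f(\mathbf{x})_{J}$ unchanged.

```python
import numpy as np

def neighbor_mean_denoiser(x, J):
    """Toy J-invariant function (hypothetical example).

    The output on the coordinates in J is the mean of all coordinates
    outside J, so it never reads x[J].
    """
    out = x.astype(float).copy()
    complement = np.setdiff1d(np.arange(x.size), J)
    out[J] = x[complement].mean()
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=8)
J = np.array([1, 4, 6])

# Perturb only the coordinates in J, then compare the outputs restricted to J.
x_perturbed = x.copy()
x_perturbed[J] += 100.0
same_on_J = np.allclose(
    neighbor_mean_denoiser(x, J)[J],
    neighbor_mean_denoiser(x_perturbed, J)[J],
)
print("f(x)_J unchanged after perturbing x_J:", same_on_J)
```

An identity map would fail this check, which is why directly regressing a noisy image onto itself is not a valid self-supervised objective.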

III-B Super-resolved Based Denoising Methods

We propose training a denoiser using pairs of noisy images with different resolutions, generated from an individual image. These paired images consist of a sub-sampled noisy image $S(\mathbf{y})$ and its corresponding full-resolution noisy image $\mathbf{y}$ . Following the N2N strategy, the self-supervised loss can be generally stated as

\min_{\theta}\mathbb{E}\|f_{\theta}(S(\mathbf{y}))-\mathbf{y}\|^{2}.

(2)

However, directly minimizing the loss function above may cause the network to learn an identity mapping for the noisy input pixels. Thus, we propose to use only the unsampled noisy pixels for network supervision. Under this supervision, we can theoretically prove that using noisy images for network training is equivalent to using clean signals for supervision.

Based on the $\mathcal{J}$ -invariant theory, the sub-sampling operation partitions the image with $n$ pixels into two disjoint sets denoted as $J$ and $J^{\complement}$ , where $|J|+|J^{\complement}|=n$ . Consequently, the sub-sampled noisy image $S(\mathbf{y})$ can be represented as $\mathbf{y}_{J^{\complement}}$ , while the unsampled noisy pixels $\mathbf{y}\setminus S(\mathbf{y})$ are denoted as $\mathbf{y}_{J}$ . When training a network using paired noisy images with different resolutions, the resulting network naturally becomes a $\mathcal{J}$ -invariant function. This is because the output $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}$ does not rely on the specific values of $\mathbf{y}_{J}$ .

Theorem 1. Let $\mathbf{y}=\mathbf{x}+\mathbf{n}$ be an image corrupted by zero-mean noise $\mathbf{n}$ with variance $\sigma^{2}$. Let $\mathbf{y}_{J^{\complement}}$ be the noisy image sub-sampled from $\mathbf{y}$ and $\mathbf{y}_{J}$ be the unsampled noisy pixels of the original image $\mathbf{y}$. Suppose the noise $\mathbf{n}$ is independent of the clean image $\mathbf{x}$ and is pixel-wise independent. Then it holds that:

\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}.

(3)

The proof is given below:

Proof :

\begin{aligned}
\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}
&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}-\mathbf{n}_{J}\right\|_{2}^{2}\\
&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}-2\,\mathbb{E}_{\mathbf{x},\mathbf{y}}\langle f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J},\mathbf{n}_{J}\rangle.
\end{aligned}

(4)

Because $\mathbf{n}_{J}$ is independent of both $\mathbf{y}_{J^{\complement}}$ and $\mathbf{x}_{J}$, the cross term factorizes:

\mathbb{E}_{\mathbf{x},\mathbf{y}}\langle f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J},\mathbf{n}_{J}\rangle=\left\langle\mathbb{E}_{\mathbf{x},\mathbf{y}}\left[f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right],\mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{n}_{J}]\right\rangle.

(5)

Since the noise is zero-mean, i.e., $\mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{n}_{J}]=0$, the cross term vanishes and we have:

\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}.

(6)

Since $\sigma^{2}$ is constant, Theorem 1 states that optimizing the self-supervised loss function over the proposed training scheme yields the same solutions as the supervised loss function.
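The identity in Theorem 1 can also be checked numerically. The Monte Carlo sketch below is an illustration under stated assumptions and not part of the original experiments: the clean signal `x`, the partition, and the fixed linear map `W` standing in for $f_{\theta}$ are all hypothetical, the noise is assumed Gaussian, and the squared error is averaged per pixel so that the additive constant is $\sigma^{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 16, 0.5, 200_000

x = rng.uniform(0.0, 1.0, size=n)        # hypothetical clean signal
J = np.arange(0, n, 4)                   # unsampled pixels (supervision targets)
Jc = np.setdiff1d(np.arange(n), J)       # sub-sampled pixels (network input)

# Arbitrary fixed linear map standing in for f_theta; it reads only y[Jc].
W = rng.normal(scale=0.1, size=(J.size, Jc.size))

# Zero-mean, pixel-wise independent noise, independent of x.
noise = rng.normal(scale=sigma, size=(trials, n))
y = x + noise

pred_J = y[:, Jc] @ W.T                  # f(y_Jc)_J for every trial

# Per-pixel mean squared error against noisy and clean targets.
lhs = np.mean((pred_J - y[:, J]) ** 2)
rhs = np.mean((pred_J - x[J]) ** 2) + sigma ** 2

print(f"noisy-target loss: {lhs:.4f}")
print(f"clean-target loss + sigma^2: {rhs:.4f}")
```

The two printed values should agree up to Monte Carlo error, and the agreement does not depend on how good `W` is, which is the point of the theorem: the noisy-target and clean-target objectives differ only by a constant.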

III-C Generating Sub-sampled Image Randomly

Self-supervised image denoising methods typically assume that noise is signal-independent and spatially uncorrelated. Therefore, the way a noisy image is sub-sampled is crucial for maintaining denoising performance when training with paired noisy images of different resolutions. Recently, pixel-shuffling downsampling (PD) [42, 26] has emerged as an effective technique for breaking the spatial correlation of real-world noise. Motivated by the success of PD, we introduce a Random Sub-sampler for generating sub-sampled noisy images. The Random Sub-sampler follows a sub-sampling strategy similar to PD but introduces additional randomness into the sub-sampling operation. The randomness serves as a data augmentation strategy, which helps to prevent overfitting during training.

The process of using Random Sub-sampler with sampling stride $s$ to generate a sub-sampled noisy image $\mathbf{y}_{J^{\complement}}\in\mathbb{R}^{H/s\times W/s}$ from an image $\mathbf{y}\in\mathbb{R}^{H\times W}$ is summarized as follows:

1. Perform an inverse pixel shuffling (PS) [43] operation on image $\mathbf{y}\in\mathbb{R}^{H\times W}$ with a stride of $s$, resulting in $\text{PS}_{s}^{-1}(\mathbf{y})\in\mathbb{R}^{H/s\times W/s\times s^{2}}$;

2. For the $(i,j)$-th location of the sub-sampled image $\mathbf{y}_{J^{\complement}}$, randomly select one element from ${\text{PS}_{s}^{-1}(\mathbf{y})}_{ij}$;

3. Generate a binary matrix mask $\mathbf{m}_{J}\in\mathbb{R}^{H\times W}$ to select unsampled pixels in the original image $\mathbf{y}$. The mask $\mathbf{m}_{J}$ is set to 0 at locations where an element from $\mathbf{y}$ is selected, and 1 otherwise.

Following this process, the sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ with dimensions of $H/s\times W/s$ is obtained. The binary matrix mask $\mathbf{m}_{J}$ is also generated to identify the unsampled pixels in $\mathbf{y}$ . Fig. 2 illustrates the workflow of the Random Sub-sampler, demonstrating the generation of a sub-sampled image from an input image of size $4\times 4$ with a stride of $s=2$ .
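The three steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions rather than the released implementation: the function name `random_subsample`, its signature, and the requirement that $H$ and $W$ be divisible by $s$ are choices made here for clarity.

```python
import numpy as np

def random_subsample(y, s=2, rng=None):
    """Hypothetical Random Sub-sampler sketch for a 2-D image.

    Returns (y_sub, mask): y_sub has shape (H/s, W/s); mask has shape
    (H, W) and is 0 at sampled pixels and 1 at unsampled pixels.
    Assumes H and W are divisible by s.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = y.shape
    h, w = H // s, W // s
    # Step 1: inverse pixel shuffle, giving s*s candidates per output location.
    cells = y.reshape(h, s, w, s).transpose(0, 2, 1, 3).reshape(h, w, s * s)
    # Step 2: pick one candidate independently for every (i, j) location.
    choice = rng.integers(0, s * s, size=(h, w))
    y_sub = np.take_along_axis(cells, choice[..., None], axis=-1)[..., 0]
    # Step 3: mark the picked positions, fold back to (H, W), and invert.
    picked = np.zeros((h, w, s * s), dtype=bool)
    np.put_along_axis(picked, choice[..., None], True, axis=-1)
    picked = picked.reshape(h, w, s, s).transpose(0, 2, 1, 3).reshape(H, W)
    mask = (~picked).astype(y.dtype)
    return y_sub, mask

# Example matching the 4 x 4, s = 2 setting described in the text.
y = np.arange(16.0).reshape(4, 4)
y_sub, mask = random_subsample(y, s=2, rng=np.random.default_rng(0))
print(y_sub.shape, int(mask.sum()))
```

With $s=2$, one of every four pixels is fed to the network and the remaining three are available for supervision, so a $4\times 4$ input yields a $2\times 2$ sub-sampled image and a mask with 12 ones.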

Algorithm 1 Zero-shot learning for Noise2SR

Input: An individual noisy image $\mathbf{y}$; denoising network $f_{\theta}$; Random Sub-sampler.

Output: Well-trained denoising network $f_{\hat{\theta}}$.

1: while not converged do

2: Generate a random sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ and a binary mask $\mathbf{m}_{J}$ from the Random Sub-sampler;

3: For the sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$, derive the denoised image $f_{\theta}(\mathbf{y}_{J^{\complement}})$;

4: Select the unsampled pixels in the noisy image $\mathbf{y}$ and the corresponding pixels in $f_{\theta}(\mathbf{y}_{J^{\complement}})$: $\mathbf{y}_{J}=\mathbf{y}\odot\mathbf{m}_{J}$, $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}=f_{\theta}(\mathbf{y}_{J^{\complement}})\odot\mathbf{m}_{J}$;

5: Calculate $\mathcal{L}=\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\|^{2}$;

6: Update the denoising network $f_{\theta}$ by minimizing the objective $\mathcal{L}$;

7: end while

The stride $s$ determines the relative sub-sampling interval and the sampling ratio of noisy pixels, both of which affect the training of Noise2SR. In Section IV-F, we conduct a comprehensive study to evaluate the influence of the sampling stride $s$ of the Random Sub-sampler. We also compare the performance of fixed-location and random-location sub-sampling strategies.

III-D Optimizing Network

Given a noisy image $\mathbf{y}\in\mathbb{R}^{H\times W}$, we parameterize Noise2SR as a CNN-based denoising and SR function $f_{\theta}:\mathbb{R}^{|J^{\complement}|}\to\mathbb{R}^{H\times W}$, where the parameters $\theta$ are the weights of the network and $|J^{\complement}|$ is the number of pixels in the sub-sampled noisy image. Noise2SR takes the sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$ as input and outputs a prediction of the denoised image $\hat{\mathbf{x}}=f_{\theta}(\mathbf{y}_{J^{\complement}})\in\mathbb{R}^{H\times W}$.

Based on the proof provided in Section III-B, we optimize the network by minimizing a loss function that compares the unsampled pixels of the original noisy image, denoted as $\mathbf{y}_{J}$, with the corresponding pixels in the network output, denoted as $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}$. To select these pixels, we use the binary mask $\mathbf{m}_{J}$. Specifically, the unsampled pixels of the original image are obtained by element-wise multiplication with the mask, i.e., $\mathbf{y}_{J}=\mathbf{y}\odot\mathbf{m}_{J}$. Similarly, the corresponding pixels in the network output are obtained as $f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}=f_{\theta}(\mathbf{y}_{J^{\complement}})\odot\mathbf{m}_{J}$. Here, the symbol $\odot$ represents the Hadamard product. Thus, the Noise2SR network can learn a denoising and SR function by minimizing the self-supervised loss

\underset{\theta}{\arg\min}\sum_{n=1}^{N}\sum_{J\in\mathcal{J}}\mathcal{L}(f_{\theta}(\mathbf{y}_{J^{\complement}})_{J},\mathbf{y}_{J}),

(7)

where $\mathcal{J}$ represents the set of partitions used during the training stage, and $N$ denotes the number of images in the dataset.
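The masked loss for a single partition can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the helper name `masked_mse` is hypothetical, and normalizing by the number of unsampled pixels is a choice made here (the text specifies only the squared $\ell_2$ norm over the masked pixels).

```python
import numpy as np

def masked_mse(pred, y, mask):
    """Squared error over pixels where mask == 1, averaged over those pixels.

    pred: full-resolution network output f_theta(y_Jc), shape (H, W)
    y:    original noisy image, shape (H, W)
    mask: binary mask m_J, 1 at unsampled pixels, 0 at sampled pixels
    """
    diff = (pred - y) * mask              # Hadamard product keeps only J
    return np.sum(diff ** 2) / np.sum(mask)

y = np.arange(16.0).reshape(4, 4)
mask = np.ones_like(y)
mask[::2, ::2] = 0                        # pretend these 4 pixels were sampled

# Errors at sampled (masked-out) pixels do not contribute to the loss.
pred_a = y.copy()
pred_a[::2, ::2] += 5.0
loss_a = masked_mse(pred_a, y, mask)

# A uniform offset of 2 on every pixel gives a per-pixel loss of 4.
loss_b = masked_mse(y + 2.0, y, mask)
print(loss_a, loss_b)
```

Because the sampled pixels are excluded from the loss, the network receives no reward for copying its input, which is what rules out the identity mapping discussed in Section III-B.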

III-E Clean Image Restoration with MMSE Estimation

The pipeline of clean image reconstruction using the well-trained Noise2SR model is shown in Fig. 1.B. For a random sub-sampled noisy image $\mathbf{y}_{J^{\complement}}$, the well-trained Noise2SR can generate a plausible denoised result $\hat{\mathbf{x}}=f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})$. Since the sub-sampled image is generated randomly from the original noisy image, the clean image can be expressed as:

\mathbf{x}=\mathbb{E}[f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})]=\sum_{J\in\mathcal{J}}f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})p(\mathbf{x}|\mathbf{y}_{J^{\complement}}),

(8)

where $\mathcal{J}$ is the set of possible Random Sub-sampler partitions. Owing to the randomness of the sub-sampling operation, $|\mathcal{J}|$ is very large (for stride $s$, each of the $HW/s^{2}$ locations has $s^{2}$ candidate pixels, giving $(s^{2})^{HW/s^{2}}$ partitions) and $p(\mathbf{x}|\mathbf{y}_{J^{\complement}})$ is undetermined, making it challenging to compute $\mathbb{E}(\hat{\mathbf{x}})$ exactly.

Thus, we propose to use MMSE estimation to approximate $\mathbb{E}(\hat{\mathbf{x}})$ of the clean signal. Given a set of sub-sampled noisy images $\mathbf{y}_{J^{\complement}}$, we can approximate the MMSE estimate of the clean image by averaging all the plausible denoised results $f_{\hat{\theta}}(\mathbf{y}_{J^{\complement}})$. Specifically, we first randomly sub-sample the noisy image $M$ times to obtain a sub-sampled noisy image set $\mathcal{Y}=\left\{\mathbf{y}_{J_{1}^{\complement}},\dots,\mathbf{y}_{J_{M}^{\complement}}\right\}$. For each sub-sampled noisy image $\mathbf{y}_{J_{m}^{\complement}}$, the well-trained Noise2SR takes it as input and generates the corresponding denoising and SR result $f_{\hat{\theta}}\left(\mathbf{y}_{J_{m}^{\complement}}\right)$. Finally, the approximate MMSE estimate of the clean image is computed as:

\mathbf{x}\approx\frac{1}{M}\sum_{m=1}^{M}f_{\hat{\theta}}\left(\mathbf{y}_{J_{m}^{\complement}}\right),

(9)

where $J_{m}^{\complement}$ denotes the sub-sampling partition of the $m$-th draw.
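The averaging in Eq. (9) can be sketched as below. This is an illustration under stated assumptions, not the trained model: `subsample` is a compact hypothetical re-implementation of the Random Sub-sampler, and `toy_network` (nearest-neighbour upsampling) is only a stand-in for the well-trained $f_{\hat{\theta}}$, used to show how averaging over $M$ random draws reduces the variability of a single draw.

```python
import numpy as np

def subsample(y, s, rng):
    """Compact random sub-sampler: one random pixel per s x s cell."""
    H, W = y.shape
    h, w = H // s, W // s
    cells = y.reshape(h, s, w, s).transpose(0, 2, 1, 3).reshape(h, w, s * s)
    choice = rng.integers(0, s * s, size=(h, w, 1))
    return np.take_along_axis(cells, choice, axis=-1)[..., 0]

def mmse_estimate(y, network, s=2, M=50, seed=0):
    """Average M network outputs, each from a fresh random sub-sampling of y."""
    rng = np.random.default_rng(seed)
    outputs = [network(subsample(y, s, rng)) for _ in range(M)]
    return np.mean(outputs, axis=0)

def toy_network(y_sub, s=2):
    """Stand-in for the trained network (assumption): nearest-neighbour upsampling."""
    return np.repeat(np.repeat(y_sub, s, axis=0), s, axis=1)

rng = np.random.default_rng(1)
clean = np.full((32, 32), 0.5)            # hypothetical flat clean image
noisy = clean + rng.normal(scale=0.2, size=clean.shape)

single = toy_network(subsample(noisy, 2, np.random.default_rng(0)))
averaged = mmse_estimate(noisy, toy_network, s=2, M=50)

err_single = np.mean((single - clean) ** 2)
err_avg = np.mean((averaged - clean) ** 2)
print(f"single draw MSE: {err_single:.4f}, averaged MSE: {err_avg:.4f}")
```

Even with this trivial stand-in, the averaged output has a lower error than a single draw because every noisy pixel eventually contributes to the estimate; with a trained network the same mechanism removes the dependence of the result on one particular sub-sampling pattern.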

III-F Network Architecture

As shown in Fig. 3, the network of Noise2SR $f_{\theta}$ consists of a U-Net based Encoder and a Super-resolved Decoder.

III-F1 U-Net based Encoder

We adopt the U-Net architecture of [25] as the encoder and modify its last convolution block to generate a feature map with 128 channels for the Super-resolved Decoder.

III-F2 Super-resolved Decoder

The Super-resolved Decoder takes the feature map from the encoder as input and generates the denoised image prediction. It is comprised of an Upsampling layer followed by three $1\times 1$ convolution layers. In this work, we employ pixel-shuffling [43] as the super-resolution strategy for the Upsampling layer. In Section IV-F, we conduct an ablation study to evaluate the impact of different super-resolution strategies on denoising performance.
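The pixel-shuffling rearrangement used by the Upsampling layer can be sketched in NumPy as follows. This is a sketch under stated assumptions: the function name `pixel_shuffle` is hypothetical, the channel ordering (`c * s * s + a * s + b`) follows the convention commonly used by deep learning frameworks, and the 128-channel, $s=2$ example shapes are chosen only to match the encoder output width and the stride used in this work.

```python
import numpy as np

def pixel_shuffle(features, s):
    """Rearrange a (C*s*s, h, w) feature map into a (C, h*s, w*s) map.

    Output pixel (c, i*s + a, j*s + b) is taken from input channel
    c*s*s + a*s + b at location (i, j).
    """
    c_in, h, w = features.shape
    c = c_in // (s * s)
    x = features.reshape(c, s, s, h, w)    # axes: [c, a, b, i, j]
    x = x.transpose(0, 3, 1, 4, 2)         # axes: [c, i, a, j, b]
    return x.reshape(c, h * s, w * s)

# A 128-channel low-resolution feature map becomes a 32-channel map
# at twice the spatial resolution when s = 2.
features = np.random.default_rng(0).normal(size=(128, 8, 8))
upsampled = pixel_shuffle(features, s=2)
print(upsampled.shape)
```

Pixel shuffling only moves values between the channel and spatial axes, so the subsequent $1\times 1$ convolutions operate on features that already sit on the full-resolution grid.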

IV Experiments and Results

In this section, we aim to demonstrate the feasibility and effectiveness of the proposed Noise2SR framework. First, we conduct an experiment to evaluate the influence of sub-sampling on clean/noisy natural and HREM images. Then, we conduct two noise removal experiments comparing Noise2SR with eight other denoising methods: 1) synthetic Poisson-Gaussian noise removal on simulated TEM datasets, and 2) real noise removal on scanning transmission electron microscopy (STEM) imaging data. Moreover, we perform a comprehensive ablation study to analyze the impact of various factors, such as the sub-sampling pattern, the number of samples used for MMSE estimation of the clean signal, and the up-sampling strategy adopted in the Super-resolved Decoder.

IV-A Compared Methods and Metrics

IV-A1 Compared Methods

For comprehensive comparisons, we compare the proposed Noise2SR with 8 methods, which are divided into four groups: 1) non-learning methods: Adaptive Wiener filtering [44] and BM3D [45]; 2) zero-shot self-supervised learning (ZS-SSL) methods: Self2Self (S2S) [15], FBI-Denoiser (FBI-D) [17] and ZS-Noise2Noise [22]; 3) dataset-based self-supervised learning (DS-SSL) methods: Noise2Self (N2S) [11], Neighbor2Neighbor (NB2NB) [19] and Noise2Noise (N2N) [25]; 4) supervised deep learning methods: simulation-based denoising (SBD) [41]. It should be noted that N2S and NB2NB can also be trained in a zero-shot learning manner, and we denote the corresponding extensions of these methods as N2S* and NB2NB*, respectively.

IV-A2 Implementation Details

For the training of the Noise2SR model, each iteration begins by randomly cropping a patch of size $256\times 256$ from the original image. Subsequently, we use the Random Sub-sampler with sampling stride $s=2$ to generate a sub-sampled noisy image, which serves as the input to the network. To optimize the model, we employ the Adam optimizer [46] with the following hyper-parameters: $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-8}$. The initial learning rate is set to $10^{-4}$. We set the batch size to 12, and the training process consists of 1500 epochs. During the inference stage, we sub-sample the noisy image 50 times to generate an approximate MMSE denoised result.

For the compared DL-based methods, we adhere to the same network architecture as described in their respective original papers. Specifically, for the SBD method, we employ the U-Net architecture [25], also referred to as small-Unet in the original paper. All experiments were conducted on a server equipped with Python 3.7.3, PyTorch 1.3, and NVIDIA TITAN GPUs.

TABLE I: Quantitative results (PSNR/SSIM) of compared denoising methods and Noise2SR on simulated Pt/CeO$_2$ TEM data corrupted with Poisson-Gaussian noise. * indicates that the dataset-based self-supervised denoising methods are performed in a zero-shot learning manner. The best performance among non-learning and zero-shot learning methods is highlighted in bold, while the second-best performance is underlined.

Noise Parameters		$a=0.1,b=0.02$		$a=0.05,b=0.02$		$a=0.02,b=0.02$
Category	Method	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
Noisy		4.57 $\pm$ 2.67	0.0120 $\pm$ 0.01	7.54 $\pm$ 2.67	0.0232 $\pm$ 0.01	11.40 $\pm$ 2.67	0.0526 $\pm$ 0.02
Non-Learning	Adaptive Wiener [44]	20.70 $\pm$ 1.90	0.6093 $\pm$ 0.08	22.96 $\pm$ 1.97	0.7398 $\pm$ 0.07	25.66 $\pm$ 1.95	0.8656 $\pm$ 0.03
Non-Learning	VST+BM3D [45]	22.30 $\pm$ 2.34	0.5429 $\pm$ 0.10	25.33 $\pm$ 2.33	0.7266 $\pm$ 0.08	29.97 $\pm$ 2.33	0.8659 $\pm$ 0.05
ZS-SSL	Neighbor2Neighbor* [19]	24.56 $\pm$ 2.48	0.6983 $\pm$ 0.10	26.32 $\pm$ 2.61	0.7451 $\pm$ 0.12	28.44 $\pm$ 2.87	0.7915 $\pm$ 0.09
	Noise2Self* [11]	28.51 $\pm$ 2.99	0.8817 $\pm$ 0.06	29.76 $\pm$ 2.36	0.9056 $\pm$ 0.04	31.68 $\pm$ 2.69	0.9207 $\pm$ 0.06
	Self2Self [15]	25.36 $\pm$ 2.79	0.7423 $\pm$ 0.13	27.77 $\pm$ 2.93	0.8353 $\pm$ 0.07	31.05 $\pm$ 2.50	0.9076 $\pm$ 0.04
	FBI-D [17]	28.17 $\pm$ 2.73	0.7916 $\pm$ 0.09	31.23 $\pm$ 2.74	0.9231 $\pm$ 0.04	34.23 $\pm$ 2.74	0.9541 $\pm$ 0.04
	ZS-N2N [22]	21.95 $\pm$ 2.00	0.5776 $\pm$ 0.07	24.34 $\pm$ 2.31	0.6736 $\pm$ 0.08	27.84 $\pm$ 2.62	0.7910 $\pm$ 0.07
	Noise2SR (w/o MMSE) (Ours)	28.84 $\pm$ 2.95	0.9296 $\pm$ 0.05	30.16 $\pm$ 3.08	0.9532 $\pm$ 0.04	34.57 $\pm$ 2.46	0.9788 $\pm$ 0.01
	Noise2SR (w/ MMSE) (Ours)	31.68 $\pm$ 2.68	0.9656 $\pm$ 0.02	33.66 $\pm$ 2.68	0.9772 $\pm$ 0.01	36.35 $\pm$ 2.46	0.9873 $\pm$ 0.01
DS-SSL	Noise2Self [11]	31.62 $\pm$ 2.77	0.9642 $\pm$ 0.02	33.50 $\pm$ 2.58	0.9734 $\pm$ 0.01	35.68 $\pm$ 2.91	0.9848 $\pm$ 0.01
	Neighbor2Neighbor [19]	28.80 $\pm$ 3.10	0.9305 $\pm$ 0.03	30.75 $\pm$ 2.37	0.9530 $\pm$ 0.02	33.21 $\pm$ 2.72	0.9674 $\pm$ 0.01
	Noise2Noise [25]	32.73 $\pm$ 2.97	0.9790 $\pm$ 0.01	34.11 $\pm$ 2.88	0.9837 $\pm$ 0.01	36.30 $\pm$ 2.78	0.9888 $\pm$ 0.01
Supervised	SBD [41]	33.12 $\pm$ 2.79	0.9799 $\pm$ 0.01	34.25 $\pm$ 3.09	0.9838 $\pm$ 0.01	36.52 $\pm$ 2.90	0.9891 $\pm$ 0.01

IV-A3 Evaluation Metrics

We calculate the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [47] to quantitatively evaluate the compared methods. PSNR is based on the pixel-wise mean squared error, while SSIM measures structural similarity using local means, variances, and covariances of the images.
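For concreteness, a minimal sketch of the PSNR computation is given below. It assumes images normalized to a dynamic range of 1.0 (the `data_range` argument), which is an assumption made here rather than a setting stated in the paper; the function name `psnr` is likewise illustrative.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """PSNR in dB from the pixel-wise mean squared error (illustrative)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# A constant error of 0.1 on a unit-range image gives MSE = 0.01, i.e. about 20 dB.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
print(round(float(value), 3))
```

Because PSNR is logarithmic in the MSE, a 3 dB gain corresponds to roughly halving the mean squared error.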

IV-B Datasets

IV-B1 Simulated TEM Datasets

We build on the simulated TEM dataset released by Mohan et al. [41] (https://github.com/sreyas-mohan/electron-microscopy-denoising), which consists of approximately 18,000 simulated images of a Pt/CeO$_2$ catalyst under various combinations of imaging parameters. We select 1000 images as the training set and 5 images as the test set. The training set is used to train the dataset-based methods, which are then evaluated on the test set. Zero-shot methods, in contrast, train a separate model for each image in the test set.

IV-B2 Noise simulation

HREM images are affected by mixed Poisson-Gaussian noise [48, 49, 50], which combines dark noise and photon noise (Poisson noise) with readout noise (Gaussian noise). The Poisson noise $\mathbf{n}_{p}$ depends on the signal intensity, while the Gaussian noise $\mathbf{n}_{g}$ originates from imperfections in the output amplifier during charge-to-voltage conversion. We adopt the Poisson-Gaussian model formulation described in [51]:

$$\begin{aligned}
\mathbf{y}&=\mathbf{x}+\mathbf{n}_{p}(\mathbf{x})+\mathbf{n}_{g},\\
\mathbf{x}+\mathbf{n}_{p}(\mathbf{x})&\sim a\,\mathcal{P}\left(\frac{1}{a}\mathbf{x}\right),\\
\mathbf{n}_{g}&\sim\mathcal{N}(0,b^{2}).
\end{aligned}\tag{10}$$

The expected value and the variance of the noisy measurement $\mathbf{y}$ are:

$$\mathbb{E}[\mathbf{y}]=\mathbf{x},\quad\text{Var}[\mathbf{y}]=a\mathbf{x}+b^{2}.\tag{11}$$
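These moments follow directly from the properties of the Poisson distribution. As a short worked derivation, write $\mathbf{x}+\mathbf{n}_{p}(\mathbf{x})=a\mathbf{z}$ with $\mathbf{z}\sim\mathcal{P}(\mathbf{x}/a)$, so that $\mathbb{E}[\mathbf{z}]=\text{Var}[\mathbf{z}]=\mathbf{x}/a$, and assume (as is standard, though not stated explicitly here) that $\mathbf{n}_{g}$ is independent of $\mathbf{z}$:

$$\begin{aligned}
\mathbb{E}[\mathbf{y}]&=a\,\mathbb{E}[\mathbf{z}]+\mathbb{E}[\mathbf{n}_{g}]=a\cdot\frac{\mathbf{x}}{a}+0=\mathbf{x},\\
\text{Var}[\mathbf{y}]&=a^{2}\,\text{Var}[\mathbf{z}]+\text{Var}[\mathbf{n}_{g}]=a^{2}\cdot\frac{\mathbf{x}}{a}+b^{2}=a\mathbf{x}+b^{2}.
\end{aligned}$$

For example, a pixel with clean intensity $0.5$ under $(a=0.05,b=0.02)$ has noise variance $0.05\cdot 0.5+0.02^{2}=0.0254$.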

Thus, the noisy image corrupted with Poisson-Gaussian noise can be approximated by a signal-dependent Gaussian model with parameters $(a,b)$:

$$\mathbf{y}=\mathbf{x}+\mathbf{n},\quad\mathbf{n}\sim\mathcal{N}(0,a\mathbf{x}+b^{2}).\tag{12}$$

Here, $a$ represents the noise level that is dependent on the signal, while $b$ represents the noise level that is independent of the signal.

To comprehensively evaluate the denoising performance of the compared methods at different noise levels, we apply three levels of Poisson-Gaussian noise to the TEM dataset: $(a=0.1,b=0.02)$, $(a=0.05,b=0.02)$, and $(a=0.02,b=0.02)$, matching the settings reported in Table I.
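The noise model can be simulated in a few lines. The sketch below is a minimal NumPy illustration that assumes a non-negative clean image scaled to roughly $[0,1]$; the function name `add_poisson_gaussian` and the constant test image are hypothetical and not taken from the released code.

```python
import numpy as np

def add_poisson_gaussian(x, a, b, rng=None):
    """Return y = a * Poisson(x / a) + N(0, b^2) for a clean image x >= 0."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    shot = a * rng.poisson(x / a)            # signal-dependent part: mean x, variance a*x
    read = rng.normal(0.0, b, size=x.shape)  # signal-independent part: variance b^2
    return shot + read

# Constant clean image (an assumption for illustration) at one noise setting.
x = np.full((512, 512), 0.5)
y = add_poisson_gaussian(x, a=0.05, b=0.02, rng=np.random.default_rng(0))
# Empirical moments should be close to x = 0.5 and a*x + b^2 = 0.0254.
print(y.mean(), y.var())
```

Larger values of $a$ increase the signal-dependent variance, which is why the $a=0.1$ setting is the noisiest of the three.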

Real STEM Data

Te crystals, ZSM-5 (MFI framework) zeolite, and MOR zeolite (http://asia.iza-structure.org/IZA-SC/ftc_table.php) were all prepared in the same way. The crystals were first crushed with a mortar and pestle, and the powders were then dispersed in ethanol under ultrasonication. A few drops of the suspension were placed onto holey carbon copper grids for TEM examination. All real annular dark-field scanning transmission electron microscopy (ADF-STEM) images were collected with a GrandARM 300F (JEOL Ltd.) transmission electron microscope operated at 300 kV. The crystals were first tilted to a specific zone axis in TEM mode using selected area electron diffraction (SAED) patterns, and the microscope was then switched to STEM mode to record the experimental ADF-STEM images.

IV-C Comparisons on Simulated TEM Datasets

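Table I shows the quantitative results. Among the zero-shot self-supervised denoising methods, Noise2SR with MMSE estimation achieves the best performance at all three noise levels, exceeding the best competing zero-shot method by roughly 2.1 to 3.2 dB in PSNR. Compared with dataset-based self-supervised and supervised methods, Noise2SR achieves comparable performance: its PSNR is within about 1.5 dB of the supervised SBD at the highest noise level and within 0.2 dB at the lowest.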

Fig. 4 shows the qualitative results on two test samples of the simulated noisy Pt/CeO$_2$ catalyst TEM images. Compared with zero-shot denoising methods, Noise2SR precisely restores the nanoparticle structures and achieves a cleaner background. Moreover, Noise2SR exhibits fewer additional structural patterns in the error maps compared to dataset-based self-supervised denoising methods and supervised denoising methods. In Fig. 5, we compare the intensity profiles on the surface atomic columns for the denoised data obtained from FBI-D and Noise2SR, along with the ground truth. The intensity profiles of both denoised data show a similar overall trend to the ground truth. Compared to FBI-D, Noise2SR exhibits a closer resemblance to the ground truth in terms of both peaks and troughs, with fewer fluctuations.

IV-D Comparisons on Real STEM Data

IV-D1 Low-Dose Te Crystal STEM Data

Fig. 6 presents the qualitative results of denoising real low-dose Te crystal STEM data with different methods, together with the corresponding high-dose STEM data for reference. Two different areas of the low-dose and high-dose STEM images of the same Te crystal are shown. The bright contrast corresponding to each Te atom can be observed, but it is less clear in the low-dose image than in the high-dose image because of the low signal-to-noise ratio (SNR). After denoising the low-dose STEM images, conventional non-learning methods (e.g., BM3D, Wiener filter) tend to produce over-smoothed results, making it difficult to distinguish the contours of the atoms. DL-based denoising methods, in contrast, exhibit good atom contrast so that each Te atom can be clearly resolved. Notably, compared to other zero-shot denoising methods, the results of Noise2SR exhibit less residual noise and clearer atom contours.

IV-D2 ZSM-5 & MOR Zeolites STEM Data

To further test the method, two kinds of zeolites, ZSM-5 and MOR, were imaged under very low-dose conditions and then denoised. Zeolites are important nanoporous materials widely used in catalysis, separation, and other applications, and they are extremely sensitive to electron beams. Low electron dose conditions were therefore applied, resulting in STEM images with low SNR, which hinders the visualization of the atomic structure.

Fig. 7 presents the qualitative results of denoising ZSM-5 and MOR zeolite STEM data with different methods. Conventional non-learning methods yield blurred results that fail to capture the atomic structure clearly. Among the zero-shot denoising methods, S2S introduces artifacts resembling salt-and-pepper noise, while NB2NB produces a blurred output due to artifacts. In contrast, both FBI-D and Noise2SR clearly represent the atomic structure. In the zoomed-in region, however, FBI-D exhibits some structural artifacts, whereas Noise2SR demonstrates higher contrast with fewer artifacts. In the Noise2SR result for ZSM-5 zeolite, not only is the bright, sharp contrast of the silicon (Si) atoms in the 5-, 6-, and 10-membered rings (MR) well resolved, but the weak contrast of the light oxygen (O) atoms between two Si atoms is also clearly visible. Similarly, in the Noise2SR result for MOR zeolite, the contrast of the Si atoms is clearly resolved, whereas it is hard to identify in the raw image because of the noise and low SNR. These results show the potential of Noise2SR for denoising HREM images, which helps to visualize and study atomic structures in materials science.

IV-E Effectiveness of Randomness in Sub-sampler

In this paper, we propose the Random Sub-sampler, which generates sub-sampled noisy images from full-size noisy images. To evaluate how randomness in the sub-sampler affects the denoising performance of Noise2SR, we compare fixed-location and random-location sampling strategies within the sub-sampler. Fig. 8 illustrates the difference between the two strategies for sampling stride $s=2$. Table II and Fig. 9 show the quantitative and qualitative results, respectively. Random sub-sampling outperforms fixed-location sub-sampling both for results from an individual sub-sampled image ($M=1$) and for the MMSE results, with PSNR gains of roughly 1.5 to 3.1 dB in Table II. Qualitatively, the denoised images obtained from random sub-sampling exhibit clearer backgrounds and sharper atom contours than those obtained from fixed-location sub-sampling. This observation is further supported by the error maps of the denoising results.

TABLE II: Comparison of fixed-location and random-location sub-sampling strategies in the sub-sampler. Quantitative results (PSNR/SSIM) are evaluated on simulated TEM datasets with three different noise levels.

Noise Parameters	$a=0.1$	$a=0.05$	$a=0.02$
Fixed $(M=1)$	25.33/0.8732	27.45/0.9131	29.68/0.9401
Random $(M=1)$	28.44/0.9235	30.16/0.9532	32.70/0.9729
Fixed MMSE	29.36/0.9443	31.92/0.9655	34.57/0.9788
Random MMSE	31.92/0.9645	33.44/0.9793	36.41/0.9880
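One way to build intuition for the difference between the two strategies is to count how many pixels ever serve as network input. The sketch below is only an illustration under stated assumptions: the fixed strategy is assumed to always take the top-left pixel of each $s\times s$ cell, and the helper names `input_mask` and `input_coverage` are hypothetical.

```python
import numpy as np

def input_mask(shape, s, rng=None):
    """Mark the pixel taken from each s x s cell; rng=None means a fixed (0, 0) offset."""
    hs, ws = shape[0] // s, shape[1] // s
    if rng is None:
        di = np.zeros((hs, ws), dtype=int)
        dj = np.zeros((hs, ws), dtype=int)
    else:
        di = rng.integers(0, s, size=(hs, ws))
        dj = rng.integers(0, s, size=(hs, ws))
    ii, jj = np.meshgrid(np.arange(hs), np.arange(ws), indexing="ij")
    mask = np.zeros((hs * s, ws * s), dtype=bool)
    mask[ii * s + di, jj * s + dj] = True
    return mask

def input_coverage(shape, s, draws, rng=None):
    """Fraction of pixels used as network input at least once over `draws` draws."""
    seen = np.zeros((shape[0] // s * s, shape[1] // s * s), dtype=bool)
    for _ in range(draws):
        seen |= input_mask(shape, s, rng)
    return seen.mean()

fixed = input_coverage((64, 64), s=2, draws=50)
rand = input_coverage((64, 64), s=2, draws=50, rng=np.random.default_rng(0))
print(fixed, rand)  # fixed stays at 0.25; random approaches 1.0
```

Under this reading, fixed-location sampling always presents the same quarter of the pixels as input, whereas random sampling produces a different input and target split on each draw, which is consistent with its role as data augmentation.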

IV-F Ablation Study

IV-F1 Influence of Sampling stride in Random Sub-sampler

To evaluate the influence of the sampling stride $s$ on the denoising performance, we conducted experiments using the Noise2SR model with three different values of $s$. The experiments were performed on simulated TEM data corrupted with Poisson-Gaussian noise ($a=0.05,b=0.02$).

Fig. 10 presents the performance curves of Noise2SR with different sampling strides $s$. Noise2SR with sampling stride $s=2$ achieves the best denoising performance in both PSNR and SSIM, and increasing the stride $s$ degrades performance. Fig. 11 shows qualitative comparisons of the denoised results. All denoised results closely resemble the ground truth, but the error maps reveal a decrease in denoising performance as the stride $s$ increases.

IV-F2 Influence of the Sampling Times in Inference

To assess how the number of sub-samplings $M$ affects the MMSE estimate of the clean signal at inference, we compare the quality of the clean-image estimate for $M$ ranging from 1 to 50. Fig. 12 shows how the PSNR of the restored images varies as $M$ increases, together with quantitative and qualitative results of the MMSE estimate for $M=1,5,20,$ and $50$. As $M$ increases, the PSNR of the estimated clean images increases while its variance decreases. The improvement is largest when $M$ increases from 1 to 5, with a gain of approximately $2.3$ dB ($31.34$ vs. $33.61$). When $M$ increases from 20 to 50, the improvement becomes marginal, and the error maps confirm that the difference between the estimates for $M=20$ and $M=50$ is small.
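The ensembling step at inference can be sketched as follows. This is a minimal illustration under stated assumptions: a simple average is assumed as the ensembling rule, and `placeholder_denoiser` (nearest-neighbour upsampling) is a hypothetical stand-in for the trained SR network, so the numbers it prints only illustrate the trend with $M$, not the reported PSNR values.

```python
import numpy as np

def random_subsample(img, s, rng):
    """One randomly located pixel per s x s cell (illustrative sub-sampler)."""
    hs, ws = img.shape[0] // s, img.shape[1] // s
    cells = img[:hs * s, :ws * s].reshape(hs, s, ws, s)
    di = rng.integers(0, s, size=(hs, ws))
    dj = rng.integers(0, s, size=(hs, ws))
    ii, jj = np.meshgrid(np.arange(hs), np.arange(ws), indexing="ij")
    return cells[ii, di, jj, dj]

def placeholder_denoiser(sub, s):
    """Stand-in for the trained SR network: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(sub, s, axis=0), s, axis=1)

def mmse_estimate(noisy, m, s=2, seed=0):
    """Average m full-size outputs obtained from m independent random sub-samples."""
    rng = np.random.default_rng(seed)
    outputs = [placeholder_denoiser(random_subsample(noisy, s, rng), s)
               for _ in range(m)]
    return np.mean(outputs, axis=0)

# Toy data (an assumption): constant clean image plus Gaussian noise.
rng = np.random.default_rng(42)
clean = np.full((64, 64), 0.5)
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)
for m in (1, 5, 20, 50):
    mse = np.mean((mmse_estimate(noisy, m) - clean) ** 2)
    print(m, round(float(mse), 5))
```

Even with this crude stand-in, the error drops quickly for the first few samples and then flattens, which mirrors the diminishing returns observed between $M=20$ and $M=50$.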

IV-F3 Influence of Encoder Networks

To further explore the influence of the encoder network on the performance of Noise2SR, we compare the U-Net-based encoder with two well-known transformer-based backbones, Uformer [52] and Restormer [53]. Additionally, to better demonstrate the advantage of Noise2SR, we also compare against Noise2Self with the same three encoders and against a state-of-the-art blind-spot network (BSN), FBI-Net [17].

In the experiments, we implement Uformer and Restormer with the same parameters as in the original papers. We evaluate their performance on two noisy samples with different PSNRs (3 dB and 10 dB) and compare their computational complexity. The quantitative results are shown in Table III. In Noise2SR, the Uformer shows a slight performance degradation compared to the U-Net-based encoder, while the Restormer improves PSNR by approximately 1.5 dB and 2 dB on the two samples, respectively. In contrast, the transformer-based encoders in Noise2Self bring no consistent PSNR gain over the U-Net encoder and lower the SSIM on both samples. One possible explanation is that the mask-based strategy in Noise2Self is not suitable for training $\mathcal{J}$-invariant denoisers with transformer encoders: since self-attention is not $\mathcal{J}$-invariant, the transformer may overfit to mapping the replaced values back to the noisy pixels, resulting in performance degradation. In addition, for HREM images with highly structured features, CNN-based networks can benefit from their inductive bias. Noise2SR therefore enjoys more flexibility in the choice of network structure. It is worth noting that the sub-sampling operation significantly reduces the computational complexity when a transformer-based encoder is used (e.g., 11.59 G vs. 35.21 G MACs for Restormer).

TABLE III: Quantitative comparisons (PSNR/SSIM) of Noise2Self and Noise2SR with three distinct encoder networks and FBI-Net on two noisy samples. The computational complexity of different networks is also compared. MACs are computed on a patch size of $128\times 128$.

SSL Strategy	Encoder	3 dB	10 dB	MACs (G)
Noise2Self	Unet	25.20/0.7455	33.44/0.9306	4.61
	Uformer	25.56/0.6754	31.78/0.8899	10.24
	Restormer	26.17/0.6374	33.16/0.8665	35.21
Noise2SR	Unet	29.75/0.9595	35.89/0.9881	6.82
	Uformer	29.53/0.9563	34.98/0.9838	4.15
	Restormer	31.25/0.9630	37.84/0.9851	11.59
FBI-Net	FBI-Net	25.75/0.7434	34.33/0.9379	6.56

IV-F4 Influence of Different Upsampling Strategies

To evaluate the impact of the upsampling strategy on the denoising performance of Noise2SR, we compare three versions of the model with different upsampling techniques: 1) transposed convolution (TransConv.); 2) pixel shuffling (PixelShuff.); 3) bilinear interpolation (BiInter.). Each model is evaluated by denoising two noisy samples with PSNRs of 3 dB and 10 dB, respectively.

Table IV shows the quantitative results. For the test sample with a PSNR of 3 dB, the three upsampling methods perform comparably (within 0.2 dB), with pixel shuffling showing a slight advantage. For the test sample with a PSNR of 10 dB, transposed convolution is at least 0.3 dB better than the other two methods. Notably, bilinear interpolation, which has no trainable parameters, achieves comparable performance with the added benefit of a shorter inference time (4.43 s vs. 5.48 s and 5.84 s). This suggests that Noise2SR remains effective even with simple upsampling techniques.

TABLE IV: Ablation on different upsampling strategies in the SR decoder. Quantitative results (PSNR/SSIM) are evaluated on two noisy samples with different SNRs from the simulated TEM dataset; Time denotes the inference time. The best performance is highlighted in bold.

Upsampling	3 dB	10 dB	Time
TransConv.	27.36/0.9292	36.22/0.9894	5.84s
PixelShuff.	27.51/0.9349	35.89/0.9881	5.48s
BiInter.	27.31/0.9325	35.59/0.9876	4.43s
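Of the three strategies, pixel shuffling [43] is perhaps the least self-explanatory: it rearranges $r^{2}$ feature channels into an $r$-times larger spatial grid. The NumPy sketch below illustrates this rearrangement only; the function name `pixel_shuffle` and the channel ordering (chosen to match the common sub-pixel convolution convention) are assumptions, and the actual SR decoder in Noise2SR may be configured differently.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r) (illustrative)."""
    c_in, h, w = x.shape
    c = c_in // (r * r)
    # Split channels into (C, r, r), then interleave the two r-axes with H and W.
    x = x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * r, w * r)

# Four 2 x 2 feature maps become one 4 x 4 map for upscaling factor r = 2.
features = np.arange(16).reshape(4, 2, 2)
up = pixel_shuffle(features, r=2)
print(up.shape)  # (1, 4, 4)
print(up[0, :2, :2])  # top-left output block holds pixel (0, 0) of each input channel
```

Transposed convolution and pixel shuffling both rely on learned weights before or during upsampling, whereas bilinear interpolation is a fixed operation, which is consistent with its shorter inference time in Table IV.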

V Conclusion

In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) method for denoising HREM images. We introduce the Random Sub-sampler to generate paired noisy images with different resolutions, preserving structural information while altering high-frequency signal-dependent noise. This enables effective noise removal through supervised training using paired noisy images. By employing a super-resolution strategy, the network generates denoised results of the original image size from sub-sampled noisy inputs. The generation of paired noisy images serves as data augmentation to mitigate overfitting in single-image denoising. Furthermore, we present an approximate MMSE estimation of the clean signal by ensembling multiple denoised results from randomly sub-sampled noisy images of a single input, which further enhances the denoising performance. The training strategy and MMSE estimation enable Noise2SR to achieve superior denoising performance using only a single ultra-low-SNR HREM image. Experimental results on synthetic and real HREM data demonstrate that Noise2SR outperforms state-of-the-art ZS-SSL methods and achieves performance comparable to supervised methods.

Despite achieving state-of-the-art performance in zero-shot self-supervised denoising for HREM images, Noise2SR still has certain limitations. First, random sub-sampling causes a loss of high-frequency information in the input noisy image. Consequently, when denoising images rich in details, Noise2SR with MMSE estimation tends to produce relatively smooth results. Second, the zero-shot DL denoiser requires minutes of training time, which prevents real-time denoising. Despite these limitations, we believe that this work has the potential to improve the signal-to-noise ratio of images in materials imaging domains.

References

[1] M. O’Keefe, P. Buseck, and S. Iijima, “Computed crystal structure images for high resolution electron microscopy,” Nature, vol. 274, no. 5669, pp. 322–324, 1978.
[2] J. C. Spence and A. V. Crewe, “Experimental high-resolution electron microscopy,” 1981.
[3] C. Kisielowski, B. Freitag, M. Bischoff, H. Van Lin, S. Lazar, G. Knippels, P. Tiemeijer, M. van der Stam, S. von Harrach, M. Stekelenburg et al., “Detection of single atoms and buried defects in three dimensions by aberration-corrected electron microscope with 0.5-Å information limit,” Microscopy and Microanalysis, vol. 14, no. 5, pp. 469–477, 2008.
[4] N. I. Kato, “Reducing focused ion beam damage to transmission electron microscopy samples,” Journal of electron microscopy, vol. 53, no. 5, pp. 451–458, 2004.
[5] N. Jiang, “Electron beam damage in oxides: a review,” Reports on Progress in Physics, vol. 79, no. 1, p. 016501, 2015.
[6] A. Faruqi and G. McMullan, “Direct imaging detectors for electron microscopy,” Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 878, pp. 180–190, 2018.
[7] P. Ercius, I. Johnson, H. Brown, P. Pelz, S.-L. Hsu, B. Draney, E. Fong, A. Goldschmidt, J. Joseph, J. Lee et al., “The 4D camera–an 87 kHz frame-rate detector for counted 4D-STEM experiments,” Microscopy and Microanalysis, vol. 26, no. S2, pp. 1896–1897, 2020.
[8] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE transactions on image processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[9] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4539–4547.
[10] K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
[11] J. Batson and L. Royer, “Noise2self: Blind denoising by self-supervision,” in International Conference on Machine Learning. PMLR, 2019, pp. 524–533.
[12] A. Krull, T.-O. Buchholz, and F. Jug, “Noise2void-learning denoising from single noisy images,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2129–2137.
[13] S. Laine, T. Karras, J. Lehtinen, and T. Aila, “High-quality self-supervised deep image denoising,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[14] S. Cha and T. Moon, “Fully convolutional pixel adaptive image denoiser,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4160–4169.
[15] Y. Quan, M. Chen, T. Pang, and H. Ji, “Self2self with dropout: Learning self-supervised denoising from single image,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1890–1898.
[16] Y. Xie, Z. Wang, and S. Ji, “Noise2same: Optimizing a self-supervised bound for image denoising,” Advances in neural information processing systems, vol. 33, pp. 20 320–20 330, 2020.
[17] J. Byun, S. Cha, and T. Moon, “Fbi-denoiser: Fast blind image denoiser for poisson-gaussian noise,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5768–5777.
[18] N. Moran, D. Schmidt, Y. Zhong, and P. Coady, “Noisier2noise: Learning to denoise from unpaired noisy data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 064–12 072.
[19] T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2neighbor: A self-supervised framework for deep image denoising,” IEEE Transactions on Image Processing, vol. 31, pp. 4023–4038, 2022.
[20] X. Wu, M. Liu, Y. Cao, D. Ren, and W. Zuo, “Unpaired learning of deep image denoising,” in European conference on computer vision. Springer, 2020, pp. 352–368.
[21] J. Lequyer, R. Philip, A. Sharma, W.-H. Hsu, and L. Pelletier, “A fast blind zero-shot denoiser,” Nature Machine Intelligence, vol. 4, no. 11, pp. 953–963, 2022.
[22] Y. Mansour and R. Heckel, “Zero-shot noise2noise: Efficient image denoising without any data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 018–14 027.
[23] T. Pang, H. Zheng, Y. Quan, and H. Ji, “Recorrupted-to-recorrupted: Unsupervised deep learning for image denoising,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2043–2052.
[24] X. Tian, Q. Wu, H. Wei, and Y. Zhang, “Noise2sr: Learning to denoise from super-resolved single noisy fluorescence image,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VI. Springer, 2022, pp. 334–343.
[25] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in International Conference on Machine Learning. PMLR, 2018, pp. 2965–2974.
[26] W. Lee, S. Son, and K. M. Lee, “Ap-bsn: Self-supervised denoising for real-world images via asymmetric pd and blind-spot network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 725–17 734.
[27] J. Xu, Y. Huang, M. Cheng, L. Liu, F. Zhu, Z. Xu, and L. Shao, “Noisy-as-clean: Learning self-supervised denoising from corrupted image.” IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society, 2020.
[28] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9446–9454.
[29] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6360–6376, 2021.
[30] X. Pan, X. Zhan, B. Dai, D. Lin, C. C. Loy, and P. Luo, “Exploiting deep generative prior for versatile image restoration and manipulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7474–7489, 2021.
[31] Y. Wang, J. Yu, and J. Zhang, “Zero-shot image restoration using denoising diffusion null-space model,” in The Eleventh International Conference on Learning Representations, 2022.
[32] B. Li, Y. Gou, J. Z. Liu, H. Zhu, J. T. Zhou, and X. Peng, “Zero-Shot Image Dehazing,” IEEE Transactions on Image Processing, vol. 29, pp. 8457–8466, Aug. 2020.
[33] B. Li, Y. Gou, S. Gu, J. Z. Liu, J. T. Zhou, and X. Peng, “You Only Look Yourself: Unsupervised and Untrained Single Image Dehazing Neural Network,” International Journal of Computer Vision, pp. 1–14, Mar. 2021.
[34] Y.-J. Chen, Y.-J. Chang, S.-C. Wen, Y. Shi, X. Xu, T.-Y. Ho, Q. Jia, M. Huang, and J. Zhuang, “Zero-shot medical image artifact reduction,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 862–866.
[35] J. M. Ede, “Deep learning in electron microscopy,” Machine Learning: Science and Technology, vol. 2, no. 1, p. 011004, 2021.
[36] S. V. Kalinin, C. Ophus, P. M. Voyles, R. Erni, D. Kepaptsoglou, V. Grillo, A. R. Lupini, M. P. Oxley, E. Schwenker, M. K. Chan et al., “Machine learning in scanning transmission electron microscopy,” Nature Reviews Methods Primers, vol. 2, no. 1, p. 11, 2022.
[37] T. M. Quan, D. G. C. Hildebrand, K. Lee, L. A. Thomas, A. T. Kuan, W.-C. A. Lee, and W.-K. Jeong, “Removing imaging artifacts in electron microscopy using an asymmetrically cyclic adversarial network without paired training data,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 2019, pp. 3804–3813.
[38] F. Wang, T. R. Henninen, D. Keller, and R. Erni, “Noise2atom: unsupervised denoising for scanning transmission electron microscopy images,” Applied Microscopy, vol. 50, no. 1, pp. 1–9, 2020.
[39] X. Chong, M. Cheng, W. Fan, Q. Li, and H. Leung, “M-denoiser: Unsupervised image denoising for real-world optical and electron microscopy data,” Computers in Biology and Medicine, vol. 164, p. 107308, 2023.
[40] R. Lin, R. Zhang, C. Wang, X.-Q. Yang, and H. L. Xin, “Temimagenet training library and atomsegnet deep-learning models for high-precision atom segmentation, localization, denoising, and deblurring of atomic-resolution images,” Scientific reports, vol. 11, no. 1, p. 5386, 2021.
[41] S. Mohan, R. Manzorro, J. L. Vincent, B. Tang, D. Y. Sheth, E. P. Simoncelli, D. S. Matteson, P. A. Crozier, and C. Fernandez-Granda, “Deep denoising for scientific discovery: A case study in electron microscopy,” IEEE Transactions on Computational Imaging, vol. 8, pp. 585–597, 2022.
[42] Y. Zhou, J. Jiao, H. Huang, Y. Wang, J. Wang, H. Shi, and T. Huang, “When awgn-based denoiser meets real noises,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13 074–13 081.
[43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.
[44] J. S. Lim, “Two-dimensional signal and image processing,” Englewood Cliffs, 1990.
[45] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Transactions on image processing, vol. 16, no. 8, pp. 2080–2095, 2007.
[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[47] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
[48] A. K. Boyat and B. K. Joshi, “A review paper: Noise models in digital image processing,” Signal & Image Processing, vol. 6, no. 2, p. 63, 2015.
[49] P. Sarder and A. Nehorai, “Deconvolution methods for 3-d fluorescence microscopy images,” IEEE signal processing magazine, vol. 23, no. 3, pp. 32–45, 2006.
[50] W. Meiniel, J.-C. Olivo-Marin, and E. D. Angelini, “Denoising of microscopy images: a review of the state-of-the-art, and a new sparsity-based method,” IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3842–3856, 2018.
[51] N. Bähler, M. El Helou, É. Objois, K. Okumuş, and S. Süsstrunk, “Pogain: Poisson-gaussian image noise modeling from paired samples,” IEEE Signal Processing Letters, vol. 29, pp. 2602–2606, 2022.
[52] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17 683–17 693.
[53] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5728–5739.

$$\begin{aligned}
\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{y}_{J}\right\|_{2}^{2}
&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}-\mathbf{n}_{J}\right\|_{2}^{2}\\
&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\|f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J}\right\|_{2}^{2}+\sigma^{2}\\
&\quad-2\,\mathbb{E}_{\mathbf{x},\mathbf{y}}\langle f_{\theta}(\mathbf{y}_{J^{\complement}})_{J}-\mathbf{x}_{J},\mathbf{n}_{J}\rangle.
\end{aligned}\tag{4}$$