License: arXiv.org perpetual non-exclusive license
arXiv:2312.16455v1 [eess.IV] 27 Dec 2023

Learn From Orientation Prior for Radiograph Super-Resolution: Orientation Operator Transformer

Yongsong Huang Tomo Miyazaki Xiaofeng Liu Kaiyuan Jiang Zhengmi Tang Shinichiro Omachi
Abstract

Background and objective: High-resolution radiographic images play a pivotal role in the early diagnosis and treatment of skeletal muscle-related diseases. Introducing single-image super-resolution (SISR) models into the radiology image field is a promising way to enhance image quality. However, the conventional image pipeline, which can learn a mixed mapping between SR and denoising from the color space and inter-pixel patterns, poses a particular challenge for radiographic images with limited pattern features. To address this issue, this paper introduces a novel approach: Orientation Operator Transformer ($O^{2}$former). Methods: We incorporate an orientation operator in the encoder to enhance sensitivity to denoising mapping and to integrate the orientation prior. Furthermore, we propose a multi-scale feature fusion strategy to amalgamate features captured by different receptive fields with the directional prior, thereby providing a more effective latent representation for the decoder. Based on these innovative components, we propose a transformer-based SISR model, i.e., $O^{2}$former, specifically designed for radiographic images. Results: The experimental results demonstrate that our method achieves the best or second-best performance in the objective metrics compared with the competitors at the $\times 4$ upsampling factor. Qualitatively, more objective details are observed to be recovered. Conclusions: In this study, we propose a novel framework called $O^{2}$former for radiological image super-resolution tasks, which improves the reconstruction model’s performance by introducing an orientation operator and a multi-scale feature fusion strategy. Our approach is promising for further advancing the radiographic image enhancement field.

keywords:
Radiographs; Super-Resolution; Orientation Feature; Feature Fusion
journal: Computer Methods and Programs in Biomedicine

[label1] Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai, 9808579, Japan

[label2] Gordon Center for Medical Imaging, Harvard Medical School, Boston, 02114, USA

[label3] Department of Surgery, Tohoku University Graduate School of Medicine, Sendai, 80 8575, Japan

1 Introduction

Radiographs, often referred to as X-rays, are a cornerstone of musculoskeletal medicine, being instrumental in diagnosing and managing a multitude of diseases. For instance, they play a pivotal role in the evaluation of fractures. They not only illustrate the fracture line’s location, orientation, and displacement, but also detect joint involvement and identify signs of complications, such as open fractures and infections, e.g., gas gangrene. Radiographs are also adept at revealing indications of osteoporosis, including decreased bone density, trabecular thinning, or cortical thinning (Vives, 2006; Chen et al., 2013; Mc Donnell et al., 2007; Zhou et al., 2015). In the context of osteoarthritis, radiographs can pinpoint key features, including joint space narrowing, osteophyte formation, subchondral sclerosis, and cartilage wear (Turlington, 2003; Adepu et al., 2022; Ying et al., 2023). Moreover, they can unveil the classic presentations of musculoskeletal tuberculosis, such as bone destruction and periosteal reaction (Miyamoto et al., 2007; Hu et al., 2022; Shen and Lv, 2022). However, the inherent low resolution of radiographic images may lead to diagnostic inaccuracies or oversights. There is a significant need to improve the resolution of these images to ensure more accurate and reliable diagnoses (Shin et al., 2023; Qiu et al., 2022; Zhu et al., 2023; Huang et al., 2023).

To address the issue that the resolution of radiographic images is limited, single image super-resolution (SISR) is attracting increasing attention in this field. The goal of SISR is to recover matched high-resolution (HR) images $I_{HR}$ from degraded low-resolution (LR) images $I_{LR}$, formalized as follows:

$$I_{LR}=\mathbb{D}\left(I_{HR};\delta\right), \tag{1}$$

where $\mathbb{D}$ denotes degradation and $\delta$ is the degradation process parameter. SISR is often considered an ill-posed problem, since paired images usually have to be restored from an unknown degradation (Wang et al., 2020; Park et al., 2003). Further, the degradation process is shown in Eq. 2:

$$\mathbb{D}\left(I_{HR};\delta\right)=\left(I_{HR}\otimes\kappa\right)\downarrow_{d}+n_{\varsigma}, \tag{2}$$

where $\kappa$ denotes the blur kernel in the degradation, and $\downarrow_{d}$ denotes downsampling with factor $d$. In general, $n_{\varsigma}$ is considered additive noise (Chen et al., 2022; Huang et al., 2022).
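
As an illustration of Eq. 2, the degradation can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the data pipeline used in this paper: the Gaussian form of $\kappa$, the odd kernel size, the Gaussian noise model, and the function names `gaussian_kernel`, `blur`, and `degrade` are hypothetical choices made only for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel (assumed form of kappa; size must be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def blur(img, kernel):
    """Same-size 2-D convolution of a grayscale image with edge padding."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    # Flip the kernel so that this is a convolution rather than a correlation.
    return np.einsum("ijkl,kl->ij", windows, kernel[::-1, ::-1])

def degrade(hr, kernel, d=4, noise_sigma=0.01, rng=None):
    """I_LR = (I_HR convolved with kappa), downsampled by d, plus additive noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    blurred = blur(hr, kernel)           # I_HR (x) kappa
    lr = blurred[::d, ::d]               # keep every d-th pixel (assumed decimation)
    return lr + rng.normal(0.0, noise_sigma, size=lr.shape)  # + n (assumed Gaussian)
```

For a 32×32 input and `d=4`, `degrade` returns an 8×8 array. Plain decimation is used here for brevity; a bicubic or area-based resampler would be an equally valid reading of $\downarrow_{d}$.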

With the development of deep learning, it has become popular to use deep learning models in SISR tasks. First, convolutional neural network-based (CNN-based) approaches were proposed (Dong et al., 2014; Kim et al., 2016; Jiang et al., 2021). These methods use CNNs as feature extractors, learn the nonlinear mapping between LR and HR images, and finally reconstruct SR images from the features captured in latent space. CNN-based models tend to capture more latent representations with the help of attention mechanisms (Zhang et al., 2018) and deeper networks (Behjati et al., 2021). Then, generative models were introduced to further advance the field. Compared with CNN-based models, generative models can recover more edge detail and texture information, owing to their different optimization paradigm. Among generative models, the widely studied generative adversarial networks (GANs) are remarkable (Goodfellow et al., 2020; Ledig et al., 2017; Huang et al., 2021). The adversarial generation strategy helps the model generate diverse samples, so that it no longer relies only on receptive fields and deep neural networks to improve the generated image quality (Wang et al., 2018, 2021). However, the risk of mode collapse in such methods can complicate model training (Gulrajani et al., 2017). Recently, rapidly developing transformer models have become the new paradigm in computer vision. The transformer is based on the self-attention mechanism, and its ability to capture long-range dependencies enables excellent performance in SR tasks (Yang et al., 2020; Lu et al., 2022). To balance local feature extraction and global information reconstruction, CNN encoders combined with transformer decoders are becoming increasingly popular (Gao et al., 2022). This kind of approach uses the CNN as an encoder whose receptive field is well suited to capturing shallow features, exploits the long-range information reconstruction capability of the transformer, and finally achieves better performance in SISR tasks.

However, these methods face several challenges in the radiographic field. First, the methods mentioned above focus on natural image reconstruction, such as RGB images. They usually concentrate on the reconstruction between LR-HR pairs, and the blur component mixed into such mappings is not sufficiently explored. Radiological images, however, usually involve more challenging blur in practical applications. For example, breathing motion is common and displaces the patient relative to the device. This blurring is widely observed, especially in young patients or patients with tremor disorders (Qiu et al., 2023, 2022; Zhu and Qiu, 2021). Therefore, the negative influence of noise, represented by motion blur, on imaging quality is a serious concern. Second, relying on a simple convolutional approach or attention mechanism for shallow feature extraction limits the feature representation capability of the model. Previous approaches (Yang et al., 2020; Lu et al., 2022) have tended to improve the decoder, i.e., the transformer model. For CNN-based encoders, improving the ability to capture local information can contribute to the quality of SR images by providing a better latent representation for the decoder.

To further improve the image quality in super-resolution radiography, we propose the Orientation Operator Transformer for this task. Our contributions are summarized as follows:

  1. We first propose the Orientation Operator for enhancing CNN-based shallow feature extraction modules. This novel operator focuses on the prior knowledge of horizontal and vertical directions and introduces it into local feature extraction. The different orientation priors help the encoder capture shallow features for a better latent representation, and further help the decoder learn a better nonlinear mapping for image reconstruction. To the best of our knowledge, it is the first model focusing on the orientation prior in the radiographic super-resolution task.

  2. We also propose a multi-scale feature fusion strategy for radiographic images. This strategy considers different convolution methods that help to capture shallow features with more diverse local features. The diverse shallow representation helps the decoder better reconstruct SR images. Further ablation experiments demonstrate the effectiveness of the feature fusion strategy.

  3. Finally, we propose the end-to-end Orientation Operator Transformer for the super-resolution task in radiographic images. This model includes two components: the CNN encoder for shallow feature extraction and the transformer-based decoder that reconstructs the image by connecting global information. Compared with previous approaches, the Orientation Operator Transformer focuses more on the nonlinear mapping for blur in radiological image reconstruction. According to the experimental results, our method achieves better performance than competitors in both objective metrics and qualitative studies.

The remainder of this paper is organized as follows: Section 2 presents related work on deep learning models and orientation priors. Section 3 presents the key components of our proposed Orientation Operator Transformer in detail. The qualitative and quantitative evaluation results are provided in Section 4. The conclusions are given in Section 5.

2 Related Work

In this section, representative work based on deep learning in SISR tasks will be discussed first. These methods will include the following: CNN-based, GAN-based, and transformer-based. Highlighted work will be reviewed and presented. Further, studies in orientation priors will also be shown. Then, the horizontal and vertical prior’s application in different visual tasks will be discussed. The details about this prior knowledge in denoising mapping will also be described.

Deep learning-based models. These models aim to reconstruct images by learning a nonlinear mapping between paired data in latent space by using deep learning models. Unlike previous approaches, their advantage is that they do not rely on complicated prior knowledge and can fit the data distribution with the model. The objective function is shown as follows:

$$\hat{\theta}=\underset{\theta}{\arg\min}\,\mathcal{L}\left(I_{SR},I_{HR}\right)+\lambda\Phi(\theta), \tag{3}$$

where $\mathcal{L}\left(I_{SR},I_{HR}\right)$ represents the loss function between the generated SR image $I_{SR}$ and the ground truth image $I_{HR}$, $\Phi(\theta)$ is the regularization term, and $\lambda$ is the tradeoff parameter (Wang et al., 2020). Dong et al. (2014) introduced the first deep learning model into the SISR domain. The convolution operator in this shallow model extracts features from LR images, and the SR images are finally obtained by upsampling. This work exceeded other methods at that time and achieved the best performance. Further, Fast Super-Resolution Convolutional Neural Networks (FSRCNN) were proposed to reduce model complexity (Dong et al., 2015). FSRCNN improves computational efficiency while maintaining comparable performance. Such models are more popular in applications where computing power is limited. Subsequent CNN-based models (Jiang et al., 2021) mainly focus on deepening the network to improve performance, and introducing attention mechanisms (Zhang et al., 2018) is another popular trend.
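
The objective in Eq. 3 can be made concrete with a small numerical sketch. The choices below are assumptions made only for illustration: an L1 pixel loss stands in for $\mathcal{L}$, a squared L2 norm of the parameters stands in for $\Phi(\theta)$, and the function name `objective` is hypothetical. The source does not state which loss or regularizer is used at this point.

```python
import numpy as np

def objective(sr, hr, params, lam=1e-4):
    """Value of L(I_SR, I_HR) + lambda * Phi(theta) for one image pair.

    sr, hr : arrays of identical shape (generated and ground-truth images)
    params : iterable of parameter arrays (theta)
    lam    : tradeoff parameter lambda
    """
    data_term = np.mean(np.abs(sr - hr))             # assumed L1 loss
    reg_term = sum(np.sum(p ** 2) for p in params)   # assumed L2 regularizer
    return data_term + lam * reg_term
```

Training then amounts to searching for the $\theta$ that minimizes this value over the dataset, typically with a gradient-based optimizer.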

As the community proposed generative models, such as generative adversarial networks and variational autoencoders, it has become a trend to introduce these models into SISR tasks (Ledig et al., 2017; Huang et al., 2021, 2021, 2023). The SRGAN model first proposed by Ledig et al. achieved remarkable performance using an adversarial training strategy. According to the experimental results, the texture details at the edges of SR images were well restored. However, the risk of mode collapse in this kind of method is a challenge for researchers. To address this issue, researchers proposed using a new divergence in the discriminator (Gulrajani et al., 2017; Huang et al., 2021). The experimental results showed that the improved model is more stable during training and less prone to mode collapse. Further work focused on enhancing the generator, and more powerful modules were designed and introduced (Huang et al., 2022).

Recently, transformer models have attracted attention from the research community. These approaches, which originate from natural language processing, decode latent representations better in SISR tasks thanks to their ability to capture long-range dependencies (Han et al., 2020; Vaswani et al., 2017). However, receptive field limits prevent these methods from achieving the expected performance on local feature extraction. Thus, as a new paradigm, a CNN-based encoder is used for shallow feature encoding, and its latent output representation is fed to a transformer-based decoder (Gao et al., 2022).

Orientation priors. In low-level computer vision tasks, combining the complex structure/texture elements of an image in different orientations (e.g., horizontal and vertical) is considered to help the model fit the data distribution with local features (He et al., 2023; Lin et al., 2023). For example, nucleus segmentation methods feed this prior to the classifier to enhance the response to local features (Vo and Kim, 2023; Dogar et al., 2023). Further, in natural image restoration tasks, orientation-aware features are also fused into the input features to improve the model’s representational capability (He et al., 2023). Moreover, there is increasing interest in the orientation feature prior in the denoising field (Tsai et al., 2022; Sun et al., 2015; Huang et al., 2023).

In this study, we introduce the orientation prior to enhance the encoder’s ability to capture local features, which helps the decoder better express the blur mapping and reconstruct images. Compared with natural image recovery, radiological imaging faces more challenges from blur. Previous approaches also make the model learn nonlinear mappings that mix LR-HR reconstruction and blur. Models for natural images can learn enough from diverse color patterns to infer the blur orientation between pixels (Tsai et al., 2022), whereas the limited colors and features in radiographic images introduce more difficulties. To address this issue, paying further attention to orientation features is a promising way to enhance the model’s ability to represent blur, which benefits the reconstruction of high-quality images. Next, we describe how a CNN estimates blur from the orientation prior.

Given a blurry image, denoted as $I$, we define the local motion blur kernel at a specific image pixel $p\in\Omega$ (where $\Omega$ signifies the image region) using a motion vector $\mathbf{m}_{\mathbf{p}}=\left(l_{p},o_{p}\right)$. This vector encapsulates the magnitude and direction of the motion field at point $p$ during the camera shutter’s open phase. Each motion vector gives rise to a motion kernel, which possesses non-zero values exclusively along the trajectory of the motion. Consequently, the blurred image can be depicted as $I=k(M)*I_{0}$, which is to say, the convolution of an underlying sharp image $I_{0}$ with the non-uniform motion blur kernels $k(M)$, which are dictated by the motion field $M=\left\{\mathbf{m}_{\mathbf{p}}\right\}_{p\in\Omega}$. We also depict the motion vector $\mathbf{m}_{\mathbf{p}}$ as $\left(u_{p},v_{p}\right)$ within the framework of the Cartesian coordinate system, following the transformation:

$$u_{p}=l_{p}\cos\left(o_{p}\right),\quad v_{p}=l_{p}\sin\left(o_{p}\right) \tag{4}$$

To summarize, we can use priors on different orientations to estimate the blur $\mathbf{m}_{\mathbf{p}}$, considering the two components of the motion blur, $\left(u_{p},v_{p}\right)$. This approach does not make any global parametric assumptions (e.g., homography) about the motion, and uses local image regions to predict these kernels (Sun et al., 2015). In our approach, the encoder’s output is the latent representation with better blur estimation, which is fed to the decoder to learn better nonlinear mappings.
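
To make Eq. 4 and the notion of a motion kernel concrete, the following NumPy sketch converts a motion vector $(l_{p},o_{p})$ into its Cartesian components and rasterizes a kernel that is non-zero only along the motion trajectory. It is an illustration under simplifying assumptions (a straight, centered trajectory, nearest-pixel rasterization, orientation in radians); the function names and the `size` and `samples` parameters are hypothetical and do not come from the cited work.

```python
import numpy as np

def motion_vector_to_cartesian(length, orientation):
    """Eq. 4: (l_p, o_p) -> (u_p, v_p), with the orientation in radians."""
    return length * np.cos(orientation), length * np.sin(orientation)

def linear_motion_kernel(length, orientation, size=15, samples=200):
    """Normalized kernel with non-zero weights only along the motion path."""
    u, v = motion_vector_to_cartesian(length, orientation)
    kernel = np.zeros((size, size))
    center = size // 2
    t = np.linspace(-0.5, 0.5, samples)             # parameter along the segment
    cols = np.rint(center + t * u).astype(int)
    rows = np.rint(center - t * v).astype(int)      # image rows grow downward
    inside = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    np.add.at(kernel, (rows[inside], cols[inside]), 1.0)  # accumulate path samples
    return kernel / kernel.sum()
```

With `orientation=0.0` the kernel is confined to the central row, and with `orientation=np.pi / 2` to the central column, which corresponds to the purely horizontal and purely vertical components $u_{p}$ and $v_{p}$.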

3 Methodology

In this section, we first provide an overview of the proposed Orientation Operator Transformer, $O^{2}$former, for radiograph super-resolution. Then, we present the detailed configuration of its two main components: the orientation operator and the multi-scale feature fusion strategy. Finally, training strategies will also be demonstrated.

Figure 1: An overview of our proposed $O^{2}$former. In $O^{2}$former, the encoder $E$ first captures shallow local features $z\in\mathbb{R}^{H\times W\times C}$, and the output $z$ is then fed to the decoder $D$. Our proposed orientation operator $\mathcal{O}$ aims to fuse more orientation priors, both horizontal $h\sim P_{hdir}(\hat{\theta})$ and vertical $v\sim P_{vdir}(\tilde{\theta})$, into the latent representations. Then, the nonlinear mapping between LR-HR pairs is learned in the decoder $D$ by parameter optimization. $O^{2}$former takes the input $x\sim P_{data}(x)$ and outputs the SR image $y\sim P_{sr}(y)$. $\sigma$ denotes the sigmoid function. Zoom in for the best view.

3.1 Network Architecture

To improve the quality of radiographic SR images, we propose an end-to-end model, $O^{2}$former, comprising an encoder $E$ and a decoder $D$ (see Fig. 1). On the one hand, local shallow features are captured and encoded as latent representations in the latent space by $E$. In our approach, the orientation operator $\mathcal{O}$ and the multi-scale feature fusion strategy are proposed to help $E$ output higher-quality latent representations. The advantages are as follows. First, unlike natural images, radiographic images are poor in color space and pattern information. Therefore, it is difficult for models applied to radiographic images to learn the blur mapping, which influences the reconstructed image quality, from diverse inter-pixel patterns as they do for natural images. Enhanced by $\mathcal{O}$ (more details of the proposed orientation operator are provided in Section 3.2), more orientation priors are captured and fused into the shallow feature information. Moreover, the multi-scale feature fusion strategy is introduced to further enhance the model’s response to local features at different scales (more details of the proposed strategy are provided in Section 3.3). Finally, $E$ is enhanced by these two novel components to output better latent representations.

On the other hand, the latent representations from the encoder are fed to $D$. The decoder of $O^{2}$former is based on the transformer model to capture long-distance dependent pattern information. In $D$, the nonlinear mapping between LR and HR images is learned. The blur mappings that were previously mixed in are also better represented because the input carries more blur estimation priors. As a result, the parameters are optimized for the objective function, and $O^{2}$former is able to restore SR images from LR images.

3.2 Orientation Operator

The orientation prior is widely employed as a notable image statistical feature to enhance model performance in various vision tasks. The complex structural information in the image spatial domain is considered to include different combinations of textural complexities in various orientations (He et al., 2023; Lin et al., 2023). Previous approaches have further proposed that image prior knowledge of different orientations, horizontal and vertical, helps the model better fit the data distribution (Tsai et al., 2022; Sun et al., 2015).

In this work, we propose a novel orientation operator $\mathcal{O}\left[\mathcal{O}_{v},\mathcal{O}_{h}\right]$ for radiographic images to capture the image prior that enhances the latent representation quality. The visualization of textures in different directions is shown in Fig. 2, where the blurring priors in shallow features are better represented with the help of this prior knowledge. It further helps the decoder better capture the nonlinear mappings used for reconstruction, such as blurring. To represent the different orientations, $\mathcal{O}$ includes $\mathcal{O}_{v}$ and $\mathcal{O}_{h}$, which depend on the prior knowledge in the vertical and horizontal directions, respectively. For the vertical orientation prior $v\sim P_{vdir}(\tilde{\theta})$, we first consider the vertically oriented pixel values in the latent variable $\bar{z}\in\mathbb{R}^{H\times W\times C}$ as the objects to be extracted ($\bar{z}$ denotes the feature map that has been further extracted by multiple different convolutions; the details are described in Section 3.3), using the pixel-by-pixel operation as follows:

$$\mathcal{O}_{v}\left(z_{c}\right)=\frac{1}{H}\sum_{j\in[0,H)}\left(z_{c}(W,j)-\Phi^{w}\left(z_{c}\right)\right)^{2} \tag{5}$$

Figure 2: Gradients in the horizontal and vertical directions from a pair of LR images. The first row is from the LR images in the MURA-mini dataset (the downsampling factor is $\times 2$), and the second row is from the $\times 4$ downsampled data in the same dataset. Streak artifacts in different orientations appear in the gradient domain. Best viewed in color.

First, the shallow local features $\bar{z}=\left[z_{1},z_{2},z_{3},\ldots,z_{c}\right]$ are fed to $\mathcal{O}_{v}$ and average pooling $\Phi^{w}$ is performed. $c$ denotes the number of channels; $W$ and $H$ are the width and height, respectively. Further, we want to minimize the variance between pixels, so a pixel-by-pixel variance pooling operation is applied to the feature map $z_{c}$. Our aim is to balance the computational complexity and the spatial direction prior. Similarly, the direction operator $\mathcal{O}_{h}$ for the horizontal direction is defined as:

$$\mathcal{O}_{h}\left(z_{c}\right)=\frac{1}{W}\sum_{i\in[0,W)}\left(z_{c}(i,H)-\Phi^{h}\left(z_{c}\right)\right)^{2} \tag{6}$$

For the horizontal direction, $z_{c}$ is passed through horizontal average pooling $\Phi^{h}$ to produce the initial statistical features for the independent channels. The higher-order directional prior is then obtained by variance pooling. Once the directional priors for the different directions are captured, we fuse these priors, $v\sim P_{vdir}(\tilde{\theta})$ and $h\sim P_{hdir}(\hat{\theta})$, with the latent representations. The fusion details are described in the multi-scale feature fusion strategy.
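
One possible reading of Eqs. 5 and 6 is directional variance pooling: the directional average ($\Phi^{w}$ or $\Phi^{h}$) is taken along one spatial axis, and the mean squared deviation from it is computed along the same axis, independently for each channel. The NumPy sketch below follows that reading. It is an interpretation rather than the reference implementation, and the function name `orientation_operator` and the choice to keep singleton axes are assumptions made for the example.

```python
import numpy as np

def orientation_operator(z):
    """Directional variance pooling on a feature map z of shape (H, W, C).

    Returns
    -------
    o_v : (1, W, C) variance along the vertical direction (over H), per column
    o_h : (H, 1, C) variance along the horizontal direction (over W), per row
    """
    mean_v = z.mean(axis=0, keepdims=True)                  # average pooling over H
    o_v = ((z - mean_v) ** 2).mean(axis=0, keepdims=True)   # (1/H) * sum of squares
    mean_h = z.mean(axis=1, keepdims=True)                  # average pooling over W
    o_h = ((z - mean_h) ** 2).mean(axis=1, keepdims=True)   # (1/W) * sum of squares
    return o_v, o_h
```

For a feature map made of horizontal stripes, values change only when moving vertically, so `o_v` is non-zero while `o_h` is zero; the two outputs therefore separate the vertical and horizontal statistics that the fusion step later combines.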

3.3 Multi-scale Feature Fusion Strategy

In this section, we describe the multi-scale feature fusion strategy $\mathcal{F}$ used in $O^{2}$former. Numerous prior studies (Zhang et al., 2018; Li et al., 2018; Tong et al., 2017) have underscored the critical role of hierarchical features from distinct convolutional stages in augmenting SISR. Our multi-scale feature fusion strategy is designed with a hierarchical structure of different convolution modules to generate orientation-aware features across a variety of scales. Unlike previous strategies, our approach both considers features from different convolutional receptive fields and fuses directional priors. A strategic approach is adopted where orientation priors are tasked with holding a rich set of denoising features, while these patterns benefit the decoder in learning the blurring mapping. Our methodology employs convolution operators of varying scales (e.g., $5\times 5$ and $3\times 3$) to extract shallow features from images at multiple scales. Following this, the Orientation Operator delineates the orientation prior across these scales, as detailed in Section 3.2. Subsequently, a $1\times 1$ convolution is applied to reshape the feature maps to identical dimensions, allowing for their summation with the shallow feature maps, thereby integrating the multi-scale orientation prior with the shallow features (see Fig. 1). This fusion approach emphasizes the assimilation of multi-scale image orientation priors over the shallow features, advancing beyond previous strategies that targeted solely pixel-level feature disparities.
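
The data flow described above (multi-scale convolution, orientation prior per scale, $1\times 1$ projection, summation with the shallow features) can be sketched in NumPy as follows. This is a rough sketch under several assumptions, not the $O^{2}$former implementation: fixed mean filters stand in for the learned $3\times 3$ and $5\times 5$ convolutions, the ShiftConv branch is omitted, the two directional variances are combined by broadcast addition, and all names (`conv_same`, `orientation_prior`, `fuse`, `w_1x1`) are hypothetical.

```python
import numpy as np

def conv_same(z, k):
    """Depthwise k x k mean filter with edge padding (stand-in for a learned conv)."""
    p = k // 2
    padded = np.pad(z, ((p, p), (p, p), (0, 0)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(0, 1))
    return windows.mean(axis=(-2, -1))              # (H, W, C)

def orientation_prior(z):
    """Vertical plus horizontal variance pooling, broadcast back to (H, W, C)."""
    o_v = z.var(axis=0, keepdims=True)              # (1, W, C)
    o_h = z.var(axis=1, keepdims=True)              # (H, 1, C)
    return o_v + o_h                                # assumed way of combining both

def fuse(z, w_1x1, kernel_sizes=(3, 5)):
    """Sum shallow features z with projected multi-scale orientation priors.

    z     : shallow features, shape (H, W, C)
    w_1x1 : weights of the 1x1 convolution, shape (C * len(kernel_sizes), C)
    """
    priors = [orientation_prior(conv_same(z, k)) for k in kernel_sizes]
    stacked = np.concatenate(priors, axis=-1)       # (H, W, C * n_scales)
    projected = stacked @ w_1x1                     # 1x1 conv = per-pixel linear map
    return z + projected                            # residual-style summation
```

With `z` of shape `(8, 8, 4)` and `w_1x1` of shape `(8, 4)`, `fuse` returns an array of shape `(8, 8, 4)`. In a trained model the convolution weights and `w_1x1` would be learned jointly with the decoder.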

After shallow feature extraction $z \in \mathbb{R}^{H \times W \times C}$, we seek to capture high-level features by employing multiple convolution methods in order to help the model represent more orientation-aware features at different scales. Specifically, different convolutional approaches are used, including $\text{Conv}_{3 \times 3}$, $\text{Conv}_{5 \times 5}$, and ShiftConv Zhang et al. (2022). The formalization is as follows:

$$\bar{z} = \left[ C_{3 \times 3}(z),\ C_{5 \times 5}(z),\ C_{\text{shift}}(z) \right] \odot \mathcal{O} \tag{7}$$

$C_{3 \times 3}$ and $C_{5 \times 5}$ denote convolution operations utilizing $3 \times 3$ and $5 \times 5$ kernels, respectively. The symbol $\odot$ signifies the linear operation between the convolution kernel and the latent representation. As illustrated in the proposed pipeline, the initial step involves the extraction of shallow features $z$ through multi-scale representations $\left(C_{3 \times 3}, C_{5 \times 5}, C_{\text{shift}}\right)$. Subsequently, latent representations at varying scales are input into $\mathcal{O}$, facilitating the capture of perceptual priors across different orientations. These advanced radiographic priors are then integrated with the initial image features $x$ in the next stage, yielding enhanced latent representations. The formalization is shown below:

$$\mathcal{F} = \sum \left( \sigma\left(\bar{z} \odot C_{1 \times 1}\right),\ x \right) \tag{8}$$

Particularly, considering the aggregated feature maps generated by Eq.7, $1 \times 1$ convolutional transformations are employed. These transformations are applied independently to the feature maps in $\bar{z}$. $\sigma$ represents the sigmoid function. Following the application of the nonlinear mapping function $\sigma$, the feature maps, now enriched with priors and encompassing various scales, undergo normalization. These normalized values are then added to $x$ on a pixel-by-pixel basis, resulting in a latent representation for the decoder $D$ that incorporates priors from multiple orientations. These enhanced representations help the decoder $D$ learn the mapping between LR-HR pairs, as well as the denoising mapping, more effectively.
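The data flow of Eq.7 and Eq.8 can be sketched on a single-channel toy example. This is only an illustration under strong assumptions: fixed box filters stand in for the learned $3 \times 3$ and $5 \times 5$ convolutions, a one-pixel shift stands in for ShiftConv, the $1 \times 1$ convolution degenerates to one scalar weight per branch because there is only one channel, and `orientation_operator` is a hypothetical placeholder for $\mathcal{O}$ rather than the operator of Section 3.2.

```python
import math


def conv2d_same(img, kernel):
    """Zero-padded 'same' 2-D cross-correlation of an H x W map with a k x k kernel."""
    h, w, k = len(img), len(img[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * img[y][x]
            out[i][j] = acc
    return out


def box_kernel(k):
    """Averaging kernel used here in place of learned convolution weights (assumption)."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def shift_right(img):
    """Stand-in for a shift convolution: move each row one pixel right, zero-filling."""
    return [[0.0] + row[:-1] for row in img]


def orientation_operator(feat):
    """Hypothetical placeholder for O: reweight each pixel by its row mean plus column mean."""
    h, w = len(feat), len(feat[0])
    row_mean = [sum(row) / w for row in feat]
    col_mean = [sum(feat[i][j] for i in range(h)) / h for j in range(w)]
    return [[feat[i][j] * (row_mean[i] + col_mean[j]) for j in range(w)] for i in range(h)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fuse(z, x, weights=(1.0, 1.0, 1.0)):
    """Eq. (7): multi-scale branches passed through O; Eq. (8): 1x1 weight, sigmoid, sum with x."""
    branches = [conv2d_same(z, box_kernel(3)), conv2d_same(z, box_kernel(5)), shift_right(z)]
    z_bar = [orientation_operator(b) for b in branches]
    fused = [row[:] for row in x]  # copy so that x itself is left untouched
    for weight, feat in zip(weights, z_bar):
        for i in range(len(x)):
            for j in range(len(x[0])):
                fused[i][j] += sigmoid(weight * feat[i][j])
    return fused


# Toy input: a vertical edge as shallow features z, and a constant map as x (assumed data).
z = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
x = [[2.0] * 4 for _ in range(4)]
latent = fuse(z, x)
```

Summation keeps the output the same size as $x$ regardless of how many branches are fused, whereas concatenation would grow the channel dimension; this is one plausible reading of why the sum variant is convenient in Tab.2, not a claim about the authors' implementation.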

3.4 Training Strategy

During the training process, $O^{2}$former is optimized with the $\mathcal{L}_{1}$ loss function and the MSE loss $\mathcal{L}_{mse}$. Given a training dataset, denoted as $\left\{I_{LR}^{i}, I_{HR}^{i}\right\}_{i=1}^{N}$, we aim to solve the following optimization problem:

$$\hat{\theta} = \arg\min_{\theta} \left( \alpha \left[ \frac{1}{N} \sum_{i=1}^{N} \left\| F_{\theta}\left(I_{LR}^{i}\right) - I_{HR}^{i} \right\|_{1} \right] + \beta \mathcal{L}_{mse} \right) \tag{9}$$

In this formulation, $\theta$ stands for the set of parameters that define our proposed $O^{2}$former. The function $F_{\theta}\left(I_{LR}\right) = I_{SR}$ represents the process of reconstructing the SR image from the LR input. $N$ is the total number of images in the training dataset, and $\alpha$ and $\beta$ denote the loss weights. This optimization strategy is designed to minimize the difference between the reconstructed SR image and the original HR image, thereby enhancing the performance of the $O^{2}$former model.
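The weighted objective of Eq.9 can be written out as a short sketch. It assumes that images are flattened lists of pixel values, that both terms are normalized by the pixel count (Eq.9 itself states an $\mathcal{L}_{1}$ norm averaged over the $N$ images), and that $\mathcal{L}_{mse}$ is the mean squared error over the same SR-HR pairs; the default weights are the values reported in the training details. The function names are hypothetical.

```python
def l1_loss(sr, hr):
    """Mean absolute error between one SR image and its HR reference."""
    return sum(abs(s - h) for s, h in zip(sr, hr)) / len(sr)


def mse_loss(sr, hr):
    """Mean squared error between one SR image and its HR reference."""
    return sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(sr)


def total_loss(sr_batch, hr_batch, alpha=1.0, beta=0.1):
    """alpha * mean L1 + beta * mean MSE over the N image pairs, following Eq. (9)."""
    n = len(sr_batch)
    l1 = sum(l1_loss(sr, hr) for sr, hr in zip(sr_batch, hr_batch)) / n
    mse = sum(mse_loss(sr, hr) for sr, hr in zip(sr_batch, hr_batch)) / n
    return alpha * l1 + beta * mse


# One two-pixel "image" with a single wrong pixel (assumed toy data).
sr_images = [[0.5, 0.0]]
hr_images = [[0.0, 0.0]]
loss = total_loss(sr_images, hr_images)  # 1.0 * 0.25 + 0.1 * 0.125
```

With $\alpha = 1$ and $\beta = 0.1$ the $\mathcal{L}_{1}$ term dominates for small errors, while the squared term adds a comparatively stronger penalty on large deviations.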

4 Experiments

Training Dataset: In our study, we utilize a widely recognized dataset, MURA-SR Huang et al. (2022), to generate training pairs. MURA-SR comprises 4,000 musculoskeletal radiographs of the upper extremity. To emulate real-world scenarios with fewer samples, we curate a subset of 500 images from the MURA-SR dataset to form our training set. Representative samples from this training dataset can be viewed in Fig.3, which showcases samples with varying degrees of degradation and downsampling factors. Upon examination of these samples, it becomes evident that complex degradation can result in increased blurring of the edges in LR images, as exemplified by the letters in Source 1310. Furthermore, statistical noise can introduce undesirable artifacts into the bone structures, as can be observed in Source 1410.

Figure 3: Examples of training samples and their corresponding LR images.

Test Dataset: The evaluation of these experiments will encompass both objective comparisons and subjective visual assessments. Further details about the datasets are provided below. For synthetic datasets, we utilize two test datasets: MURA-mini and MURA-plus. Both datasets consist of 100 HR images, but differ in terms of the degradation levels of their respective LR images.

Training details: Our model is trained using a batch size of 32, executed on two TITAN X (Pascal) GPUs. The size of the HR patch used during training is set to 96. For optimization, we utilize the Adam optimizer Kingma and Ba (2014), with a learning rate set at $1e{-}5$. The number of training epochs is established at 1000. $\alpha$ and $\beta$ were set to 1 and 0.1, respectively.

Evaluation Metrics: In alignment with the baseline experimental results, our experiments employ widely accepted full reference evaluation metrics: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) Wang et al. (2004). Concurrently, objective metrics are also assessed using non-reference metrics: the Perceptual Index (PI) Blau et al. (2018) and the Naturalness Image Quality Evaluator (NIQE) Mittal et al. (2012). Quantitative results are evaluated specifically on the luminance channel (Y).
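As a reference for the full reference metric, the following minimal sketch computes PSNR on luminance values. It assumes 8-bit dynamic range and the ITU-R BT.601 luma conversion commonly used in SR evaluation; the paper does not state which conversion is used, and SSIM, NIQE, and PI are not covered here. Images are represented as flat lists, and the function names are hypothetical.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma for R, G, B in [0, 1]; the result lies in [16, 235] (assumed conversion)."""
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized luminance arrays."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one grey level gives MSE = 1 (assumed toy data).
hr_y = [100.0, 120.0, 140.0, 160.0]
sr_y = [101.0, 121.0, 141.0, 161.0]
score = psnr(hr_y, sr_y)
```

Because PSNR is a logarithm of the inverse MSE, the sub-decibel differences reported in the tables correspond to modest relative changes in pixel error.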

Table 1: Ablation studies of the encoder. Average PSNR $\uparrow$ on MURA-mini datasets with scale factor $\times 4$ are shown.

| Scale | Encoder | PSNR/dB $\uparrow$ | Remark |
|---|---|---|---|
| $\times 4$ | CNN (w/o) | 29.66 | Baseline |
| | CNN (w) | 30.06 | |
| | Attention | 29.91 | |
| | Ours | 30.24 | 0.58 dB $\uparrow$ |

Table 2: Ablation experiments of shallow feature fusion methods. Results for different fusion methods in the MURA-mini dataset $\times 4$ are presented.

| Scale | Feature Fusion | PSNR/dB $\uparrow$ | Remark |
|---|---|---|---|
| $\times 4$ | concat all | 30.14 | Baseline |
| | concat all & skip | 30.07 | |
| | concat (conv3 & conv5) | 30.09 | |
| | sum | 30.24 | 0.10 dB $\uparrow$ |

4.1 Ablation Studies

To delve deeper into the performance of the proposed methods, we conduct an ablation study to analyze their impact on model training. Initially, we illustrate the effectiveness of the proposed $O^{2}$former and encoder. Subsequently, we carry out an ablation experiment to examine the influence of the key components of our architectural design: fusion strategies and multi-scale convolution.

Influence of $O^{2}$former & encoder: We investigated the performance of various types of shallow feature encoders used in the SR task for radiographic images, as detailed in Tab.1. Initially, we excluded CNN-based encoders from the model used in this task, resulting in a PSNR score of 29.66 dB. This baseline shows that an SR model using only transformers performs sub-optimally. As previously discussed, this is often attributed to the fact that transformer models excel at capturing long-distance dependent information, such as pair data mapping. The limited receptive field presented by smaller patch data restricts performance, which could be enhanced by employing a CNN-based encoder with a larger receptive field.

Our experiments with the CNN-based encoder revealed that the enhanced model outperformed the baseline, improving objective metrics by up to 0.4 dB. This suggests that the CNN-based encoder is more effective in capturing shallow-level features. We then proposed attention-based CNN improvement methods Behjati et al. (2023) and evaluated the performance of the attention mechanism on this baseline approach. The experiments indicated that attention mechanisms, while effective for normal images, face challenges when applied to radiographic images, with a PSNR evaluation metric of 29.91 dB. These methods primarily focus on mixing pattern features between pixels in the latent space. While the model used for normal images can effectively represent LR-HR pairwise mapping and blur mapping due to the diverse color space and pixel-to-pixel pattern information, this information is more limited in radiographic images, presenting challenges for these attention mechanisms.

Finally, our method achieved the best performance in the ablation experiments, with a PSNR score of 30.24 dB. This represents an improvement of 0.58 dB compared to the baseline model, demonstrating the effectiveness of the $O^{2}$former we incorporated into the encoder for this task.

Influence of feature fusion strategy: In this section, our objective is to investigate the approach of fusing features and the selection of the convolution method. As depicted in Tab.2, we initially attempt to concat feature maps from different convolution methods. This results in a PSNR score of 30.14 dB, indicating an improvement over the baseline that does not rely on feature fusion (refer to Tab.2). The experiments demonstrate that multi-scale feature extraction is advantageous for enhancing model performance. Additionally, we explored the combination of concat with skip connections, which did not yield competitive performance. We also experimented with selectively choosing parts of the convolutional feature maps for concatenation. The results indicate that omitting shallow features from other perceptual fields is detrimental to performance, with a PSNR score of 30.09 dB.

Finally, we intuitively summed the feature maps on a pixel-by-pixel basis to obtain the latent representations. This operation not only fuses multi-scale feature information but also better preserves the spatial semantic information in the orientations between neighboring pixels a priori. As the experimental results demonstrate, the PSNR is improved to 30.24 dB.

Table 3: The results of adding different convolution methods in $O^{2}$former. Average PSNR $\uparrow$ on MURA-mini datasets with scale factor $\times 4$ are shown.

| Scale | Conv. $3 \times 3$ | Conv. $5 \times 5$ | Conv. shift | Skip | PSNR/dB $\uparrow$ | Remark |
|---|---|---|---|---|---|---|
| $\times 4$ | $\surd$ | | | | 29.83 | Baseline |
| | $\surd$ | $\surd$ | | | 29.99 | |
| | $\surd$ | $\surd$ | $\surd$ | | 30.04 | |
| | $\surd$ | $\surd$ | $\surd$ | $\surd$ | 30.24 | 0.41 dB $\uparrow$ |

Influence of convolution methods: As outlined in Tab.3, we built the encoder using convolutional strategies with different perceptual fields. We started with a convolution kernel of size 3, then moved to larger kernels to capture more diverse scale features, improving the PSNR score to 29.99 dB. Incorporating advanced convolution methods Zhang et al. (2022) further enhanced the model's ability to represent shallow features, achieving a score of 30.04 dB. We also introduced a skip connection between the shallow features and the final upsampling stage to transfer more information from the shallow patterns.

The experiments showed that using diverse convolutional approaches allowed the encoder’s latent representation to include multi-scale features. The final PSNR score reached 30.24 dB, an improvement of 0.41 dB over the benchmark experiment.
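The gains quoted in the three ablation tables can be re-derived with a few lines of arithmetic. The sketch below uses only the PSNR values reported in Tab.1 to Tab.3; the conversion from a dB gain to an MSE ratio follows from the definition of PSNR and is exact only for a single image, so for dataset averages it should be read as a rough indication. The names `ablations` and `mse_ratio` are hypothetical.

```python
# (baseline PSNR, best PSNR) in dB, as reported in the ablation tables.
ablations = {
    "encoder (Table 1)": (29.66, 30.24),
    "feature fusion (Table 2)": (30.14, 30.24),
    "convolution methods (Table 3)": (29.83, 30.24),
}


def mse_ratio(gain_db):
    """MSE after / MSE before implied by a PSNR gain, since PSNR = 10 * log10(MAX^2 / MSE)."""
    return 10.0 ** (-gain_db / 10.0)


gains = {name: round(best - base, 2) for name, (base, best) in ablations.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB, MSE ratio ~ {mse_ratio(gain):.3f}")
```

Read this way, the 0.58 dB encoder gain corresponds to roughly 12.5% lower MSE, which puts the sub-decibel differences between variants into perspective.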

4.2 Results

Table 4: PSNR $\uparrow$ and SSIM $\uparrow$ results of different methods on MURA-mini (mini) & MURA-plus (plus) with scale factors of 4 & 2. Non-reference metrics: NIQE $\downarrow$ and PI $\downarrow$ were also used to evaluate image quality, respectively. There are two test datasets with the same HR images but different degraded LR images (which are more damaged). The red and blue indicate the best and the second-best performance, respectively.

| Scale | Dataset | Metric | Ours | SAFMN | Shu | OverNet | FSRCNN | RCAN | SRCNN | Bic |
|---|---|---|---|---|---|---|---|---|---|---|
| $\times 4$ | mini | PSNR | 30.24 | 30.23 | 30.09 | 30.02 | 29.98 | 29.70 | 28.86 | 26.48 |
| | | SSIM | 0.9484 | 0.9504 | 0.9506 | 0.9503 | 0.9508 | 0.9493 | 0.9426 | 0.9362 |
| | | NIQE | 8.3320 | 9.0689 | 8.6955 | 8.9148 | 9.0839 | 9.0024 | 9.5310 | 8.9988 |
| | | PI | 7.6738 | - | 7.9935 | 8.0562 | 8.2511 | 8.1178 | 8.6483 | 8.1202 |
| | plus | PSNR | 30.55 | 30.57 | 30.35 | 30.22 | 30.22 | 29.89 | 28.99 | 26.80 |
| | | SSIM | 0.9417 | 0.9433 | 0.9434 | 0.9432 | 0.9437 | 0.9422 | 0.9375 | 0.9314 |
| | | NIQE | 8.7672 | 9.5649 | 8.4588 | 9.3380 | 9.5755 | 9.4267 | 10.1283 | 9.2759 |
| | | PI | 8.0226 | - | 9.4680 | 8.3886 | 8.5966 | 8.4340 | 9.0154 | 8.3265 |
| $\times 2$ | mini | PSNR | 31.35 | 31.05 | 31.27 | 29.87 | 31.30 | 31.32 | 30.17 | 27.60 |
| | | SSIM | 0.9372 | 0.9414 | 0.9442 | 0.9439 | 0.9434 | 0.9425 | 0.9433 | 0.9405 |
| | | NIQE | 7.8368 | 8.5484 | 8.3353 | 7.7891 | 8.1240 | 8.0038 | 7.8758 | 7.6566 |
| | | PI | 7.1446 | - | 7.4418 | 7.3260 | 7.3324 | 7.2628 | 7.2917 | 6.9207 |
| | plus | PSNR | 31.85 | 31.72 | 31.84 | 30.11 | 31.97 | 32.01 | 30.48 | 27.66 |
| | | SSIM | 0.9266 | 0.9316 | 0.9356 | 0.9358 | 0.9338 | 0.9329 | 0.9347 | 0.9324 |
| | | NIQE | 8.2203 | 8.7294 | 8.1982 | 8.1563 | 8.4326 | 8.2614 | 8.1968 | 7.8381 |
| | | PI | 7.3677 | - | 7.4858 | 7.5938 | 7.5277 | 7.4645 | 7.5491 | 7.1892 |

Quantitative results: We compare our method with the state-of-the-art Shu Sun et al. (2022) and SAFMN Sun et al. (2023), as well as other methods (OverNet Behjati et al. (2021), FSRCNN Dong et al. (2016), RCAN Zhang et al. (2018), SRCNN Dong et al. (2014), Bic), on both the MURA-mini and MURA-plus datasets. In terms of PSNR, a key indicator of image fidelity, our method achieves the best or second-best result at the $\times 4$ scaling factor (see Tab.4). Specifically, for the $\times 4$ scaling factor on the MURA-mini dataset, our method achieves a PSNR of 30.24, surpassing Shu's 30.09. On the MURA-plus dataset, our method scores 30.55 in PSNR, again outperforming Shu's 30.35 and trailing only SAFMN's 30.57. Furthermore, our method excels in the NIQE and PI metrics, which measure image naturalness and perceptual quality respectively. On the MURA-mini dataset with the $\times 4$ scaling factor, our method achieves the best NIQE and PI scores of 8.3320 and 7.6738, respectively, indicating superior performance in terms of image naturalness and perceptual quality. When the scaling factor is reduced to $\times 2$, our method continues to perform well on the MURA-mini dataset, achieving the highest PSNR score of 31.35. This demonstrates the robustness of our method across different scaling factors. However, it is worth noting that the Bic method achieves better performance in certain metrics. This is particularly evident in the PI on the MURA-mini dataset with the $\times 2$ scaling factor, where Bic scores the best with 6.9207. These metrics were originally designed for normal images, and the fact that Bic performs well on these metrics suggests it may be preserving certain aspects of the image that are particularly valued in normal images. To further investigate these differences, we conduct a qualitative analysis, visualizing the detail and texture information in the images produced by these methods (see Fig.4). This provides a more comprehensive understanding of our method compared to others.
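The rankings discussed above can be checked directly against the numbers in Table 4. The sketch below copies a subset of the reported values (PSNR for all four settings and NIQE for $\times 4$ MURA-mini) and computes the rank of one method per setting; the container `table4` and the helper `rank_of` are hypothetical names introduced for illustration.

```python
methods = ["Ours", "SAFMN", "Shu", "OverNet", "FSRCNN", "RCAN", "SRCNN", "Bic"]

# (scale, dataset, metric) -> values in the column order of `methods`, copied from Table 4.
table4 = {
    ("x4", "mini", "PSNR"): [30.24, 30.23, 30.09, 30.02, 29.98, 29.70, 28.86, 26.48],
    ("x4", "plus", "PSNR"): [30.55, 30.57, 30.35, 30.22, 30.22, 29.89, 28.99, 26.80],
    ("x4", "mini", "NIQE"): [8.3320, 9.0689, 8.6955, 8.9148, 9.0839, 9.0024, 9.5310, 8.9988],
    ("x2", "mini", "PSNR"): [31.35, 31.05, 31.27, 29.87, 31.30, 31.32, 30.17, 27.60],
    ("x2", "plus", "PSNR"): [31.85, 31.72, 31.84, 30.11, 31.97, 32.01, 30.48, 27.66],
}

# PSNR and SSIM are better when higher; NIQE and PI are better when lower.
HIGHER_IS_BETTER = {"PSNR": True, "SSIM": True, "NIQE": False, "PI": False}


def rank_of(method, key):
    """1-based rank of `method` for one (scale, dataset, metric) row; ties share a rank."""
    values = table4[key]
    target = values[methods.index(method)]
    if HIGHER_IS_BETTER[key[2]]:
        return 1 + sum(1 for v in values if v > target)
    return 1 + sum(1 for v in values if v < target)


ranks = {key: rank_of("Ours", key) for key in table4}
for (scale, dataset, metric), rank in ranks.items():
    print(f"{scale} {dataset} {metric}: rank {rank} of {len(methods)}")
```

The PI rows are left out of this sketch because the SAFMN entry is not reported, which would require an explicit policy for missing values.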

Qualitative results: In this section, we present visualization results for the qualitative experiments. First, to address the concerns raised in the quantitative results about the reconstructions from the Bic method, we use the Histogram of Oriented Gradient (HOG) operator Dalal and Triggs (2005) to visualize the detailed texture information in the gradient domain. Gradient operators are used following previous work in Huang et al. (2021). As shown in Fig.4, our method provides a stronger heat map response compared to Bic and Shu. In the $\times 2$ upsampling factor comparison, although the image reconstructed by the Bic method achieves higher scores, the $O^{2}$former's detail recovery is more competitive and reliable in the human subjective visual evaluation. Compared with other existing methods, our proposed technique improves the visualization by markedly enhancing the resolution and discernibility between bone structures (see Fig.4, Ref. HR: 0003) and the contiguous muscular soft tissue (see Fig.4, Ref. HR: 0050). When benchmarked against established models such as Bic and Shu, it demonstrates comparable, if not superior, performance in terms of discernibility, whilst preserving a gradient distribution that mirrors those of high-resolution images. This evidence underscores the robust capability of our method in reconstructing textures and subtle variations inherent in real-world images, without compromising the retention of intricate details.
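To make the gradient-domain comparison more concrete, the following sketch computes the core ingredient of HOG for a single cell: a magnitude-weighted histogram of unsigned gradient orientations. It is a simplified illustration under stated assumptions (central differences, nine bins over 0 to 180 degrees, no cell grid and no block normalization), not the visualization pipeline used for Fig.4; `orientation_histogram` is a hypothetical name.

```python
import math


def orientation_histogram(img, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for interior pixels."""
    h, w = len(img), len(img[0])
    bin_width = 180.0 / bins
    hist = [0.0] * bins
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = img[i][j + 1] - img[i][j - 1]  # central difference along x
            gy = img[i + 1][j] - img[i - 1][j]  # central difference along y
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / bin_width), bins - 1)] += magnitude
    return hist


# A vertical edge: intensity changes only along x, so the gradient points at 0 degrees.
vertical_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
# A horizontal edge: intensity changes only along y, so the gradient points at 90 degrees.
horizontal_edge = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]

hist_v = orientation_histogram(vertical_edge)
hist_h = orientation_histogram(horizontal_edge)
```

A reconstruction that recovers more fine texture tends to spread more gradient energy across the orientation bins, which is one way to interpret the richer heat map patterns in the gradient visualization.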

Secondly, we provide visual comparisons with other SISR methods in Fig. 5-6. For a more nuanced qualitative analysis, we focus on the highlighted regions. It’s crucial to note that the PSNR/SSIM metrics are derived from the entire input image. Observations indicate that our model excels in image restoration.

Figure 4: Gradient visual comparisons of our model and others (Bic & Shu). The streak artifacts can be easily captured in the gradient domain. Sample data are from the $\times 2$ MURA-mini dataset, and the colorful pattern in the heat map indicates the diversity of detailed textures.
Figure 5: Visual comparison achieved on radiology images for $\times 4$ SR.
Figure 6: Visual comparison achieved on radiology images for $\times 2$ SR.

As illustrated in Fig.5, the $O^{2}$former stands to offer clinicians a more precise insight into aspects such as bone mineral density (Ref. MURA-mini HR: No.0002) and bone and joint structure (Ref. MURA-plus HR: No.0026). These enhancements are instrumental for the early identification of changes in bone mineral density (see Fig.6 Ref. MURA-mini HR: No.0013), subtle fractures, and irregularities in joint structures (see Fig.6 Ref. MURA-mini HR: No.0050). Based on the above advantages, our model shows strong promise for clinical application, providing reliable image analysis results for improving early diagnosis and treatment.

5 Conclusion

In this study, we proposed a novel approach: $O^{2}$former for radiographic images, with a focus on developing a specific pipeline that can more effectively learn denoising mapping. Given the limited color space and inter-pixel patterns in radiographic images, they cannot be directly compared to normal images. Normal images can more readily learn SR and denoising mapping from inherent patterns, thus radiographic images require enhanced latent representations to aid the decoder's learning process. To achieve this, we introduced a novel orientation operator in the encoder to incorporate the orientation prior and boost the sensitivity to denoising mapping. Additionally, we proposed a multi-scale feature fusion strategy to combine features captured by different receptive fields with the directional prior, thereby providing a superior latent representation for the decoder. Ultimately, we proposed a transformer-based SR model, that is $O^{2}$former, for radiographic images, built upon these two innovative components. Our experimental results demonstrate that our approach outperforms competitors in both qualitative and quantitative comparisons.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP23KJ0118.

References

  • Vives (2006) M. J. Vives, Orthopedic imaging: A practical approach, The Journal of Spinal Cord Medicine 29 (2006) 173.
  • Chen et al. (2013) H. Chen, X. Zhou, H. Fujita, M. Onozuka, K.-Y. Kubo, Age-related changes in trabecular and cortical bone microstructure, International journal of endocrinology 2013 (2013).
  • Mc Donnell et al. (2007) P. Mc Donnell, P. Mc Hugh, D. O’mahoney, Vertebral osteoporosis and trabecular bone quality, Annals of biomedical engineering 35 (2007) 170–189.
  • Zhou et al. (2015) D. Zhou, C. Lebel, S. Treit, A. Evans, C. Beaulieu, Accelerated longitudinal cortical thinning in adolescence, Neuroimage 104 (2015) 138–145.
  • Turlington (2003) B. Turlington, The radiology of emergency medicine, Chest 123 (2003) 658.
  • Adepu et al. (2022) S. Adepu, S. Ekman, J. Leth, U. Johansson, A. Lindahl, E. Skiöldebrand, Biglycan neo-epitope (bgn262), a novel biomarker for screening early changes in equine osteoarthritic subchondral bone, Osteoarthritis and Cartilage 30 (2022) 1328–1336.
  • Ying et al. (2023) J. Ying, P. Wang, Z. Shi, J. Xu, Q. Ge, Q. Sun, W. Wang, J. Li, C. Wu, P. Tong, et al., Inflammation-mediated aberrant glucose metabolism in subchondral bone induces osteoarthritis, Stem Cells 41 (2023) 482–492.
  • Miyamoto et al. (2007) M. I. Miyamoto, S. L. Vernotico, H. Majmundar, G. S. Thomas, Pharmacologic stress myocardial perfusion imaging: a practical approach, Journal of nuclear cardiology 14 (2007) 250–255.
  • Hu et al. (2022) L. Hu, R. Liu, L. Zhang, Advance in bone destruction participated by jak/stat in rheumatoid arthritis and therapeutic effect of jak/stat inhibitors, International Immunopharmacology 111 (2022) 109095.
  • Shen and Lv (2022) Y. Shen, Y. Lv, Dual targeted zeolitic imidazolate framework nanoparticles for treating metastatic breast cancer and inhibiting bone destruction, Colloids and Surfaces B: Biointerfaces 219 (2022) 112826.
  • Shin et al. (2023) M. Shin, Z. Peng, H.-J. Kim, S.-S. Yoo, K. Yoon, Multivariable-incorporating super-resolution residual network for transcranial focused ultrasound simulation, Computer Methods and Programs in Biomedicine 237 (2023) 107591.
  • Qiu et al. (2022) D. Qiu, Y. Cheng, X. Wang, Improved generative adversarial network for retinal image super-resolution, Computer Methods and Programs in Biomedicine 225 (2022) 106995.
  • Zhu et al. (2023) D. Zhu, H. He, D. Wang, Feedback attention network for cardiac magnetic resonance imaging super-resolution, Computer Methods and Programs in Biomedicine 231 (2023) 107313.
  • Huang et al. (2023) Y. Huang, W. Xie, M. Li, E. Xiao, J. You, X. Liu, Source-free domain adaptive segmentation with class-balanced complementary self-training, Artificial Intelligence in Medicine 146 (2023) 102694.
  • Wang et al. (2020) Z. Wang, J. Chen, S. C. Hoi, Deep learning for image super-resolution: A survey, IEEE transactions on pattern analysis and machine intelligence 43 (2020) 3365–3387.
  • Park et al. (2003) S. C. Park, M. K. Park, M. G. Kang, Super-resolution image reconstruction: a technical overview, IEEE signal processing magazine 20 (2003) 21–36.
  • Chen et al. (2022) H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, C. Zhu, Real-world single image super-resolution: A brief review, Information Fusion 79 (2022) 124–145.
  • Huang et al. (2022) Y. Huang, T. Miyazaki, X. Liu, S. Omachi, Infrared image super-resolution: Systematic review, and future trends, arXiv preprint arXiv:2212.12322 (2022).
  • Dong et al. (2014) C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, Springer, 2014, pp. 184–199.
  • Kim et al. (2016) J. Kim, J. K. Lee, K. M. Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1646–1654.
  • Jiang et al. (2021) Z. Jiang, K. Pi, Y. Huang, Y. Qian, S. Zhang, Difference value network for image super-resolution, IEEE Signal Processing Letters 28 (2021) 1070–1074.
  • Zhang et al. (2018) Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 286–301.
  • Behjati et al. (2021) P. Behjati, P. Rodriguez, A. Mehri, I. Hupont, C. F. Tena, J. Gonzalez, Overnet: Lightweight multi-scale super-resolution with overscaling network, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2694–2703.
  • Goodfellow et al. (2020) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020) 139–144.
  • Ledig et al. (2017) C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
  • Huang et al. (2021) Y. Huang, Z. Jiang, R. Lan, S. Zhang, K. Pi, Infrared image super-resolution via transfer learning and psrgan, IEEE Signal Processing Letters 28 (2021) 982–986.
  • Wang et al. (2018) X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, C. Change Loy, Esrgan: Enhanced super-resolution generative adversarial networks, in: Proceedings of the European conference on computer vision (ECCV) workshops, 2018, pp. 0–0.
  • Wang et al. (2021) X. Wang, L. Xie, C. Dong, Y. Shan, Real-esrgan: Training real-world blind super-resolution with pure synthetic data, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1905–1914.
  • Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of wasserstein gans, Advances in neural information processing systems 30 (2017).
  • Yang et al. (2020) F. Yang, H. Yang, J. Fu, H. Lu, B. Guo, Learning texture transformer network for image super-resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5791–5800.
  • Lu et al. (2022) Z. Lu, J. Li, H. Liu, C. Huang, L. Zhang, T. Zeng, Transformer for single image super-resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 457–466.
  • Gao et al. (2022) G. Gao, Z. Wang, J. Li, W. Li, Y. Yu, T. Zeng, Lightweight bimodal network for single-image super-resolution via symmetric cnn and recursive transformer, arXiv preprint arXiv:2204.13286 (2022).
  • Qiu et al. (2023) D. Qiu, Y. Cheng, X. Wang, Medical image super-resolution reconstruction algorithms based on deep learning: A survey, Computer Methods and Programs in Biomedicine (2023) 107590.
  • Qiu et al. (2022) D. Qiu, Y. Cheng, X. Wang, Dual u-net residual networks for cardiac magnetic resonance images super-resolution, Computer Methods and Programs in Biomedicine 218 (2022) 106707.
  • Zhu and Qiu (2021) D. Zhu, D. Qiu, Residual dense network for medical magnetic resonance images super-resolution, Computer Methods and Programs in Biomedicine 209 (2021) 106330.
  • Dong et al. (2015) C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE transactions on pattern analysis and machine intelligence 38 (2015) 295–307.
  • Huang et al. (2021) Y. Huang, Z. Jiang, Q. Wang, Q. Jiang, G. Pang, Infrared image super-resolution via heterogeneous convolutional wgan, in: PRICAI 2021: Trends in Artificial Intelligence: 18th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2021, Hanoi, Vietnam, November 8–12, 2021, Proceedings, Part II 18, Springer, 2021, pp. 461–472.
  • Huang et al. (2023) Y. Huang, T. Miyazaki, X. Liu, Y. Dong, S. Omachi, Target-oriented domain adaptation for infrared image super-resolution, arXiv preprint arXiv:2311.08816 (2023).
  • Han et al. (2020) K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., A survey on visual transformer, arXiv preprint arXiv:2012.12556 2 (2020).
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
  • He et al. (2023) Z. He, D. Chen, Y. Cao, J. Yang, Y. Cao, X. Li, S. Tang, Y. Zhuang, Z.-m. Lu, Single image super-resolution based on progressive fusion of orientation-aware features, Pattern Recognition 133 (2023) 109038.
  • Lin et al. (2023) Y. Lin, D. Zhang, X. Fang, Y. Chen, K.-T. Cheng, H. Chen, Rethinking boundary detection in deep learning models for medical image segmentation, in: International Conference on Information Processing in Medical Imaging, Springer, 2023, pp. 730–742.
  • Vo and Kim (2023) V. T.-T. Vo, S.-H. Kim, Mulvernet: Nucleus segmentation and classification of pathology images using the hover-net and multiple filter units, Electronics 12 (2023) 355.
  • Dogar et al. (2023) G. M. Dogar, M. Shahzad, M. M. Fraz, Attention augmented distance regression and classification network for nuclei instance segmentation and type classification in histology images, Biomedical Signal Processing and Control 79 (2023) 104199.
  • Tsai et al. (2022) F.-J. Tsai, Y.-T. Peng, Y.-Y. Lin, C.-C. Tsai, C.-W. Lin, Stripformer: Strip transformer for fast image deblurring, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIX, Springer, 2022, pp. 146–162.
  • Sun et al. (2015) J. Sun, W. Cao, Z. Xu, J. Ponce, Learning a convolutional neural network for non-uniform motion blur removal, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 769–777.
  • Huang et al. (2023) Y. Huang, W. Xie, M. Li, M. Cheng, J. Wu, W. Wang, J. You, X. Liu, Vicinal feature statistics augmentation for federated 3D medical volume segmentation, in: International Conference on Information Processing in Medical Imaging, Springer, 2023, pp. 360–371.
  • Zhang et al. (2018) Y. Zhang, Y. Tian, Y. Kong, B. Zhong, Y. Fu, Residual dense network for image super-resolution, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2472–2481.
  • Li et al. (2018) J. Li, F. Fang, K. Mei, G. Zhang, Multi-scale residual network for image super-resolution, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 517–532.
  • Tong et al. (2017) T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 4799–4807.
  • Zhang et al. (2022) X. Zhang, H. Zeng, S. Guo, L. Zhang, Efficient long-range attention network for image super-resolution, in: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII, Springer, 2022, pp. 649–667.
  • Huang et al. (2022) Y. Huang, Q. Wang, S. Omachi, Rethinking degradation: Radiograph super-resolution via AID-SRGAN, in: Machine Learning in Medical Imaging: 13th International Workshop, MLMI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18, 2022, Proceedings, Springer, 2022, pp. 43–52.
  • Kingma and Ba (2014) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
  • Wang et al. (2004) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (2004) 600–612.
  • Blau et al. (2018) Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, L. Zelnik-Manor, The 2018 PIRM challenge on perceptual image super-resolution, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
  • Mittal et al. (2012) A. Mittal, R. Soundararajan, A. C. Bovik, Making a “completely blind” image quality analyzer, IEEE Signal processing letters 20 (2012) 209–212.
  • Behjati et al. (2023) P. Behjati, P. Rodriguez, C. Fernández, I. Hupont, A. Mehri, J. Gonzàlez, Single image super-resolution based on directional variance attention network, Pattern Recognition 133 (2023) 108997.
  • Sun et al. (2022) L. Sun, J. Pan, J. Tang, ShuffleMixer: An efficient ConvNet for image super-resolution, arXiv preprint arXiv:2205.15175 (2022).
  • Sun et al. (2023) L. Sun, J. Dong, J. Tang, J. Pan, Spatially-adaptive feature modulation for efficient image super-resolution, arXiv preprint arXiv:2302.13800 (2023).
  • Dong et al. (2016) C. Dong, C. C. Loy, X. Tang, Accelerating the super-resolution convolutional neural network, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, Springer, 2016, pp. 391–407.
  • Dalal and Triggs (2005) N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), volume 1, IEEE, 2005, pp. 886–893.
  • Huang et al. (2021) Z. Huang, J. Zhang, Y. Zhang, H. Shan, DU-GAN: Generative adversarial networks with dual-domain U-Net-based discriminators for low-dose CT denoising, IEEE Transactions on Instrumentation and Measurement 71 (2021) 1–12.