Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

High resolution (HR) images are in demand for cardiac magnetic resonance imaging (CMRI). However, HR CMRI images are often costly to acquire. In particular, the quality and resolution of CMRI images can be limited dramatically by low signal to noise ratio, cardiac and respiratory motion, and restricted scanning time due to patient symptoms etc. In consequence, noise corrupted and low resolution (LR) images are produced that may reduce the visibility of vital pathological details and potentially compromise the clinical outcomes.

Instead of optimizing hardware settings (e.g., improved cardiac gating and respiratory navigation) and reducing scanning time (e.g., accelerated by parallel imaging and compressed sensing), image super-resolution (SR) provides an alternative solution to boost the perceptual quality of CMRI images in terms of the spatial resolution enhancement. Essentially, the goal of SR methods is to recover a HR image from a single or multiple LR images [1]. Comprehensive reviews on various SR methods can be found elsewhere [2, 3], and here we briefly review the most relevant publications. Existing SR algorithms can be broadly categorized into three classes including interpolation based, reconstruction based, and learning or example based methods [4]. Widely used interpolation based SR methods, which assume that images are spatially smooth, typically result in overly smoothed edges with ringing and jagged artifacts [4, 5]. Reconstruction based algorithms, which solve an inverse problem by recovering the HR image by fusing multiple LR images, are time-consuming and limited to an upsampling factor of two [4–6]. For learning or example based methods, the mapping function between LR and HR images (or their patches) is learned from a representative set of training image pairs. Once the mapping function is learned, it is applied to a single testing image to achieve SR. Despite numerous learning or example based methods (e.g., [7–10]) claiming success for single-image SR, these methods hinge on the availability of training data of LR and HR image pairs; therefore, we define them as pseudo-single-image SR.

Compared to pseudo-single-image SR methods, Glasner et al. [1] proposed a self-similarity based single-image SR method using a training dataset that is directly established from the LR input, by exploiting patch redundancy among in-scale and cross-scale images in an image pyramid to enforce constraints for recovering the unknown HR image [4]. In this method, no extrinsic large training dataset is required as a priori, but the abundance of self-similar patches is crucial for successful SR recovery. Instead of searching for similar image patches, Yang and Wang [11] used support vector regression (SVR) to learn SR models from patches at different image scales. They also applied a sparse representation to extract effective image features for SVR to make their SR algorithm more computationally feasible for real applications, following the application of sparse representation originally proposed for solving the SR problem by Yang et al. [12]. More recently, Singh and Ahuja [13] added sub-band energy constraints for the self-similarity based SR method, and Huang et al. [14] enriched the dataset of self-similar patches by looking at their transformed exemplars. Although these studies have demonstrated promising results, the SR methods were applied on natural images, which are relatively clean. Therefore, the performance of these methods for noisy inputs, e.g., CMRI images, hasn’t been demonstrated.

Fig. 1.
figure 1

Flowchart of our approach including both learning and prediction.

In this paper, we propose an SVR based SR method that learns a sparse representation as robust and effective features across various down-sampled and denoised versions (using dual-tree complex wavelet transform, i.e., DTCWT) of a single input LR image. Once the SVR has been trained, we can apply the best model to predict the final SR image from a single LR image (Fig. 1). Compared to the previous learning based methods [7–9], our method does not require construction of paired LR and HR training datasets. Compared to Glasner et al. [1] (Fig. 2(a)) and more recent work by Singh and Ahuja [13], our method does not assume self-similarity of image patches. In contrast to the SR methods [11, 15], in which SVR was originally applied, we use DTCWT to suppress the noise, and we hypothesize that our method is more applicable for medical imaging applications [16], in particular for the CMRI images.

2 Method

The overall workflow of our method is summarized in Fig. 1, and details of each step are described below.

Fig. 2.
figure 2

Schematic diagram of (a) conventional self-similarity based SR method and (b) our self-learning based SR approach. Panel (a): In step 1, the input LR image \(\mathrm {I}_\mathrm {LR}\) (yellow) is downsampled by a factor of two recursively into several lower levels, and here only one downsampled level (\(\mathrm {I}_\mathrm {LR}^{-1}\)) is shown for illustration purposes. Then k nearest neighbors of the example patch (brown patch \(\mathrm {P}_\mathrm {LR}\) in \(\mathrm {I}_\mathrm {LR}\)) are located in the downsampled level. For simplicity, we show the case where the most similar patch (red patch in \(\mathrm {I}_\mathrm {LR}^{-1}\)) is used (i.e., \(k=1\)). In step 2, the corresponding patch at the same location (red patch in \(\mathrm {I}_\mathrm {LR}\)) is used as the HR predictor (notice the larger size). In step 3, this patch is copied and pasted in the HR image (\(\mathrm {I}_\mathrm {HR}\)). In the case of \(k>1\), multiple predictors are averaged and then pasted; Panel (b): Four major steps of our method and details are described in Sects. 2.1–2.3 Instead of downsampling using bicubic interpolation, DTCWT based denoising and reconstruction was applied to construct the image pyramid. (Color figure online)

2.1 Denoising and Pyramid Construction Using DTCWT

Most previous self-similarity or self-learning methods (e.g., [1, 11, 13]) used bicubic interpolation to construct the image pyramid, but this can be susceptible to noise. Intuitively, discrete wavelet transform (DWT) based denoising has an intrinsic multiscale structure of decomposition and reconstruction, and therefore can be suitable for the pyramid construction task while effectively suppressing the noise. In addition, the dual-tree complex wavelet transform (DTCWT) overcomes two shortcomings of the widely used conventional DWT: lack of shift-invariance and lack of directional selectivity [17]. Regarding memory usage, the DTCWT is an over-complete transform with a redundancy factor of four for 2D image processing while DWT is non-redundant. An alternative method to obtain the orientation selective sub-bands is using steerable pyramid decomposition [13], which is over-complete by a factor of \(\text {4k/3}\) (where k denotes the number of orientation sub-bands) [18]. Regarding computational complexity, the DTCWT is linear \(\mathcal {O}(\text {N})\) where N is the number of input pixels. In contrast, the steerable pyramid decomposition has a complexity of \(\mathcal {O}(\text {N}\log {}\text {N})\). By considering both memory and computational complexity, instead of using steerable pyramid decomposition to overcome limitations of DWT, we decomposed the original image using DTCWT. To accomplish the denoising and image pyramid construction, each level of the pyramid was reconstructed using a soft-thresholding scheme [17], which we used to train the SVR (step 1 in Fig. 2(b)).

2.2 Extracting Features Using Image Sparse Representation

Directly working on pixels can be time-consuming and infeasible for real clinical applications. In this study, a sparse image representation [19] was learned, representing robust and effective features for SVR that can be formulated as an optimization problem, \(\min \frac{1}{n} \sum _{i=1}^n || D\alpha _i - \mathrm {x}_i ||_2^2 + \lambda || \alpha _i ||_1, i = 1, \ldots , n\), in which \(\mathrm {x}_i\) is the vectorized image patch (patch size was set to \(P=5\times 5\) in this study), and is denoted as \(\mathrm {x} = \mathrm {vectorize}(\mathrm {P}_\mathrm {LR}) - \text {mean}(\mathrm {P}_\mathrm {LR}) \in \mathbb {R}^{P\times 1}\), and the subtraction of the mean value of each LR patch (\(\text {mean}(\mathrm {P}_\mathrm {LR})\)) indicates the learning of local pixel value variations instead of learning absolute pixel values. Here n is the number of training patches and D is the over-complete dictionary to be learned, \(\alpha _i\) is the corresponding sparse coefficient vector, and \(\lambda \) is the Lagrange multiplier term that regularizes the model sparsity (\(l_1\)-norm term) and the \(l_2\)-norm based residual. According to previous research [11], better SR performance can be achieved by patch categorization, i.e., clustering patches into low and high spatial frequency ones and learning their dictionaries separately. We applied a Sobel filter in this study, such that any patch including an edge derived from the Sobel filter is considered to have high spatial frequency. In doing this, two dictionaries (\(D_l\) and \(D_h\)) for low and high spatial frequency patches have been learned separately. Accordingly, we used the corresponding sparse coefficient vectors \(\alpha _l\) and \(\alpha _h\) as the features for the SVR models (step 2 in Fig. 2(b)).

2.3 Support Vector Regression

Finally, support vector regression (in this study we used \(\nu \)-SVR) [20], which can fit the data in a high dimensional feature space without assuming the data distribution, was applied to model the relationship between the input sparse coefficient vector \(\alpha \) and the associated SR pixel value. In the training procedure, the SVR solves the following optimization problem,

figure a

in which \(y_i\) denotes the associated target variable, i.e., pixel value at the center of the patch considered in the HR image (\(\text {mean}(\mathrm {P}_\mathrm {LR})\)) is also subtracted from each corresponding \(y_i\)). \(\varPhi (\alpha _i)\) is the sparse image patch features in the transformed space; thus, \(\varPhi (\alpha _i)\) and \(y_i\) form the feature and target variable pairs for training. And \(\omega \) is the norm vector of the nonlinear mapping function to be learned. In addition, \(\rho \) is the tradeoff between the generalization and the upper and lower bounds of training errors \(\xi _i\) and \(\xi _i^*\) subject to a margin of \(\epsilon \), and \(\nu \) controls the amount of support vectors in the resulting model. \(K(\alpha _i, \alpha _j)\equiv \varPhi (\alpha _i)^T\varPhi (\alpha _j)\) is called the kernel function. In this study, we used a nonlinear Gaussian Radial Basis Function (RBF) kernel \(K(\alpha _i, \alpha _j)=\mathrm {exp}(-\gamma ||\alpha _i-\alpha _j||^2)\) with scaling-factor, \(\gamma >0\), to map feature vectors into a nonlinear transformed feature space. SVR parameters (\(\rho \) and \(\gamma \)) were estimated using grid search with cross-validation. Furthermore, SVR model selection was achieved according to a minimum-error-rate classification rule based on Bayesian decision theory [11], and the trained best SVR model was applied to predict the final SR output for a test LR input (steps 3 and 4 in Fig. 2(b)).

2.4 Experimental Settings and Performance Measure

First, ex-vivo swine heart MR imaging data were acquired on a Siemens Skyra 3T MRI system (Erlangen, Germany). The subject was scanned using a 4-element receive coil setup and a 3D fast low-angle shot sequence (3D FLASH, TR/TE \(=\) 50 ms/5.36 ms, flip angle \(=\) 35 \(^{\circ }\), voxel resolution \(=\) 0.6 mm\(^3\), and reconstructed into 0.3 mm\(^3\) isotropically). Second, in-vivo patient data were acquired on a Siemens Avanto 1.5T MRI system. Ten subjects were scanned using a navigator-gated inversion-prepared 3D segmented gradient echo sequence with late gadolinium enhancement (TR/TE \(=\) 5.2 ms/2.3 ms, flip angle \(=\) 20 \(^{\circ }\), voxel resolution \(=\) 1.5\(\,\times \,\)1.5\(\,\times \,\)4 mm\(^3\), and reconstructed into 0.7\(\,\times \,\)0.7\(\,\times \,\)2 mm\(^3\)) [21].

For the ex-vivo data, we extracted the central slice from the axial and saggital views respectively in order to test the efficacy of our method on both in-plane and out-of-plane slices. The original resolution is 400\(\,\times \,\)400 pixels for both slices, and used as HR ground truth for SR results evaluation. Inputs were synthesized by downsampling (by a factor of 2 and 4 denoted as x2 and x4 resulted in 200\(\,\times \,\)200 and 100\(\,\times \,\)100 inputs respectively), and degrading (by various additive noises [(\(\sigma \) = 10, 15, 20, and 25) and run 10 times for each noise level]). Evaluations have been done qualitatively by visual inspection and quantitatively using peak signal to noise ratio (PSNR) and mean squared error (MSE) against the HR ground truth. The performance of our method was compared with bicubic interpolation with standard low-pass Gaussian filter based denoising and the self-learning based SR approach (SLSR) proposed in [11]. For the in-vivo data, we randomly chose 40 ROIs (each of 200\(\,\times \,\)200 pixels) from 10 patient cases. For these 40 noise corrupted LR slices, we have no HR ground truth available. In addition to visual inspection of x2 SR, we compared the line profiles extracted from the results of three SR methods, namely, bicubic (with standard low-pass Gaussian filter based denoising), SLSR, and ours, and estimated the remaining noise in the SR results using the method described in [22].

3 Results

Figure 3(a) shows the synthesized LR input by a downsampling factor of two. The added noise (\(\sigma =15\)) is shown in Fig. 3(b). Figure 3(c) is the zoomed-in ROI of the HR ground truth. Compared to the results of using bicubic interpolation and SLSR method (Fig. 3(d) and (e)), our SR result (Fig. 3(f)) is much more homogeneous in the myocardium region. In addition, green arrows pointed at three example fine structures that are much clearer in our SR result. Figure 3(g)–(i) are the results recovered from the synthesized LR input by an x4 downsampling with additive noise (\(\sigma =15\)). Pink arrows pointed at textures that have been preserved using our SR method (Fig. 3(i)) while missing in the bicubic interpolation and SLSR method (Fig. 3(g) and (h)). Tables 1 and 2 show the PSNR and MSE with various additive noises, and the standard deviations were calculated by running the random noise simulation 10 times at each noise level (\(\sigma =10,15,20,\mathrm {~and~}25\)). In addition, our SR results (both PSNR and MSE in Fig. 4) showed significant improvement compared to the results of bicubic interpolation and SLSR (statistical significances were given by two-sample Wilcoxon rank-sum test between the results of each two SR methods with significance level of \(p<0.05\)). There is no significant difference between bicubic interpolation and SLSR when the noise level is high (e.g., when \(\sigma =25\)).

Fig. 3.
figure 3

(a) Synthesized input LR image (sagittal view of a swine heart MRI slice) that is downsampled (x2) and noise corrupted (red box showing the ROI; contrast has been altered for better visualization of the background noise); (b) Random noise (\(\sigma =15\)) added to the original HR image; (c) ROI of the original HR image without blurring or noise (ground truth); Panels (d)–(f): x2 SR results; (d) ROI of the bicubic interpolation result; (e) ROI of the SLSR result; (f) ROI of our result; Green arrows pointed three example fine structures that have been super-resolved using our method, but much more blurred using bicubic interpolation and SLSR method; Panels (g)–(i): x4 SR results; (g) ROI of the bicubic interpolation result; (h) ROI of the SLSR result; (i) ROI of our result; Pink arrows point to three example textures that have been preserved using our method, but which are more vague in the results of using bicubic interpolation or the SLSR method. (Color figure online)

Figure 5(a)–(c) shows the x2 SR results for one ROI of an in-vivo late gadolinium enhancement scan. Our SR result (Fig. 5(c)) shows noise suppression in the blood pool region of the left atrium, and therefore less confounding artifacts with similar intensities as the real late gadolinium enhancement of the myocardial wall. In addition, line profiles (Fig. 5(d) and (e)) across the SR results show that bicubic interpolation and SLSR method tend to be much noisier. Furthermore, the noise level estimations have been performed according to [22], and our method obtained a lower level of remaining noise (Table 3 shows that \(\sigma =32.04\pm 4.17\) compared to \(\sigma =35.76\pm 3.67\) obtained by bicubic interpolation and \(\sigma =40.17\pm 4.89\) obtained by SLSR method, and noise level of the original LR input is \(\sigma =53.52\pm 5.48\)).

Table 1. PSNR of x2 and x4 SR on ex-vivo cardiac MRI data from different views. For \(\sigma >0\) we run 10 times for each noise level to obtain the mean ± std.
Table 2. MSE of x2 and x4 SR on ex-vivo cardiac MRI data from different views. For \(\sigma >0\) we run 10 times for each noise level to obtain the mean ± std.
Fig. 4.
figure 4

Left: Boxplot of the PSNR (Sagittal x2 SR); Right: Boxplot of the MSE (Sagittal x2 SR). Statistical significant between each two groups are showing above the boxplot (*** stands for \(p<0.05\) and n.s. means no significant difference between two groups).

4 Discussion and Conclusion

The objective of this study was to develop an image post-processing method to super-resolve LR cardiac MR images from single input while effectively suppressing the noise. Results of this work provide compelling evidence that our single image SR method, which is coupled with multiscale DTCWT based denoising, is capable of reconstructing superior HR results with significantly lower remaining noise compared with conventional bicubic interpolation and a state-of-the-art self-learning based SR approach, i.e., SLSR.

Kernel based single image SR methods (example kernels including linear, bicubic, and Lanczos kernel) are the most widely used in clinical applications due to their efficiency. Results of our experiments on ex-vivo cardiac MR images showed that bicubic kernel based interpolation obtained lower PSNR and higher MSE than SLSR or our methods when the noise level \(\sigma \le 20\) (Tables 1 and 2). This can be attributed to the fact that bicubic interpolation makes strong assumptions on spatial smoothness of images that can result in blurred edges, loss of fine textures, and averaged noisy pixels with their less noisy neighbors (Fig. 3(d) and (g)).

Self-similarity based methods and later proposed self-learning based methods utilize a multiscale decomposition of the LR image to understand the relationship between its current scale and its lower resolution versions. Then the HR output is reconstructed either using prediction from candidates selected by k-nearest neighbors method (self-similarity based) or SVR (self-learning based). To the best of our knowledge, previously published self-similarity or self-learning based methods (e.g., [1, 11, 13]) are demonstrated on natural images, which are assumed with less or no noise contamination. Results of the SLSR method tested on ex-vivo data with various additive noises showed that for the noise level \(\sigma \le 20\) SLSR outperformed bicubic interpolation with significant differences (Fig. 3(e) and (h) and Tables 1 and 2). This may be attributed to the fact that SVR learns local pixel value variations nonlinearly, while negative lobes on the bicubic kernel tend to create ringing artifacts. When \(\sigma =25\) SLSR performed similar to bicubic interpolation (no significant difference) demonstrating that SLSR results can be affected by noise significantly. Tellingly, with noise suppression in each level of the multiscale pyramid our method showed significant improvement of both PSNR and MSE even for the dense noise contamination with high magnification factor (Tables 1 and 2). Both line profiles and remaining noise estimation further confirmed the merits of our SR method (Fig. 5(d) and (e) and Table 3).

Table 3. Noise estimation of in-vivo late gadolinium enhancement data for randomly selected 40 ROIs to obtain the mean ± std.
Fig. 5.
figure 5

Panels (a)–(c): x2 SR results of an in-vivo late gadolinium enhancement scan using bicubic interpolation, SLSR method and our approach respectively; (d) Horizontal line profiles through the green arrows of the SR results; (e) Zoomed-in line profiles of the pink region in (d). (Color figure online)

Although previous research demonstrated the success of using paired LR and HR training datasets to super-resolve a given new LR image [9], in this study we emphasize the practical infeasibility of this class of methods for CMRI images where acquiring HR training data can require prohibitively long scan times. For clinical CMRI studies, total scanning time per patient is already long (typically 60 min) and this can not easily be extended for HR data acquisition without having a significant input on patient throughput. A method, which doesn’t require the use of HR training data, is therefore highly desirable. Moreover, there are potential hazards include painful peripheral nerve stimulation for prolonged scanning time. We are aware that our proof of concept study may have two limitations. First, there are several parameters that can be tuned in our framework, e.g., the thresholds for the calculated gradient magnitude in Sobel filtering and DTCWT denoising. Currently they are based on trial and error; however, we envisage that future assessment of the robustness of the SR framework will include parameter perturbation. Second, we explicitly assumed an additive Gaussian noise for simplicity, but more complex noise modeling and suppression could be easily plugged-in to the current framework. In summary, our SR framework, which couples SVR and DTCWT, can achieve promising HR CMRI results while effectively suppressing the noise.