Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Recurrent Inference Machine for Medical Image Registration

\firstnameYi \surnameZhang \emaily.zhang-43@tudelft.nl
\addrDepartment of Imaging Physics, Delft University of Technology, Delft, The Netherlands \AND\nameYidong Zhao \emaily.zhao-8@tudelft.nl
\addrDepartment of Imaging Physics, Delft University of Technology, Delft, The Netherlands \AND\nameHui Xue \emailxueh@microsoft.com
\addrHealth Futures, Microsoft Research, Redmond, Washington, USA \AND\namePeter Kellman \emailkellmanp@nhlbi.nih.gov
\addrNational Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA \AND\nameStefan Klein \emails.klein@erasmusmc.nl
\addrDepartment of Radiology and Nuclear Medicine, Erasmus University Medical Center, Rotterdam, the Netherlands \AND\nameQian Tao \emailq.tao@tudelft.nl
\addrDepartment of Imaging Physics, Delft University of Technology, Delft, The Netherlands
Abstract

Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input.

We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only 5%percent55\%5 % of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.

Keywords: Deep Learning, Image Registration, Recurrent Inference Machine

1 Introduction

Medical image registration, the process of establishing anatomical correspondences between two or more medical images, finds wide applications in medical imaging research, including imaging feature fusion (Haskins et al., 2020; Oliveira and Tavares, 2014), treatment planning (Staring et al., 2009; King et al., 2010; Byrne et al., 2022), and longitudinal patient studies (Sotiras et al., 2013; Jin et al., 2021). Medical image registration is traditionally formulated as an optimization problem, which aims to solve a parameterized transformation in an iterative manner (Klein et al., 2007). Typically, the optimization objective consists of two parts: a similarity term that enforces the alignments between images, and a regularization term that imposes smoothness constraints. Due to the complexity of non-convex optimization, traditional methods often struggle with long run time, especially for large, high-resolution images. This hinders its practical use in clinical practice, e.g. surgery guidance (Sauer, 2006), where fast image registration is demanded(Avants et al., 2011; Balakrishnan et al., 2019).

With recent developments in machine learning, the data-driven deep-learning paradigm has gained popularity in medical image registration (Rueckert and Schnabel, 2019). Instead of iteratively updating the transformation parameters by a conventional optimization pipeline, deep learning-based methods make fast image-to-transformation predictions at inference time. Early works learned the transformation in a supervised manner (Miao et al., 2016; Yang et al., 2016), while unsupervised learning methods later became prevalent. They adopt similar loss functions as those in conventional methods but optimize them through amortized neural networks (Balakrishnan et al., 2019; De Vos et al., 2019). These works demonstrate the great potential of deep learning-based modes for medical image registration. Nonetheless, one-step inference of image transformation is in principle a difficult problem, compared to the iterative approach, especially when the deformation field is large. In practice, the one-step inference requires a relatively large amount of data to train the deep learning network for consistent prediction, and may still lead to unexpected transformations at inference time (Fechter and Baltas, 2020; Hering et al., 2019; Zhao et al., 2019).

In contrast to one-step inference, recent studies revisited iterative registration, using multi-step inference processes (Fechter and Baltas, 2020; Kanter and Lellmann, 2022; Qiu et al., 2022; Sandkühler et al., 2019; Zhao et al., 2019). Some of these iterative methods Kanter and Lellmann (2022) Qiu et al. (2022) fall within the realm of meta-learning. Instead of learning the optimized parameters, meta-learning focuses on learning the optimization process itself. The use of meta-learning in optimization, as explored by Andrychowicz et al. (2016) and Finn et al. (2017) for image classification tasks, has led to enhanced generalization and faster convergence. For medical imaging applications, a prominent example is the recurrent inference machine (RIM) by Putzky and Welling (2017), originally proposed to solve inverse problems with explicit forward physics models. RIM has demonstrated excellent performance in fast MRI reconstruction (Lønning et al., 2019) and MR relaxometry (Sabidussi et al., 2021).

In this study, we propose a novel meta-learning medical image registration method, named Recurrent Inference Image Registration (RIIR). RIIR is inspired by RIM, but significantly extends its concept to solve more generic optimization problems: different from inverse problems, medical image registration presents a high-dimensional optimization challenge with no closed-form forward model. Below we provide a detailed review to motivate our work.

1.1 Related Works

In this section, we review deep learning-based medical image registration methods in more detail, categorizing them into one-step methods for direct image-to-transformation inference, and iterative methods for multi-step inference. Additionally, we provide a brief overview of meta-learning for medical imaging applications.

1.1.1 One-step Deep Learning-based Registration

Early attempts of utilizing convolutional neural networks (CNNs) for medical image registration supported confined transformations, such as SVF-Net (Rohé et al., 2017), Quicksilver (Yang et al., 2017), and the work of Cao et al. (2017), which are mostly trained in a supervised manner. With the introduction of U-Net architecture (Ronneberger et al., 2015), which has excellent spatial expression capability thanks to its multi-resolution and skip connection, Balakrishnan et al. (2019); Dalca et al. (2019); Hoopes et al. (2021) proposed unsupervised deformable registration frameworks. In the work of De Vos et al. (2019), a combination of affine and deformable transformations was further considered. More recent methods extended the framework by different neural network backbones such as transformers (Zhang et al., 2021) or implicit neural representations (Wolterink et al., 2022; van Harten et al., 2023).

1.1.2 Iterative Deep Learning-based Registration

However, a one-step inference strategy may struggle when predicting large and complex transformations (Hering et al., 2019; Zhao et al., 2019). In contrast to one-step deep learning-based registration methods, recent work adopted iterative processes, reincarnating the conventional pipeline of optimization for medical image registration, either in terms of image resolution (Hering et al., 2019; Fechter and Baltas, 2020; Xu et al., 2021; Liu et al., 2021), multiple optimization steps (Zhao et al., 2019; Sandkühler et al., 2019; Kanter and Lellmann, 2022), or combined (Qiu et al., 2022). In Sandkühler et al. (2019), the use of RNN with gated recurrent unit (GRU) (Chung et al., 2014) was considered, where each step progressively updates the transformation by adding an independent parameterized transformation. Another multi-step method proposed in Zhao et al. (2019) uses recursive cascaded networks to generate a sequence of transformations, which is then composed to get the final transformation. However, the method requires independent modules for each step, which can be memory-inefficient.Hering et al. (2019) proposed a variational method on different levels of resolution, where the final transformation is the composition of the transformations from coarse- to fine-grained. Fechter and Baltas (2020) addresses the importance of data efficiency of deep learning-based models by evaluating the model performance when data availability is limited, and a large domain shift exists. A more recent work proposed in Qiu et al. (2022), Gradient Descent Network for Image Registration (GraDIRN), integrates multi-step and multi-resolution for medical image registration. Specifically, the update rule follows the idea of conventional optimization by deriving the gradient of the similarity term w.r.t. the current transformation and using a CNN to estimate the gradient of the regularization term. Though the direct influence of the gradient term shows to be minor compared to the CNN output (Qiu et al., 2022), the method bridges gradient-based optimization and deep learning-based methods. The method proposed in Kanter and Lellmann (2022) used individual long short-term memory (LSTM) modules for implementing recurrent refinement of the transformation. However, the scope of the work is limited to affine transformation, which only serves as an initialization for the conventional medical image registration pipeline.

1.1.3 Meta-Learning and Recurrent Inference Machine

Meta-learning, also described as “learning to learn”, is a subfield of machine learning. In this approach, an outer algorithm updates an inner learning algorithm, enabling the model to adapt and optimize its learning strategy to achieve a broader objective. For example, in a meta-learning scenario, a model could be trained on a variety of tasks, such as different types of image recognition, with the goal of quickly adapting to unseen similar tasks, like recognizing new kinds of objects not included in the original training set, using a few training samples. (Hospedales et al., 2021). An early approach in meta-learning is designing an architecture of networks that can update their parameters according to different tasks and data inputs (Schmidhuber, 1993). The work of Cotter and Conwell (1990) and Younger et al. (1999) further show that a fixed-weight RNN demonstrates flexibility in learning multiple tasks. More recently, methods learning an optimization process with RNNs were developed and studied in Andrychowicz et al. (2016); Chen et al. (2017); Finn et al. (2017), demonstrating superior convergence speed and better generalization ability for unseen tasks.

In the spirit of meta-learning, RIM was developed by Putzky and Welling (2017) to solve inverse problems. RIM learns a single recurrent architecture that shares the parameters across all iterations, with internal states passing through iterations (Putzky and Welling, 2017). In the context of meta-learning, RIM distinguishes two tasks of different levels: the ‘inner task’, which focuses on solving a specific inverse problem (e.g., superresolution of an image), and the ‘outer task’, aimed at optimizing the optimization process itself. This setting enables RIM to efficiently learn and apply optimization strategies to complex problems. RIM has shown robust and competitive performance across different application domains, from cosmology (Morningstar et al., 2019; Modi et al., 2021) to medical imaging (Karkalousos et al., 2022; Lønning et al., 2019; Putzky et al., 2019; Sabidussi et al., 2021, 2023). They all aim to solve an inverse problem with a known differentiable forward model in closed form, such as Fourier transform with sensitivity map and sampling mask in MRI reconstruction (Lønning et al., 2019).

However, the definition of an explicit forward model does not exist for the medical image registration task. In this work, we sought to extend the framework of RIM, which demonstrated state-of-the-art performance in medical image reconstruction challenges (Muckley et al., 2021; Putzky et al., 2019; Zbontar et al., 2018), to the medical image registration problem. The same formulation can be generalized to other high-dimensional optimization problems where explicit forward models are absent but differentiable evaluation metrics are available.

1.2 Contributions

The main contributions of our work are three-fold:

1. We propose a novel meta-learning framework, RIIR, for medical image registration. RIIR learns the optimization process, in the absence of explicit forward models. RIIR is flexible w.r.t. the input modality while demonstrating competitive accuracy in different medical image registration applications.

2. Unlike existing iterative deep learning-based methods, our method integrates the gradient information of input images into the prediction of dense incremental transformations. As such, RIIR largely simplifies the learning task compared to one-step inference, significantly enhancing the overall data efficiency, as demonstrated by our experiments.

3. Through in-depth ablation experiments, we not only showed the flexibility of our proposed method with varying input choices but also investigated how different architectural choices within the RIM framework impact its performance. In particular, we showed the added value of hidden states in solving complex optimization problems in the context of medical imaging, which was under-explored in existing literature.

2 Methods

2.1 Deformable Image Registration

Deformable image registration aims to align a moving image Imovsubscript𝐼movI_{\text{mov}}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT to a fixed image Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT by determining a transformation ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ acting on the shared coordinates 𝒙𝒙\bm{x}bold_italic_x, such that the transformed image Imovϕsubscript𝐼movbold-italic-ϕI_{\text{mov}}\circ\bm{\bm{\phi}}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ is similar enough to Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT. The similarity is often evaluated by a scalar-valued metric. In deformable image registration, ϕbold-italic-ϕ\bm{\bm{\phi}}bold_italic_ϕ is considered to be a relatively small displacement added to the original coordinate 𝒙𝒙\bm{x}bold_italic_x, expressed as ϕ=𝒙+u(𝒙)bold-italic-ϕ𝒙𝑢𝒙\bm{\phi}=\bm{x}+u(\bm{x})bold_italic_ϕ = bold_italic_x + italic_u ( bold_italic_x ). Since the transformation ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ is calculated between the pair (Imov,Ifix)subscript𝐼movsubscript𝐼fix(I_{\text{mov}},I_{\text{fix}})( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ), the process is often referred to as pairwise registration (Balakrishnan et al., 2019). Finding such transformation ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ in pairwise registration can be viewed as the following optimization problem:

ϕ^=argminϕsim (Imovϕ,Ifix)+λreg (ϕ),^bold-italic-ϕbold-italic-ϕargminsubscriptsim subscript𝐼movbold-italic-ϕsubscript𝐼fix𝜆subscriptreg bold-italic-ϕ\hat{\bm{\phi}}=\underset{\bm{\phi}}{\operatorname{argmin}}\ \mathcal{L}_{% \text{sim }}(I_{\text{mov}}\circ\bm{\phi},I_{\text{fix}})+\lambda\mathcal{L}_{% \text{reg }}(\bm{\phi}),over^ start_ARG bold_italic_ϕ end_ARG = underbold_italic_ϕ start_ARG roman_argmin end_ARG caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) + italic_λ caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT ( bold_italic_ϕ ) , (1)

where simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT is a similarity term between the deformed image Imovϕsubscript𝐼movbold-italic-ϕI_{\text{mov}}\circ\bm{\phi}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ and fixed image Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT, regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT is a regularization term constraining ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ, and λ𝜆\lambdaitalic_λ is a trade-off weight term.

2.2 Recurrent Inference Machine (RIM)

Refer to caption
Figure 1: Overview of RIIR framework. Here, an illustrative cardiac image pair is shown as an example. The hidden states 𝒉t=[𝒉t1,𝒉t2]subscript𝒉𝑡superscriptsubscript𝒉𝑡1superscriptsubscript𝒉𝑡2\bm{h}_{t}=[\bm{h}_{t}^{1},\bm{h}_{t}^{2}]bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] are visualized in channel-wise fashion. The inner loss innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT is calculated during each step of RIIR thus dynamically changing. When t=0𝑡0t=0italic_t = 0, the deformation field ϕ0subscriptbold-italic-ϕ0\bm{\phi}_{0}bold_italic_ϕ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is initialized as an identical transformation. In RIIR Cell, the dimensions of Conv and ConvGRU layer are dependent on the input (2D or 3D).

The idea of RIM originates from solving a closed-form inverse problem (Putzky and Welling, 2017):

𝒚=A𝒙+𝒏,𝒚𝐴𝒙𝒏\bm{y}=A\bm{x}+\bm{n},bold_italic_y = italic_A bold_italic_x + bold_italic_n , (2)

where 𝒚m𝒚superscript𝑚\bm{y}\in\mathbb{R}^{m}bold_italic_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is a noisy measurement vector, 𝒙d𝒙superscript𝑑\bm{x}\in\mathbb{R}^{d}bold_italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is the underlying noiseless signal, Am×d𝐴superscript𝑚𝑑A\in\mathbb{R}^{m\times d}italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_d end_POSTSUPERSCRIPT is a measurement matrix, and 𝒏𝒏\bm{n}bold_italic_n is a random noise vector. When mdmuch-less-than𝑚𝑑m\ll ditalic_m ≪ italic_d, the inverse problem is ill-posed. Thus, to constrain the solution space of x𝑥xitalic_x, a common practice is to solve a maximum a posteriori (MAP) problem:

max𝒙loglikelihood(𝒚|𝒙)+logpprior(𝒙),subscript𝒙subscriptlikelihoodconditional𝒚𝒙subscript𝑝prior𝒙\max_{\bm{x}}\log\mathcal{L}_{\text{likelihood}}(\bm{y}|\bm{x})+\log p_{\text{% prior}}(\bm{x}),roman_max start_POSTSUBSCRIPT bold_italic_x end_POSTSUBSCRIPT roman_log caligraphic_L start_POSTSUBSCRIPT likelihood end_POSTSUBSCRIPT ( bold_italic_y | bold_italic_x ) + roman_log italic_p start_POSTSUBSCRIPT prior end_POSTSUBSCRIPT ( bold_italic_x ) , (3)

where likelihood(𝒚|𝒙)subscriptlikelihoodconditional𝒚𝒙\mathcal{L}_{\text{likelihood}}(\bm{y}|\bm{x})caligraphic_L start_POSTSUBSCRIPT likelihood end_POSTSUBSCRIPT ( bold_italic_y | bold_italic_x ) is a likelihood term representing the noisy forward model, such as the Fourier transform with masks in MRI reconstruction Putzky et al. (2019), and ppriorsubscript𝑝priorp_{\text{prior}}italic_p start_POSTSUBSCRIPT prior end_POSTSUBSCRIPT is the prior distribution of the underlying signal 𝒙𝒙\bm{x}bold_italic_x. A simple iterative scheme at step t𝑡titalic_t for solving Eq. 3 is via gradient descent:

𝒙t+1=𝒙t+γt𝒙t(loglikelihood(𝒚|𝒙)+logpprior(𝒙)),subscript𝒙𝑡1subscript𝒙𝑡subscript𝛾𝑡subscriptsubscript𝒙𝑡subscriptlikelihoodconditional𝒚𝒙subscript𝑝prior𝒙\bm{x}_{t+1}=\bm{x}_{t}+\gamma_{t}\nabla_{\bm{x}_{t}}\left(\log\mathcal{L}_{% \text{likelihood}}(\bm{y}|\bm{x})+\log p_{\text{prior}}(\bm{x})\right),bold_italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_log caligraphic_L start_POSTSUBSCRIPT likelihood end_POSTSUBSCRIPT ( bold_italic_y | bold_italic_x ) + roman_log italic_p start_POSTSUBSCRIPT prior end_POSTSUBSCRIPT ( bold_italic_x ) ) , (4)

where γtsubscript𝛾𝑡\gamma_{t}italic_γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT denotes a scalable step length and 𝒙tsubscriptsubscript𝒙𝑡\nabla_{\bm{x}_{t}}∇ start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes the gradient w.r.t. 𝒙𝒙\bm{x}bold_italic_x, evaluated at 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Then, in RIM implementation, Eq. 4 is represented as:

𝒙t+1=𝒙t+gθ(𝒙t(loglikelihood(𝒚|𝒙)),𝒙t),subscript𝒙𝑡1subscript𝒙𝑡subscript𝑔𝜃subscriptsubscript𝒙𝑡subscriptlikelihoodconditional𝒚𝒙subscript𝒙𝑡\bm{x}_{t+1}=\bm{x}_{t}+g_{\theta}\left(\nabla_{\bm{x}_{t}}\left(\log\mathcal{% L}_{\text{likelihood}}(\bm{y}|\bm{x})\right),\bm{x}_{t}\right),bold_italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ∇ start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( roman_log caligraphic_L start_POSTSUBSCRIPT likelihood end_POSTSUBSCRIPT ( bold_italic_y | bold_italic_x ) ) , bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , (5)

where gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is a neural network parameterized by θ𝜃\thetaitalic_θ. In RIM, the prior distribution pprior(𝒙)subscript𝑝prior𝒙p_{\text{prior}}(\bm{x})italic_p start_POSTSUBSCRIPT prior end_POSTSUBSCRIPT ( bold_italic_x ) is implicitly integrated into the parameterized neural network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT which is trained with a weighted sum of the individual prediction losses between 𝒙𝒙\bm{x}bold_italic_x and 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (e.g., the mean squared loss) at each time step t𝑡titalic_t.

In the context of meta-learning, we regard the likelihood term likelihoodsubscriptlikelihood\mathcal{L}_{\text{likelihood}}caligraphic_L start_POSTSUBSCRIPT likelihood end_POSTSUBSCRIPT guided by the forward model as the ‘inner loss’, denoted by innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT as it is serving as the input of the neural network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT. For example, given the Gaussian assumption of the noise 𝒏𝒏\bm{n}bold_italic_n with a known variance of σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and linear forward model described in Eq. 2, the inner loss can be given as the logarithm of the maximum likelihood estimation (MLE) solution:

inner=1σ2𝒚A𝒙22.subscriptinner1superscript𝜎2superscriptsubscriptnorm𝒚𝐴𝒙22\mathcal{L}_{\text{inner}}=\frac{1}{\sigma^{2}}\|\bm{y}-A\bm{x}\|_{2}^{2}.caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ∥ bold_italic_y - italic_A bold_italic_x ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (6)

In RIM, the gradient of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT is calculated explicitly with the (linear) forward operator A𝐴Aitalic_A, which is free of the forward pass of a neural network. That means innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT does not directly contribute to the update of the network parameters θ𝜃\thetaitalic_θ. The weighted loss for training the neural network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT for efficient solving the inverse problem can be regarded as the ‘outer loss’, denoted by outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT. In the form of the inverse problem shown in Eq. 2, the outer loss to update the network parameter θ𝜃\thetaitalic_θ across T𝑇Titalic_T time steps can be expressed as:

outer(θ)=1Ti=1T𝒙𝒙t22.subscriptouter𝜃1𝑇superscriptsubscript𝑖1𝑇superscriptsubscriptnorm𝒙subscript𝒙𝑡22\mathcal{L}_{\text{outer}}(\theta)=\frac{1}{T}\sum_{i=1}^{T}\|\bm{x}-\bm{x}_{t% }\|_{2}^{2}.caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT ( italic_θ ) = divide start_ARG 1 end_ARG start_ARG italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∥ bold_italic_x - bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (7)

For clarity and consistency, these notations of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT and outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT will be uniformly applied in the subsequent sections.

2.3 Recurrent Inference Image Registration Network (RIIR)

Inspired by the formulation of RIM and the optimization nature of medical image registration, we present a novel deep learning-based image registration framework, named the Recurrent Inference Image Registration Network (RIIR). The overview of our proposed framework can be found in Fig. 1. The proposed framework performs an end-to-end iterative prediction of a dense transformation ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ in T𝑇Titalic_T steps for pairwise registration: Given the input image pair (Imov,Ifix)subscript𝐼movsubscript𝐼fix\left(I_{\text{mov}},I_{\text{fix}}\right)( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ), the optimization problem in Eq. 1 can be solved by the iterative update of ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ. And the update rule at step t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\ldots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 } is:

ϕt+1=ϕt+Δϕt,subscriptbold-italic-ϕ𝑡1subscriptbold-italic-ϕ𝑡Δsubscriptbold-italic-ϕ𝑡\bm{\phi}_{t+1}=\bm{\phi}_{t}+\Delta\bm{\phi}_{t},bold_italic_ϕ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT = bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + roman_Δ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , (8)

where ϕ0subscriptbold-italic-ϕ0\bm{\phi}_{0}bold_italic_ϕ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is initialized as an identity mapping ϕ0(𝒙)=𝒙subscriptbold-italic-ϕ0𝒙𝒙\bm{\phi}_{0}(\bm{x})=\bm{x}bold_italic_ϕ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_italic_x ) = bold_italic_x. The update at step t𝑡titalic_t, ΔϕtΔsubscriptbold-italic-ϕ𝑡\Delta\bm{\phi}_{t}roman_Δ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, is calculated by a recurrent update network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT by taking a channel-wise concatenation of

{ϕt,ϕtinner(Imovϕt,Ifix)}subscriptbold-italic-ϕ𝑡subscriptsubscriptbold-italic-ϕ𝑡subscriptinnersubscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fix\{\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}\left(I_{\text% {mov}}\circ\bm{\phi}_{t},I_{\text{fix}}\right)\}{ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , ∇ start_POSTSUBSCRIPT bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) } (9)

as input, where ϕtsubscriptsubscriptbold-italic-ϕ𝑡\nabla_{\bm{\phi}_{t}}∇ start_POSTSUBSCRIPT bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes the gradient w.r.t. ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ evaluated for ϕ=ϕtbold-italic-ϕsubscriptbold-italic-ϕ𝑡\bm{\phi}=\bm{\phi}_{t}bold_italic_ϕ = bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT denotes the inner loss.

Originally, RIM aimed to learn a recurrent solver for an inverse problem where the forward model from signal to measurement is known, such as quantitative mapping (Sabidussi et al., 2021) or MRI reconstruction (Lønning et al., 2019). However, the optimization of image registration is in an unsupervised manner with the forward model from the measurement pair (Imov,Ifix)subscript𝐼movsubscript𝐼fix(I_{\text{mov}},I_{\text{fix}})( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) to the deformation field ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ unknown. In this work, we extend the framework of RIM to be amendable to a broader range of tasks including image registration. We design the inner loss innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT by adapting the optimization objective in Eq. 1. Specifically, we use the similarity part simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT in Eq. 1 as the inner loss at time step t𝑡titalic_t:

inner(Imovϕt,Ifix)=sim(Imovϕt,Ifix).subscriptinnersubscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fixsubscriptsimsubscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fix\mathcal{L}_{\text{inner}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}% }\right)=\mathcal{L}_{\text{sim}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{% \text{fix}}\right).caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) = caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) . (10)

The explicit modelling of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT also helps the auto differentiation for calculating its gradient while making the regularization part to be learned implicitly in gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT during the training.

In the implementation of RIM, the iterative update Eq. 8 is achieved by a recurrent neural network (RNN) to generalize the update rule in Eq. 5 with hidden memory state variable 𝒉𝒉\bm{h}bold_italic_h estimated for each time step t𝑡titalic_t. Unlike previous RIM-based works (Putzky et al., 2019; Sabidussi et al., 2021) which use two linear gated recurrent units (GRU) to calculate the hidden states 𝒉tsubscript𝒉𝑡\bm{h}_{t}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, in RIIR, two convolutional gated recurrent units (ConvGRU) (Shi et al., 2015) are used to better preserve spatial correlation in the image. We further investigate the necessity of including such two-level recurrent structures in our experiment, particularly considering potential complexities in constructing computation graphs for neural networks. The iterative update equations of RIIR at step t𝑡titalic_t have the following form, with the hidden memory states:

{Δϕt,𝒉t+1}Δsubscriptbold-italic-ϕ𝑡subscript𝒉𝑡1\displaystyle\left\{\Delta\bm{\phi}_{t},\bm{h}_{t+1}\right\}{ roman_Δ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT } =gθ(ϕt,ϕtinner(Imovϕt,Ifix),𝒉t),absentsubscript𝑔𝜃subscriptbold-italic-ϕ𝑡subscriptsubscriptbold-italic-ϕ𝑡subscriptinnersubscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fixsubscript𝒉𝑡\displaystyle=g_{\theta}(\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{% \text{inner}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}\right),\bm{% h}_{t}),= italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , ∇ start_POSTSUBSCRIPT bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) , bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , (11)
ϕt+1subscriptbold-italic-ϕ𝑡1\displaystyle\bm{\phi}_{t+1}bold_italic_ϕ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT =ϕt+Δϕt,absentsubscriptbold-italic-ϕ𝑡Δsubscriptbold-italic-ϕ𝑡\displaystyle=\bm{\phi}_{t}+\Delta\bm{\phi}_{t},= bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + roman_Δ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , (12)

where 𝒉t={𝒉t1,𝒉t2}subscript𝒉𝑡superscriptsubscript𝒉𝑡1superscriptsubscript𝒉𝑡2\bm{h}_{t}=\{\bm{h}_{t}^{1},\bm{h}_{t}^{2}\}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT } denotes the two-level hidden memory states at step t𝑡titalic_t. The size of 𝒉tsubscript𝒉𝑡\bm{h}_{t}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT depends on the size of input image pair (Imov,Ifix)subscript𝐼movsubscript𝐼fix(I_{\text{mov}},I_{\text{fix}})( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) with multiple channels. For t=1𝑡1t=1italic_t = 1, 𝒉1subscript𝒉1\bm{h}_{1}bold_italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is initialized to a zero input. We name our network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT as RIIR Cell, with its detailed architecture illustrated in Fig. 1. To address the difference between our RIIR from the existing gradient-based iterative algorithm (GraDIRN) (Qiu et al., 2022) under the same definition of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT as in Eq. 10, RIIR uses the gradient of inner loss as the neural network input to calculate the incremental update. On the other hand, GraDIRN takes the channel-wise warped image pair (Imovϕ,Ifix)subscript𝐼movitalic-ϕsubscript𝐼fix(I_{\text{mov}}\circ\phi,I_{\text{fix}})( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ italic_ϕ , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) and deformation field ϕitalic-ϕ\phiitalic_ϕ as the input to the network to output regularization update in Eq. 1, while the gradient of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT is added to the update without any further processing. Since the network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT learns an incremental update with the gradient of inner loss as compressed information, the training of RIIR can be more efficient, which will be discussed in the experiment section.

Unlike previous work in deep learning-based iterative deformable image registration methods which does not incorporate internal hidden states (Zhao et al., 2019; Fechter and Baltas, 2020; Qiu et al., 2022), we propose to combine the gradient information and hidden states as the network input. Using 𝒉tsubscript𝒉𝑡\bm{h}_{t}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT also suggests an analogy with gradient-based optimization methods such as the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS) to track and memorize progression (Putzky and Welling, 2017). To substantiate this design, the input selections of RIIR will be further ablation-studied and discussed in our experiments.

Since the ground-truth deformation field is not known in deformable image registration, we use the optimization objective Eq. 1 as the proposed outer loss to optimize the parameters θ𝜃\thetaitalic_θ of RIIR Cell gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT. We incorporate a weighted sum of losses for the outer loss outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT to ensure that each step contributes to the final prediction:

outer(θ)=t=1Twt(sim (Imovϕt,Ifix)+λreg (ϕt)),subscriptouter𝜃superscriptsubscript𝑡1𝑇subscript𝑤𝑡subscriptsim subscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fix𝜆subscriptreg subscriptbold-italic-ϕ𝑡\mathcal{L}_{\text{outer}}(\theta)=\sum_{t=1}^{T}w_{t}\left(\mathcal{L}_{\text% {sim }}(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}})+\lambda\mathcal{L}_{% \text{reg }}(\bm{\phi}_{t})\right),caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT ( italic_θ ) = ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) + italic_λ caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT ( bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) , (13)

where wtsubscript𝑤𝑡w_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is a (positive) scalar indicating the weight of step t𝑡titalic_t. In our experiment, both uniform (wt=1Tsubscript𝑤𝑡1𝑇w_{t}=\frac{1}{T}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_T end_ARG) and exponential weights (wt=10t1T1subscript𝑤𝑡superscript10𝑡1𝑇1w_{t}=10^{\frac{t-1}{T-1}}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT divide start_ARG italic_t - 1 end_ARG start_ARG italic_T - 1 end_ARG end_POSTSUPERSCRIPT) are considered and will be compared in the experiments. It is noticeable that the design of using (weighted) average of step-wise loss also makes our proposed RIIR different from other iterative deep learning-based methods (Qiu et al., 2022; Zhao et al., 2019), addressing the fact that early steps in the prediction process were neglected before.

2.4 Metrics

Similarity Functions for Inner Loss innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT: In the context of image registration, unlike inverse problems with straightforward forward models, the problem is addressed as a broader optimization challenge. Therefore, it requires an investigation of choosing a (differentiable) function acting as the inner loss function evaluating the quality of estimation of ϕtsubscriptbold-italic-ϕ𝑡\bm{\phi}_{t}bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT iteratively in RIIR. Furthermore, the gradient of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT as an input of a convolutional recurrent neural network has not been studied before for deformable image registration. These motivate the study on the different choices of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT under a fixed choice of outlet loss outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT. In this work, we evaluate three similarity functions: mean squared error (MSE), normalized cross-correlation (NCC) (Avants et al., 2008), and normalized mutual information (NMI) (Studholme et al., 1999).

The MSE between two 3D images I1,I2dx×dy×dzsubscript𝐼1subscript𝐼2superscriptsubscript𝑑𝑥subscript𝑑𝑦subscript𝑑𝑧I_{1},I_{2}\in\mathbb{R}^{d_{x}\times d_{y}\times d_{z}}italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is defined as follows:

MSE(I1,I2)=1dxdydzI1I222.MSEsubscript𝐼1subscript𝐼21subscript𝑑𝑥subscript𝑑𝑦subscript𝑑𝑧superscriptsubscriptnormsubscript𝐼1subscript𝐼222\operatorname{MSE}\left(I_{1},I_{2}\right)=\frac{1}{d_{x}d_{y}d_{z}}{\left\|I_% {1}-I_{2}\right\|}_{2}^{2}.roman_MSE ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ∥ italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (14)

The MSE metric is minimized when pixels of I1subscript𝐼1I_{1}italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and I2subscript𝐼2I_{2}italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT have the same intensities. Therefore, it is sensitive to the contrast change. In comparison, the NCC metric measures the difference between images with the image intensity normalized. The NCC difference between I1subscript𝐼1I_{1}italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and I2subscript𝐼2I_{2}italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is given by:

NCC(I1,I2)=1dxdydz𝒙ΩI1𝒙Ω𝒙(I1(𝒙)I1¯(𝒙))(I2(𝒙)I2¯(𝒙))I^1(𝒙)I^2(𝒙),NCCsubscript𝐼1subscript𝐼21subscript𝑑𝑥subscript𝑑𝑦subscript𝑑𝑧subscript𝒙subscriptΩsubscript𝐼1subscriptsuperscript𝒙subscriptΩ𝒙subscript𝐼1superscript𝒙¯subscript𝐼1𝒙subscript𝐼2superscript𝒙¯subscript𝐼2𝒙subscript^𝐼1𝒙subscript^𝐼2𝒙\operatorname{NCC}(I_{1},I_{2})=\frac{1}{d_{x}d_{y}d_{z}}\sum_{\bm{x}\in\Omega% _{I_{1}}}\frac{\sum_{\bm{x}^{\prime}\in\Omega_{\bm{x}}}(I_{1}(\bm{x}^{\prime})% -\bar{I_{1}}(\bm{x}))(I_{2}(\bm{x}^{\prime})-\bar{I_{2}}(\bm{x}))}{\sqrt{\hat{% I}_{1}(\bm{x})\hat{I}_{2}(\bm{x})}},roman_NCC ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT bold_italic_x ∈ roman_Ω start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG ∑ start_POSTSUBSCRIPT bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ roman_Ω start_POSTSUBSCRIPT bold_italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - over¯ start_ARG italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ( bold_italic_x ) ) ( italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - over¯ start_ARG italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ( bold_italic_x ) ) end_ARG start_ARG square-root start_ARG over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_italic_x ) over^ start_ARG italic_I end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_italic_x ) end_ARG end_ARG , (15)

where ΩI1subscriptΩsubscript𝐼1\Omega_{I_{1}}roman_Ω start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT denotes all possible coordinates in I1subscript𝐼1I_{1}italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, Ω𝒙subscriptΩ𝒙\Omega_{\bm{x}}roman_Ω start_POSTSUBSCRIPT bold_italic_x end_POSTSUBSCRIPT represents a neighborhood of voxels around coordinate position 𝒙𝒙\bm{x}bold_italic_x and I¯(𝒙)¯𝐼𝒙\bar{I}(\bm{x})over¯ start_ARG italic_I end_ARG ( bold_italic_x ) and I^(𝒙)^𝐼𝒙\hat{I}(\bm{x})over^ start_ARG italic_I end_ARG ( bold_italic_x ) denote the (local) mean and variance in Ω𝒙subscriptΩ𝒙\Omega_{\bm{x}}roman_Ω start_POSTSUBSCRIPT bold_italic_x end_POSTSUBSCRIPT. In this study, we consider the local NCC with a window size of 9.

Compared to MSE and NCC, NMI is shown to be more robust when the linear relation of signal intensities between two images does not hold (Studholme et al., 1999; de Vos et al., 2020), which is often the case in quantitative MRI as the signal models are mostly exponential (Messroghli et al., 2004; Chow et al., 2022). The NMI between two images can be written as:

NMI(I1,I2)=H(I1)+H(I2)H(I1,I2),NMIsubscript𝐼1subscript𝐼2𝐻subscript𝐼1𝐻subscript𝐼2𝐻subscript𝐼1subscript𝐼2\operatorname{NMI}(I_{1},I_{2})=\frac{H(I_{1})+H(I_{2})}{H(I_{1},I_{2})},roman_NMI ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = divide start_ARG italic_H ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + italic_H ( italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_H ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG , (16)

where H(I1)𝐻subscript𝐼1H(I_{1})italic_H ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) and H(I2)𝐻subscript𝐼2H(I_{2})italic_H ( italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) are marginal entropies of I1subscript𝐼1I_{1}italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and I2subscript𝐼2I_{2}italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, respectively, and H(I1,I2)𝐻subscript𝐼1subscript𝐼2H(I_{1},I_{2})italic_H ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) denotes the joint entropy of the two images. Since the gradient is both necessary for innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT and outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT we adopt a differentiable approximation of the joint distribution proposed in Qiu et al. (2021) based on Parzen window with Gaussian distributions (Thévenaz and Unser, 2000).

Regularization Metrics: To ensure a smooth and reasonable deformation field, we use a diffusion regularization loss which penalizes large displacements in ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ acting on Idx×dy×dz𝐼superscriptsubscript𝑑𝑥subscript𝑑𝑦subscript𝑑𝑧I\in\mathbb{R}^{d_{x}\times d_{y}\times d_{z}}italic_I ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (Balakrishnan et al., 2019):

reg=1dxdydz𝒙ΩIϕ(𝒙)22,subscriptreg1subscript𝑑𝑥subscript𝑑𝑦subscript𝑑𝑧subscript𝒙subscriptΩ𝐼superscriptsubscriptnormbold-italic-ϕ𝒙22\mathcal{L}_{\text{reg}}=\frac{1}{d_{x}d_{y}d_{z}}\sum_{\bm{x}\in\Omega_{I}}\|% \nabla\bm{\phi}(\bm{x})\|_{2}^{2},caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT bold_italic_x ∈ roman_Ω start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ ∇ bold_italic_ϕ ( bold_italic_x ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (17)

where ϕ(𝒙)bold-italic-ϕ𝒙\nabla\bm{\phi}(\bm{x})∇ bold_italic_ϕ ( bold_italic_x ) denotes the Jacobian of ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ at coordinate 𝒙𝒙\bm{x}bold_italic_x. It is noticeable that Eq. 17 and its gradient are not evaluated in each RIIR inference step as indicated in Eq. 10, the outer loss outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT and the data-driven training process can guide the RIIR Cell gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT to learn the regularization implicitly.

3 Experiments

3.1 Dataset

We evaluated our proposed RIIR framework on two separate datasets: 1) A 3D brain MRI image dataset with inter-subject registration setup, OASIS (Marcus et al., 2007) with pre-processing from Hoopes et al. (2021), denoted as OASIS. 2) A 2D quantitative cardiac MRI image datasets based on multiparametric SAturation-recovery single-SHot Acquisition (mSASHA) image time series (Chow et al., 2022), denoted as mSASHA; These datasets, each serving our interests in inter-subject tissue alignment and respiratory motion correction with contrast variation.

Refer to caption
Figure 2: An example of OASIS dataset for two subjects as Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT and Imovsubscript𝐼movI_{\text{mov}}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT. The choices of Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT and Imovsubscript𝐼movI_{\text{mov}}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT are random during training.
Refer to caption
Figure 3: An example of mSASHA dataset, from left to right: Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT, Imovsubscript𝐼movI_{\text{mov}}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT (random sample 1), and Imovsubscript𝐼movI_{\text{mov}}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT (random sample 2). The three images were taken from the same image series, with different acquisition time points. To emphasize the difference in both signal intensity and contrast across images in a single series, the color ranges are set to be the same for the three images.

OASIS: The dataset contains 414 subjects, where for each subject, the normalized T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-weighted scan was acquired. The subjects are split into train/validation/test with counts of [300,30,84]3003084[300,30,84][ 300 , 30 , 84 ]. For training, images are randomly paired using an on-the-fly data loader, while in the validation and test sets, all images are paired with the next image in a fixed order. The dataset was preprocessed with FreeSurfer and SAMSEG by Hoopes et al. (2021), resulting in skull-stripped and bias-corrected 3D volumes with a size of 160×192×224160192224160\times 192\times 224160 × 192 × 224. We further resampled the images into a size of 128×128×128128128128128\times 128\times 128128 × 128 × 128 with intensity clipping between (1%,99%)percent1percent99(1\%,99\%)( 1 % , 99 % ) percentiles. Fig. 2 illustrates an example pair of OASIS images, showcasing the consistency in signal intensity and contrast.

mSASHA: During an free-breathing mSASHA examination, a time series of N=30𝑁30N=30italic_N = 30 real-valued 2D images, denoted by I={In|n=1,2,,N}𝐼conditional-setsubscript𝐼𝑛𝑛12𝑁I=\{I_{n}|\ n=1,2,\ldots,N\}italic_I = { italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_n = 1 , 2 , … , italic_N }, are acquired for the same subject. In the setting of quantitative MRI, we aim to spatially align N𝑁Nitalic_N images in a single sequence I𝐼Iitalic_I into a common fixed template image Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT, by individually performing N𝑁Nitalic_N pairwise registration processes over (In,Ifix)subscript𝐼𝑛subscript𝐼fix(I_{n},I_{\text{fix}})( italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ) where n=1,2,,N𝑛12𝑁n=1,2,\ldots,Nitalic_n = 1 , 2 , … , italic_N.

The mSASHA acquisition technique (Chow et al., 2022) is a voxel-wise 3-parameter signal model based on the joint cardiac T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT signal model:

𝒮(T1,T2,A)=A{1[1(1eTS/T1)eTE/T2]eTD/T1},𝒮subscript𝑇1subscript𝑇2𝐴𝐴1delimited-[]11superscript𝑒𝑇𝑆subscript𝑇1superscript𝑒𝑇𝐸subscript𝑇2superscript𝑒𝑇𝐷subscript𝑇1\mathcal{S}\left(T_{1},T_{2},A\right)=A\left\{1-\left[1-\left(1-e^{-TS/T_{1}}% \right)e^{-TE/T_{2}}\right]e^{-TD/T_{1}}\right\},caligraphic_S ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_A ) = italic_A { 1 - [ 1 - ( 1 - italic_e start_POSTSUPERSCRIPT - italic_T italic_S / italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) italic_e start_POSTSUPERSCRIPT - italic_T italic_E / italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ] italic_e start_POSTSUPERSCRIPT - italic_T italic_D / italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT } , (18)

where (TS,TE,TD)𝑇𝑆𝑇𝐸𝑇𝐷(TS,TE,TD)( italic_T italic_S , italic_T italic_E , italic_T italic_D ) denotes the set of three acquisition variables, and (T1,T2,A)subscript𝑇1subscript𝑇2𝐴(T_{1},T_{2},A)( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_A ) is the set of parameters to be estimated for each voxel coordinate of the image series. We encourage interested readers to refer to Chow et al. (2022) for a more detailed explanation.

In our experiment, an in-house mSASHA dataset was used. This fully anonymized raw dataset was provided by NIH, and was considered “non-human subject data research” by the NIH Office of Human Subjects Research”. The dataset comprises 120 subjects, with each subject having 3 slice positions, resulting in a total number of 360 slices. Each mSASHA time series consists of a fixed length of N=30𝑁30N=30italic_N = 30 images. We split mSASHA into train/validation/test with counts of [84,12,24]841224[84,12,24][ 84 , 12 , 24 ] by subjects to avoid data leakage across the three slices. Given variations in image sizes due to different acquisition conditions, we first center-cropped the images into the same size of 144×144144144144\times 144144 × 144. Subsequently, we applied intensity clipping between (1%,95%)percent1percent95(1\%,95\%)( 1 % , 95 % ) percentiles to mitigate extreme signal intensities from the chest wall region. We selected the last image in the series as the template Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT, which is a T2-weighted image with the greatest contrast between the myocardium and adjacent blood pool. An illustrative example of mSASHA images can be found in Fig. 3, showing varying contrasts and non-rigid motion across frames.

3.2 Evaluation Metrics

Conventionally, the root mean squared error (RMSE) is evaluated as an overall estimation of agreement between image pair (Imovϕ,Ifix)subscript𝐼movbold-italic-ϕsubscript𝐼fix(I_{\text{mov}}\circ\bm{\phi},I_{\text{fix}})( italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ). Meanwhile, the spatial smoothness of the transformation is evaluated by the standard error of log|J|=log|ϕ(𝒙)|𝐽bold-italic-ϕ𝒙\log\left|J\right|=\log\left|\nabla\bm{\phi}(\bm{x})\right|roman_log | italic_J | = roman_log | ∇ bold_italic_ϕ ( bold_italic_x ) |, where ||\left|\cdot\right|| ⋅ | denotes the determinant of a matrix. However, these conventional image- and displacement-based metrics are often closely tied to the training objectives simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT and regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT, thereby being affected by the trade-off parameter λ𝜆\lambdaitalic_λ. To mitigate the potential impact of the choice of λ𝜆\lambdaitalic_λ, in this work, we maintain a uniform value λ𝜆\lambdaitalic_λ across all baseline models and our proposed RIIR model for a given dataset and simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT.

For inter-subject brain MRI, two metrics, Dice score and Hausdorff distance (HD) are considered to evaluate the segmentation quality after registration. Given two sets XM𝑋𝑀X\subset Mitalic_X ⊂ italic_M and YM𝑌𝑀Y\subset Mitalic_Y ⊂ italic_M, the Dice score, is defined to measure the overlapping of X𝑋Xitalic_X and Y𝑌Yitalic_Y:

Dice(X,Y)=2|XY||X|+|Y|.𝐷𝑖𝑐𝑒𝑋𝑌2𝑋𝑌𝑋𝑌Dice(X,Y)=\frac{2|X\cap Y|}{|X|+|Y|}.italic_D italic_i italic_c italic_e ( italic_X , italic_Y ) = divide start_ARG 2 | italic_X ∩ italic_Y | end_ARG start_ARG | italic_X | + | italic_Y | end_ARG . (19)

Similarly, the Hausdorff distance of two aforementioned sets X𝑋Xitalic_X and Y𝑌Yitalic_Y is given by:

HD(X,Y):=max{supxXd(x,Y),supyYd(X,y)},assign𝐻𝐷𝑋𝑌subscriptsupremum𝑥𝑋𝑑𝑥𝑌subscriptsupremum𝑦𝑌𝑑𝑋𝑦HD(X,Y):=\max\left\{\sup_{x\in X}d(x,Y),\sup_{y\in Y}d(X,y)\right\},italic_H italic_D ( italic_X , italic_Y ) := roman_max { roman_sup start_POSTSUBSCRIPT italic_x ∈ italic_X end_POSTSUBSCRIPT italic_d ( italic_x , italic_Y ) , roman_sup start_POSTSUBSCRIPT italic_y ∈ italic_Y end_POSTSUBSCRIPT italic_d ( italic_X , italic_y ) } , (20)

where d(,)𝑑d(\cdot,\cdot)italic_d ( ⋅ , ⋅ ) is a metric (2-norm in this work) on M𝑀Mitalic_M and d(x,Y):=infyYd(x,y)assign𝑑𝑥𝑌subscriptinfimum𝑦𝑌𝑑𝑥𝑦d(x,Y):=\inf_{y\in Y}d(x,y)italic_d ( italic_x , italic_Y ) := roman_inf start_POSTSUBSCRIPT italic_y ∈ italic_Y end_POSTSUBSCRIPT italic_d ( italic_x , italic_y ). As a remark, in this work, we consider the average across all segmentation labels to calculate the Dice score and HD in OASIS instead of only considering major regions.

Furthermore, we also evaluate two more independent metrics for the mSASHA dataset proposed by Huizinga et al. (2016) isolated from training. The metrics are based on the principal component analysis (PCA) of images. Assume Mdxdy×N𝑀superscriptsubscript𝑑𝑥subscript𝑑𝑦𝑁M\in\mathbb{R}^{d_{x}d_{y}\times N}italic_M ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT × italic_N end_POSTSUPERSCRIPT is the matrix representation of I𝐼Iitalic_I, where a row of M𝑀Mitalic_M represents a coordinate in the image space. The correlation matrix of M𝑀Mitalic_M is then calculated by:

K=1dxdy1Σ1(MM¯)(MM¯)Σ1,𝐾1subscript𝑑𝑥subscript𝑑𝑦1superscriptΣ1superscript𝑀¯𝑀𝑀¯𝑀superscriptΣ1K=\frac{1}{d_{x}d_{y}-1}\Sigma^{-1}(M-\overline{M})^{\intercal}(M-\overline{M}% )\Sigma^{-1},italic_K = divide start_ARG 1 end_ARG start_ARG italic_d start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT - 1 end_ARG roman_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_M - over¯ start_ARG italic_M end_ARG ) start_POSTSUPERSCRIPT ⊺ end_POSTSUPERSCRIPT ( italic_M - over¯ start_ARG italic_M end_ARG ) roman_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT , (21)

where ΣΣ\Sigmaroman_Σ is a diagonal matrix representing the standard deviation of each column, and M¯¯𝑀\overline{M}over¯ start_ARG italic_M end_ARG denotes the column-wise mean for each column entry. Since an ideal qMRI model assumes a voxel-wise tissue alignment, the actual underlying dimension of K𝐾Kitalic_K can be characterized by a low-dimensional (linear) subspace driven by the signal model. In the mSASHA signal model, the dimension of such a subspace is assumed to be four according to Eq. 18, determined by the number of parameters to be estimated. With the fact that the trace of K𝐾Kitalic_K, tr(K)tr𝐾\text{tr}(K)tr ( italic_K ) is a constant, two PCA-based metrics were proposed as follows:

𝒟PCA1subscript𝒟PCA1\displaystyle\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT =i=1Nλij=1Lλj=tr(K)j=1Lλj,absentsuperscriptsubscript𝑖1𝑁subscript𝜆𝑖superscriptsubscript𝑗1𝐿subscript𝜆𝑗tr𝐾superscriptsubscript𝑗1𝐿subscript𝜆𝑗\displaystyle=\sum_{i=1}^{N}\lambda_{i}-\sum_{j=1}^{L}\lambda_{j}=\text{tr}(K)% -\sum_{j=1}^{L}\lambda_{j},= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = tr ( italic_K ) - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , (22)
𝒟PCA2subscript𝒟PCA2\displaystyle\mathcal{D}_{\text{PCA2}}caligraphic_D start_POSTSUBSCRIPT PCA2 end_POSTSUBSCRIPT =j=1Njλj.absentsuperscriptsubscript𝑗1𝑁𝑗subscript𝜆𝑗\displaystyle=\sum_{j=1}^{N}j\lambda_{j}.= ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_j italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT . (23)

Both metrics were designed to penalize a long-tail distribution of the spectrum of K𝐾Kitalic_K, and L𝐿Litalic_L is a hyperparameter regarding the number of parameters of the signal model. For 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT, an ideal scenario would involve all images perfectly aligning with tissue anatomy and the signal model, resulting in a value of 00. Meanwhile, the interpretation of 𝒟PCA2subscript𝒟PCA2\mathcal{D}_{\text{PCA2}}caligraphic_D start_POSTSUBSCRIPT PCA2 end_POSTSUBSCRIPT further emphasizes the tail of the eigenvalues thus enlarging the gaps across experiments.

To narrow the analysis to the region of interest to the heart region, the calculation is confined to this area by cropping the resulting images before computing the metric. This constraint ensures that the evaluation is focused on the relevant anatomical structures.

3.3 Experimental Settings

We here summarize the main experiments for evaluation and further ablation experiments for RIIR. For all experiments, the main workflow is to register the image series I𝐼Iitalic_I of length N𝑁Nitalic_N in a pairwise manner: that is, we first choose a template Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT, and then perform N𝑁Nitalic_N registrations. When N=2𝑁2N=2italic_N = 2, the registration process simplifies to straightforward pairwise registration, as is the situation for OASIS.

Experiment 1: Comparison Study with Varying Data Availability

We introduce five data-availability scenarios to evaluate the robustness of the models when data availability is limited, on both datasets, which often happens in both research settings and clinical practices as the number of subjects is heavily limited. The training data availability settings in this study were set to [5%,10%,25%,50%,100%]percent5percent10percent25percent50percent100[5\%,10\%,25\%,50\%,100\%][ 5 % , 10 % , 25 % , 50 % , 100 % ] for OASIS and mSASHA. It is worth noticing that for limited data availability scenarios, the data used for training remained the same for all models in consideration, and the leave-out test split remained unchanged for all scenarios.

Experiment 2: Inclusion of Hidden States

Unlike most related works utilizing the original RIM framework (Lønning et al., 2019; Sabidussi et al., 2021) where two levels of hidden states are considered, we explored the impact of modifying or even turning off hidden states. In our implementation of convolutional GRU, at most two levels of hidden states 𝒉t1superscriptsubscript𝒉𝑡1\bm{h}_{t}^{1}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT and 𝒉t2superscriptsubscript𝒉𝑡2\bm{h}_{t}^{2}bold_italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT are considered following most of recent works using RIM (Lønning et al., 2019; Sabidussi et al., 2021, 2023). Two hidden states were supposed to have the same channels as in their corresponding convolutional GRU, where in this study, the channels were set to be [16,16]1616[16,16][ 16 , 16 ] for 3D OASIS and [32,16]3216[32,16][ 32 , 16 ] for mSASHA after taking the VRAM consumption into account. From this experiment, all experiments were implemented on the validation split for both datasets.

Experiment 3: Choice of Inner Loss Functions innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT

As detailed in the Methods section, the purpose innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT is different from previous iterative registration methods as innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT is not proposed as a choice of input of a convolutional recurrent neural network. We evaluated the ablation experiments from now on mSASHA which includes multiple-image registration with motion and varies in contrasts. Since it has been proposed previously that NMI is a more suitable choice of simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT for outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT in qMRI registration (de Vos et al., 2020) over NCC and MSE, we focused on the choice of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT to investigate the registration performance and the information propagated through the RIIR. That is, in this experiment, we evaluated the choice of outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT with MSE, NCC, and NMI with outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT as NMI.

Experiment 4: Inclusion of Gradient of Inner Loss ϕtinnersubscriptsubscriptitalic-ϕ𝑡subscriptinner\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}∇ start_POSTSUBSCRIPT bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT as RIIR Input

We performed an experimental study on the input composition for RIIR. As shown in Eq. 12, the goal was to study the data efficiency and the registration performance by incorporating the gradient of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT in RIIR. We could achieve the ablation by changing the input of gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT. A comparison with other input modelling strategies seen in Qiu et al. (2022) against the gradient-based input for gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT was proposed. Depending on whether the moving image is deformed (explicit) or not (implicit), we ended up with three input compositions:

  1. 1.

    Implicit Input without innersubscriptinner\nabla\mathcal{L}_{\text{inner}}∇ caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT: [ϕt,Imov,Ifix]subscriptbold-italic-ϕ𝑡subscript𝐼movsubscript𝐼fix[\bm{\phi}_{t},I_{\text{mov}},I_{\text{fix}}][ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ];

  2. 2.

    Explicit Input without innersubscriptinner\nabla\mathcal{L}_{\text{inner}}∇ caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT: [ϕt,Imovϕt,Ifix]subscriptbold-italic-ϕ𝑡subscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fix[\bm{\phi}_{t},I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}][ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ];

  3. 3.

    RIIR Input: [ϕt,ϕtinner]subscriptbold-italic-ϕ𝑡subscriptsubscriptbold-italic-ϕ𝑡subscriptinner[\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}][ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , ∇ start_POSTSUBSCRIPT bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT ].

This study aimed to provide insights into the impact of different input compositions on the efficiency of RIIR when data availability varies. We conducted the experiment with low data availability choices ([5%,10%]percent5percent10[5\%,10\%][ 5 % , 10 % ]) to examine the data efficiency induced by the gradient input.

Experiment 5: RIIR Architecture Ablation Since RIIR is the first attempt to formulate and implement the RIM framework for medical image registration, we performed an extensive ablation study on the network architecture of RIIR, including: 1. number of evaluation steps T𝑇Titalic_T; 2. choice of loss weights wtsubscript𝑤𝑡w_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

3.4 Implementation Details and Baseline Methods

We implemented the RIIR in the following settings for experiment 1: We used the inference steps T=6𝑇6T=6italic_T = 6 and exponential weighting for wt=10t1T1subscript𝑤𝑡superscript10𝑡1𝑇1w_{t}=10^{\frac{t-1}{T-1}}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT divide start_ARG italic_t - 1 end_ARG start_ARG italic_T - 1 end_ARG end_POSTSUPERSCRIPT. We compared our proposed RIIR with two representative methods in the deep learning paradigm: VoxelMorph (Balakrishnan et al., 2019) and a modification of GraDIRN (Qiu et al., 2022). We conducted an in-depth hyperparameter tuning to compare existing methods and RIIR. For VoxelMorph, we followed a standard setting: a U-Net (Ronneberger et al., 2015) with 5 down-sampling and 5 up-sampling blocks as the backbone. The channels used in each down-sampling block are [16,16,32,32,32]1616323232\left[16,16,32,32,32\right][ 16 , 16 , 32 , 32 , 32 ]. The implementation of GraDIRN followed the authors’ settings suggested in Qiu et al. (2022) with 2 resolutions and 3 iterations per resolution and the explicit warped images as input. The numbers of parameters are given as follows for RIIR, VoxelMorph, and GraDIRN respectively: 61K, 254K, and 173K. The trade-off parameter λ𝜆\lambdaitalic_λ for all models was set to 0.050.050.050.05 when considering MSE (for OASIS as simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT) and 0.10.10.10.1 when considering NMI (for mSASHA) as simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT. The optimizer of all methods remained the same using Adam (Kingma and Ba, 2015) with β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9 and β2=0.999subscript𝛽20.999\beta_{2}=0.999italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.999. The initial learning rate was set to 8×1048superscript1048\times 10^{-4}8 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT for all models. The experiments were performed on an NVIDIA RTX 4090 GPU with a VRAM of 24 GB. We will make our code for RIIR publicly available after acceptance, as well as our implementation of the baseline models.

4 Results

4.1 Experiment 1: Comparison Study with Varying Data Availability

The results for OASIS are depicted in Fig 4(a), as well as an illustrative visualization of RIIR inference on an example test data, as shown in Fig. 5. It is evident that RIIR outperforms when data availability is severely limited and maintains consistent performance across various data availability scenarios, showcasing its data efficiency and accuracy. For a more detailed comparison, the statistics when full data availability is guaranteed are presented in Tab. 1 for OASIS.

Refer to caption
(a) Dice score on OASIS.
Refer to caption
(b) 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT on mSASHA.
Figure 4: Results of Experiment 1 with boxplots for Dice score and 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT on two datasets, respectively. The circle denotes the mean of the metric of interest. The segmentation metric Dice is calculated for all 35 segmentation labels and post-processed by taking the average. The group-wise metric 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT was calculated based on further center-cropping at a ratio of 70%percent7070\%70 % on the warped images.
Refer to caption
Figure 5: A visualization of RIIR inference on OASIS test split, visualized with a 2D slice and in-plane deformation. The inference step was set to 6666 in both training and inference. All images in the same row were plotted using the same color range for better consistency.
Table 1: Results of Experiment 1 on OASIS with 100%percent100100\%100 % data availability. The metrics are reported with a format of mean(standard deviation). A double-sided Wilcoxon test is calculated for Dice score, and denotes a significant difference compared with the baseline VoxelMorph in terms of Dice score (p<0.05𝑝0.05p<0.05italic_p < 0.05).
Model Dice \uparrow HD \downarrow std(log|J|)std𝐽\text{std}(\log|J|)std ( roman_log | italic_J | ) \downarrow
VoxelMorph 0.694(0.035) 3.78(0.54) 0.26(0.11)
GraDIRN 0.697(0.034) 3.76(0.54) 0.31(0.14)
RIIR 0.724(0.031) 3.71(0.53) 0.32(0.12)

The results of this experiment on mSASHA are shown in Fig. 4(b) using a composition of boxplots. An illustrative evolving visualization of RIIR inference demonstrated in a step-wise fashion can be found in Fig. 6. The corresponding boxplots for 𝒟PCA2subscript𝒟PCA2\mathcal{D}_{\text{PCA2}}caligraphic_D start_POSTSUBSCRIPT PCA2 end_POSTSUBSCRIPT can be found in the supplementary material. Similarly, the statistics of mSASHA when full data availability is guaranteed are shown in in Tab. 2.

Table 2: Results of Experiment 1 on mSASHA with 100%percent100100\%100 % data availability. The metrics are reported with a format of mean(standard deviation). A double-sided Wilcoxon test is calculated for 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT, and denotes a significant difference compared with the baseline VoxelMorph (p<0.05𝑝0.05p<0.05italic_p < 0.05).
Model 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT \downarrow 𝒟PCA2subscript𝒟PCA2\mathcal{D}_{\text{PCA2}}caligraphic_D start_POSTSUBSCRIPT PCA2 end_POSTSUBSCRIPT \downarrow std(log|J|)std𝐽\text{std}(\log|J|)std ( roman_log | italic_J | ) \downarrow
VoxelMorph 0.425(0.126) 35.88(20.13) 0.33(0.25)
GraDIRN 0.343(0.078) 34.89(12.69) 1.56(0.78)
RIIR 0.324(0.072) 34.49(11.99) 2.18(0.79)
Refer to caption
Figure 6: A visualization of RIIR inference on mSASHA test split. The inference step was set to 6666 in both training and inference. All images in the same row were plotted using the same color range for better consistency. The deformation field was initialized as an identity mapping.

Overall, both iterative methods performed better across most data availability scenarios on mSASHA. As data availability approached saturation, the performances of both iterative methods converged. A further qualitative comparison among the three models is shown in Fig. 7.

Refer to caption
Figure 7: Qualitative comparison of mSASHA. The images were scaled with the same color range in each row for comparison. Artifacts (highlighted by red arrow) in the myocardium region are observed for both VoxelMorph and GraDIRN.

4.2 Experiment 2: Inclusion of Hidden States

The architectural settings were kept the same as in the aforementioned experiments and results are shown in Fig. 12. In addition to better performance in the two datasets, we also empirically noticed that the training was more stable with two hidden states both activated.

Refer to caption
Figure 8: Results of Experiment 2 evaluated on OASIS validation set regarding Dice score. Two-sided Wilcoxon tests were conducted for [1,1]11[1,1][ 1 , 1 ] against other settings with statistical significance (p<0.05)𝑝0.05(p<0.05)( italic_p < 0.05 ), except for [0,1]01[0,1][ 0 , 1 ] (p=0.11)𝑝0.11(p=0.11)( italic_p = 0.11 ).
Refer to caption
Figure 9: Results of Experiment 2 evaluated on mSASHA validation set regarding 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT. Here, for example, [0,0]00[0,0][ 0 , 0 ] denotes the case that no hidden states are considered, and [1,1]11[1,1][ 1 , 1 ] denotes both hidden states were considered in the pipeline. Two-sided Wilcoxon tests were conducted for [1,1]11[1,1][ 1 , 1 ] against other settings with statistical significance (p<0.05)𝑝0.05(p<0.05)( italic_p < 0.05 ).

4.3 Experiment 3: Choice of Inner Loss Functions innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT

Refer to caption
Figure 10: Results of Experiment 3 evaluated on mSASHA validation set. Raw denotes no registration is implemented to the image series.

The quantitative results, measured by 𝒟PCA1subscript𝒟PCA1\mathcal{D}_{\text{PCA1}}caligraphic_D start_POSTSUBSCRIPT PCA1 end_POSTSUBSCRIPT, are shown in Fig. 10. It is noteworthy that the performance when considering NMI as the likelihood function did not align with the performance of the other two scenarios. In response, we conducted a further investigation, presenting a qualitative visualization of innersubscriptinner\nabla\mathcal{L}_{\text{inner}}∇ caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT in Fig. 11 to shed light on this discrepancy. In our implementation, autograd (Paszke et al., 2017) were used to calculate innersubscriptinner\nabla\mathcal{L}_{\text{inner}}∇ caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT for NCC and NMI, while we used the analytical gradient of MSE due to its simplicity. However, the Gaussian approximation of NMI involves flattening Ifixsubscript𝐼fixI_{\text{fix}}italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT and Imovϕsubscript𝐼movbold-italic-ϕI_{\text{mov}}\circ\bm{\phi}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ to compute the joint probability, potentially resulting in the scattering pattern shown in Fig. 11 and influencing performance, as innersubscriptinner\nabla\mathcal{L}_{\text{inner}}∇ caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT is the sole image information passed into RIIR cell.

Refer to caption
Figure 11: Qualitative check of Experiment 3 with an example pair from mSASHA. Three gradient images are taken from the last step of the RIIR inference pipeline at the same epoch.

4.4 Experiment 4: Inclusion of Inner Loss Gradient as RIIR Input

The results are shown in Fig. 9. It can be observed that the network struggled if the warped image Imovϕsubscript𝐼movbold-italic-ϕI_{\text{mov}}\circ\bm{\phi}italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ is only implicitly fed to the RIIR cell, i.e., there’s an additional cost to learn the transformation. During our training, we also noticed that when trained with explicit warped images, the network tends to have a larger deformation magnitude since the images are more detailed compared to the gradient input (as shown in Fig. 11). Based on the results, we conjecture that the explicit input often offers more details to guide the optimization which benefits the training. However, it can be also redundant for training when data availability is limited thus potentially leading to overfitting.

Refer to caption
Figure 12: Results of Experiment 4 evaluated on mSASHA validation set. In the experiment, we have Implicit: [ϕt,Imov,Ifix]subscriptbold-italic-ϕ𝑡subscript𝐼movsubscript𝐼fix[\bm{\phi}_{t},I_{\text{mov}},I_{\text{fix}}][ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ]; Explicit: [ϕt,Imovϕt,Ifix]subscriptbold-italic-ϕ𝑡subscript𝐼movsubscriptbold-italic-ϕ𝑡subscript𝐼fix[\bm{\phi}_{t},I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}][ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT mov end_POSTSUBSCRIPT ∘ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT fix end_POSTSUBSCRIPT ]; and Default: [ϕt,ϕtinner]subscriptbold-italic-ϕ𝑡subscriptsubscriptbold-italic-ϕ𝑡subscriptinner[\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}][ bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , ∇ start_POSTSUBSCRIPT bold_italic_ϕ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT ].

4.5 Experiment 5: RIIR Architecture Ablation

We here demonstrate the model architecture ablation by showing the corresponding boxplots in Fig. 13. Apart from the main experiments shown previously, these experiments can be regarded as minor ablation studies, aiming to strike a balance between computational precision and inference speed.

Refer to caption
Refer to caption
Figure 13: Results of Experiment 5. For the ablation study on network steps, double-sided Wilcoxon tests suggest no significant difference is found for t=6,8,10𝑡6810t=6,8,10italic_t = 6 , 8 , 10. While for the ablation on loss weighting strategy, the exponential decay (denoted by ‘Decay’) shows a significant difference to uniform weighting (‘Uniform’) (p<0.05)𝑝0.05(p<0.05)( italic_p < 0.05 ).

5 Discussion

In this study, we introduced RIIR, a deep learning-based medical image registration method that leverages recurrent inferences as a meta-learning strategy. We extended Recurrent Inference Machines (RIMs) to the image registration problem, which has no explicit forward models. RIIR was extensively evaluated on public brain MR and in-house quantitative cardiac MR datasets, and demonstrated consistently improved performance over established deep learning models, in both one-step and iterative settings. Additionally, our ablation study confirmed the importance of incorporating hidden states within the RIM-based framework.

The acclaimed performance improvement of registration is especially pronounced in scenarios with limited training data as demonstrated in Fig. 4. Notably, RIIR achieved superior average evaluation metrics on segmentation accuracy and groupwise alignment, with significantly lower variance. This performance advantage was pronounced in the mSASHA experiments with very limited training data, a stark contrast to the baseline models which required much more training data to reach similar performances. Though both GraDIRN and RIIR are iterative methods, GraDIRN isolates the update of explicit simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT and deep learning-based regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT with no internal states, thus potentially resulting in worse generalization and slower convergence compared with RIIR. Notably, GraDIRN initialized the deformation field randomly by default, which could lead to optimization difficulties when data were extremely limited during training. Furthermore, in the mSASHA dataset, the qualitative visual comparison in Fig. 7 highlights that incorporating gradient information in our RIIR method can better preserve the original structure of the myocardium in the registration process. Furthermore, the step-wise loss function, as used in Eq. 13, also prevents exaggerated registration that can compromise the accuracy and affect the resulting quantitative maps as shown in the baseline methods.

Although hidden states were used in the original RIM and later works (Putzky and Welling, 2017; Putzky et al., 2019; Sabidussi et al., 2021, 2023), it has not been investigated in detail regarding its impact on optimization of RIM-based methods. Our second experiment investigates the impact of these hidden states within RIIR. Our findings reveal that the presence of hidden states, as proposed in the original RIM work (Putzky and Welling, 2017), contributes positively to the performance of our model, as shown by the quantitative results in Fig. 9 and Fig. 8. However, in the 3D OASIS dataset, this improvement was less pronounced compared to the mSASHA dataset, where multiple images in a time series are evaluated together.

We further investigated the influence of inner loss selection (Experiment 3), which was not studied before in iterative deep learning-based registration (Qiu et al., 2022). In mSAHSA, with simsubscriptsim\mathcal{L}_{\text{sim}}caligraphic_L start_POSTSUBSCRIPT sim end_POSTSUBSCRIPT chosen as NMI in outersubscriptouter\mathcal{L}_{\text{outer}}caligraphic_L start_POSTSUBSCRIPT outer end_POSTSUBSCRIPT, the quantitative results are not as satisfactory as other choices (MSE, NCC). This is probably due to the implementation of NMI operation that may dissect the neighborhood structure of the original image for RIIR Cell as input as shown in Fig. 11. Nevertheless, this discrepancy suggests that the gradient information of innersubscriptinner\mathcal{L}_{\text{inner}}caligraphic_L start_POSTSUBSCRIPT inner end_POSTSUBSCRIPT can be considered as a form of compressed image data that is processed with the convolutional nature in RIIR Cell. The ablation study on input combinations (Experiment 4) indicates that, in scenarios with limited data, the inclusion of gradient input effectively prevents model overfitting, thus leading to better generalization performance as shown in Fig. 12.

RIIR’s superior registration performance and data efficiency suggest its potential for applications in medical image registration. However, it is necessary to acknowledge the current limitations, to further enhance the framework in future work. First, the performance of RIIR could be bottlenecked by the simplistic convolutional GRU unit, where potential improvement can be considered such as dilated convolutional kernels which can preserve larger receptive fields. Second, the expanding computation graph of RIM-based models, with increasing inference steps, leads to heavy GPU memory consumption.

6 Conclusion

In conclusion, we present RIIR, a novel recurrent deep-learning framework for medical image registration. RIIR significantly extends the concept of recurrent inference machines for inverse problem solving, to high-dimensional optimization challenges with no closed-form forward models. Meanwhile, RIIR distinguishes itself from previous iterative methods by integrating implicit regularization with explicit loss gradients. Our experiments across diverse medical image datasets demonstrated RIIR’s superior accuracy and data efficiency. We also empirically demonstrated the effectiveness of its architectural design and the value of hidden states, significantly enhancing both registration accuracy and data efficiency. RIIR is shown to be an effective and generalizable tool for medical image registration, and potentially extends to other high-dimensional optimization problems.


Acknowledgments

This work was partly supported by the TU Delft AI Initiative, Amazon Research Awards, and the National Heart, Lung, and Blood Institute, National Institutes of Health by the Division of Intramural Research (Z1A-HL006214).


Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.


Conflicts of Interest

We declare we don’t have conflicts of interest.

References

  • Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, 29, 2016.
  • Avants et al. (2008) Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis, 12(1):26–41, 2008.
  • Avants et al. (2011) Brian B Avants, Nicholas J Tustison, Gang Song, Philip A Cook, Arno Klein, and James C Gee. A reproducible evaluation of ants similarity metric performance in brain image registration. NeuroImage, 54(3):2033–2044, 2011.
  • Balakrishnan et al. (2019) Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Voxelmorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 38(8):1788–1800, 2019.
  • Byrne et al. (2022) Mikel Byrne, Ben Archibald-Heeren, Yunfei Hu, Amy Teh, Rhea Beserminji, Emma Cai, Guilin Liu, Angela Yates, James Rijken, Nick Collett, and Trent Aland. Varian ethos online adaptive radiotherapy for prostate cancer: Early results of contouring accuracy, treatment plan quality, and treatment time. Journal of Applied Clinical Medical Physics, 23(1):e13479, 2022. .
  • Cao et al. (2017) Xiaohuan Cao, Jianhua Yang, Jun Zhang, Dong Nie, Minjeong Kim, Qian Wang, and Dinggang Shen. Deformable image registration based on similarity-steered cnn regression. In Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part I 20, pages 300–308. Springer, 2017.
  • Chen et al. (2017) Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando Freitas. Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning, pages 748–756. PMLR, 2017.
  • Chow et al. (2022) Kelvin Chow, Genevieve Hayes, Jacqueline A Flewitt, Patricia Feuchter, Carmen Lydell, Andrew Howarth, Joseph J Pagano, Richard B Thompson, Peter Kellman, and James A White. Improved accuracy and precision with three-parameter simultaneous myocardial t1 and t2 mapping using multiparametric sasha. Magnetic Resonance in Medicine, 87(6):2775–2791, 2022.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Cotter and Conwell (1990) Neil E Cotter and Peter R Conwell. Fixed-weight networks can learn. In International Joint Conference on Neural Networks, pages 553–559. IEEE, 1990.
  • Dalca et al. (2019) Adrian V Dalca, Guha Balakrishnan, John Guttag, and Mert R Sabuncu. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis, 57:226–236, 2019.
  • De Vos et al. (2019) Bob D De Vos, Floris F Berendsen, Max A Viergever, Hessam Sokooti, Marius Staring, and Ivana Išgum. A deep learning framework for unsupervised affine and deformable image registration. Medical Image Analysis, 52:128–143, 2019.
  • de Vos et al. (2020) Bob D de Vos, Bas HM van der Velden, Jörg Sander, Kenneth GA Gilhuijs, Marius Staring, and Ivana Išgum. Mutual information for unsupervised deep learning image registration. In Medical Imaging 2020: Image Processing, volume 11313, pages 155–161. SPIE, 2020.
  • Fechter and Baltas (2020) Tobias Fechter and Dimos Baltas. One-shot learning for deformable medical image registration and periodic motion tracking. IEEE Transactions on Medical Imaging, 39(7):2506–2517, 2020.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  • Haskins et al. (2020) Grant Haskins, Uwe Kruger, and Pingkun Yan. Deep learning in medical image registration: a survey. Machine Vision and Applications, 31:1–18, 2020.
  • Hering et al. (2019) Alessa Hering, Bram van Ginneken, and Stefan Heldmann. mlvirnet: Multilevel variational image registration network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, pages 257–265. Springer, 2019.
  • Hoopes et al. (2021) Andrew Hoopes, Malte Hoffmann, Bruce Fischl, John Guttag, and Adrian V Dalca. Hypermorph: Amortized hyperparameter learning for image registration. In Information Processing in Medical Imaging: 27th International Conference, IPMI 2021, Virtual Event, June 28–June 30, 2021, Proceedings 27, pages 3–17. Springer, 2021.
  • Hospedales et al. (2021) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2021.
  • Huizinga et al. (2016) W. Huizinga, D.H.J. Poot, J.-M. Guyader, R. Klaassen, B.F. Coolen, M. van Kranenburg, R.J.M. van Geuns, A. Uitterdijk, M. Polfliet, J. Vandemeulebroucke, A. Leemans, W.J. Niessen, and S. Klein. Pca-based groupwise image registration for quantitative mri. Medical Image Analysis, 29:65–78, 2016. ISSN 1361-8415. . URL https://www.sciencedirect.com/science/article/pii/S1361841515001851.
  • Jin et al. (2021) Cheng Jin, Heng Yu, Jia Ke, Peirong Ding, Yongju Yi, Xiaofeng Jiang, Xin Duan, Jinghua Tang, Daniel T Chang, Xiaojian Wu, Feng Gao, and Ruijiang Li. Predicting treatment response from longitudinal images using multi-task deep learning. Nature Communications, 12(1):1851, 2021.
  • Kanter and Lellmann (2022) Frederic Kanter and Jan Lellmann. A flexible meta learning model for image registration. In Ender Konukoglu, Bjoern Menze, Archana Venkataraman, Christian Baumgartner, Qi Dou, and Shadi Albarqouni, editors, Proceedings of The 5th Medical Imaging with Deep Learning, volume 172 of Proceedings of Machine Learning Research, pages 638–652. PMLR, 06–08 Jul 2022. URL https://proceedings.mlr.press/v172/kanter22a.html.
  • Karkalousos et al. (2022) Dimitrios Karkalousos, Samantha Noteboom, Hanneke E Hulst, Franciscus M Vos, and Matthan WA Caan. Assessment of data consistency through cascades of independently recurrent inference machines for fast and robust accelerated mri reconstruction. Physics in Medicine & Biology, 67(12):124001, 2022.
  • King et al. (2010) Andrew P King, Kawal S Rhode, Y Ma, Cheng Yao, Christian Jansen, Reza Razavi, and Graeme P Penney. Registering preprocedure volumetric images with intraprocedure 3-d ultrasound using an ultrasound imaging model. IEEE Transactions on Medical Imaging, 29(3):924–937, 2010.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Klein et al. (2007) Stefan Klein, Marius Staring, and Josien PW Pluim. Evaluation of optimization methods for nonrigid medical image registration using mutual information and b-splines. IEEE Transactions on Image Processing, 16(12):2879–2890, 2007.
  • Liu et al. (2021) Risheng Liu, Zi Li, Xin Fan, Chenying Zhao, Hao Huang, and Zhongxuan Luo. Learning deformable image registration from optimization: perspective, modules, bilevel training and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7688–7704, 2021.
  • Lønning et al. (2019) Kai Lønning, Patrick Putzky, Jan-Jakob Sonke, Liesbeth Reneman, Matthan WA Caan, and Max Welling. Recurrent inference machines for reconstructing heterogeneous mri data. Medical Image Analysis, 53:64–78, 2019.
  • Marcus et al. (2007) Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience, 19(9):1498–1507, 2007.
  • Messroghli et al. (2004) Daniel R Messroghli, Aleksandra Radjenovic, Sebastian Kozerke, David M Higgins, Mohan U Sivananthan, and John P Ridgway. Modified look-locker inversion recovery (molli) for high-resolution t1 mapping of the heart. Magnetic Resonance in Medicine, 52(1):141–146, 2004.
  • Miao et al. (2016) Shun Miao, Z Jane Wang, and Rui Liao. A cnn regression approach for real-time 2d/3d registration. IEEE Transactions on Medical Imaging, 35(5):1352–1363, 2016.
  • Modi et al. (2021) Chirag Modi, François Lanusse, Uroš Seljak, David N Spergel, and Laurence Perreault-Levasseur. Cosmicrim: reconstructing early universe by combining differentiable simulations with recurrent inference machines. arXiv preprint arXiv:2104.12864, 2021.
  • Morningstar et al. (2019) Warren R Morningstar, Laurence Perreault Levasseur, Yashar D Hezaveh, Roger Blandford, Phil Marshall, Patrick Putzky, Thomas D Rueter, Risa Wechsler, and Max Welling. Data-driven reconstruction of gravitationally lensed galaxies using recurrent inference machines. The Astrophysical Journal, 883(1):14, 2019. .
  • Muckley et al. (2021) Matthew J. Muckley, Bruno Riemenschneider, Alireza Radmanesh, Sunwoo Kim, Geunu Jeong, Jingyu Ko, Yohan Jun, Hyungseob Shin, Dosik Hwang, Mahmoud Mostapha, Simon Arberet, Dominik Nickel, Zaccharie Ramzi, Philippe Ciuciu, Jean-Luc Starck, Jonas Teuwen, Dimitrios Karkalousos, Chaoping Zhang, Anuroop Sriram, Zhengnan Huang, Nafissa Yakubova, Yvonne W. Lui, and Florian Knoll. Results of the 2020 fastmri challenge for machine learning mr image reconstruction. IEEE Transactions on Medical Imaging, 40(9):2306–2317, 2021. .
  • Oliveira and Tavares (2014) Francisco PM Oliveira and Joao Manuel RS Tavares. Medical image registration: a review. Computer Methods in Biomechanics and Biomedical Engineering, 17(2):73–93, 2014.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • Putzky and Welling (2017) Patrick Putzky and Max Welling. Recurrent inference machines for solving inverse problems. arXiv preprint arXiv:1706.04008, 2017.
  • Putzky et al. (2019) Patrick Putzky, Dimitrios Karkalousos, Jonas Teuwen, Nikita Miriakov, Bart Bakker, Matthan Caan, and Max Welling. i-rim applied to the fastmri challenge. arXiv preprint arXiv:1910.08952, 2019.
  • Qiu et al. (2021) Huaqi Qiu, Chen Qin, Andreas Schuh, Kerstin Hammernik, and Daniel Rueckert. Learning diffeomorphic and modality-invariant registration using b-splines. In Medical Imaging with Deep Learning, 2021.
  • Qiu et al. (2022) Huaqi Qiu, Kerstin Hammernik, Chen Qin, Chen Chen, and Daniel Rueckert. Embedding gradient-based optimization in image registration networks. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2022, pages 56–65. Springer, 2022.
  • Rohé et al. (2017) Marc-Michel Rohé, Manasi Datar, Tobias Heimann, Maxime Sermesant, and Xavier Pennec. Svf-net: learning deformable image registration using shape matching. In Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part I 20, pages 266–274. Springer, 2017.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Rueckert and Schnabel (2019) Daniel Rueckert and Julia A Schnabel. Model-based and data-driven strategies in medical image computing. Proceedings of the IEEE, 108(1):110–124, 2019.
  • Sabidussi et al. (2021) Emanoel R Sabidussi, Stefan Klein, Matthan WA Caan, Shabab Bazrafkan, Arnold J den Dekker, Jan Sijbers, Wiro J Niessen, and Dirk HJ Poot. Recurrent inference machines as inverse problem solvers for mr relaxometry. Medical Image Analysis, 74:102220, 2021.
  • Sabidussi et al. (2023) ER Sabidussi, S Klein, B Jeurissen, and DHJ Poot. dtiRIM: A generalisable deep learning method for diffusion tensor imaging. NeuroImage, 269:119900, 2023.
  • Sandkühler et al. (2019) Robin Sandkühler, Simon Andermatt, Grzegorz Bauman, Sylvia Nyilas, Christoph Jud, and Philippe C Cattin. Recurrent registration neural networks for deformable image registration. Advances in Neural Information Processing Systems, 32, 2019.
  • Sauer (2006) Frank Sauer. Image registration: enabling technology for image guided surgery and therapy. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, pages 7242–7245. IEEE, 2006.
  • Schmidhuber (1993) Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, pages 407–412. IEEE, 1993.
  • Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 2015.
  • Sotiras et al. (2013) Aristeidis Sotiras, Christos Davatzikos, and Nikos Paragios. Deformable medical image registration: A survey. IEEE Transactions on Medical Imaging, 32(7):1153–1190, 2013.
  • Staring et al. (2009) Marius Staring, Uulke A van der Heide, Stefan Klein, Max A Viergever, and Josien PW Pluim. Registration of cervical mri using multifeature mutual information. IEEE Transactions on Medical Imaging, 28(9):1412–1421, 2009.
  • Studholme et al. (1999) Colin Studholme, Derek LG Hill, and David J Hawkes. An overlap invariant entropy measure of 3d medical image alignment. Pattern Recognition, 32(1):71–86, 1999.
  • Thévenaz and Unser (2000) Philippe Thévenaz and Michael Unser. Optimization of mutual information for multiresolution image registration. IEEE Transactions on Image Processing, 9(12):2083–2099, 2000.
  • van Harten et al. (2023) Louis van Harten, Rudolf Leonardus Mirjam Van Herten, Jaap Stoker, and Ivana Isgum. Deformable image registration with geometry-informed implicit neural representations. In Medical Imaging with Deep Learning, 2023.
  • Wolterink et al. (2022) Jelmer M Wolterink, Jesse C Zwienenberg, and Christoph Brune. Implicit neural representations for deformable image registration. In Medical Imaging with Deep Learning, pages 1349–1359. PMLR, 2022.
  • Xu et al. (2021) Junshen Xu, Eric Z Chen, Xiao Chen, Terrence Chen, and Shanhui Sun. Multi-scale neural odes for 3d medical image registration. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pages 213–223. Springer, 2021.
  • Yang et al. (2016) Xiao Yang, Roland Kwitt, and Marc Niethammer. Fast predictive image registration. In Deep Learning and Data Labeling for Medical Applications: First International Workshop, LABELS 2016, and Second International Workshop, DLMIA 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, October 21, 2016, Proceedings 1, pages 48–57. Springer, 2016.
  • Yang et al. (2017) Xiao Yang, Roland Kwitt, Martin Styner, and Marc Niethammer. Quicksilver: Fast predictive image registration – a deep learning approach. NeuroImage, 158:378–396, 2017. ISSN 1053-8119. .
  • Younger et al. (1999) A Steven Younger, Peter R Conwell, and Neil E Cotter. Fixed-weight on-line learning. IEEE Transactions on Neural Networks, 10(2):272–283, 1999.
  • Zbontar et al. (2018) Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael G. Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. Sodickson, and Yvonne W. Lui. fastmri: An open dataset and benchmarks for accelerated mri. arXiv preprint arXiv:1811.08839, 2018.
  • Zhang et al. (2021) Yungeng Zhang, Yuru Pei, and Hongbin Zha. Learning dual transformer network for diffeomorphic registration. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pages 129–138. Springer, 2021.
  • Zhao et al. (2019) Shengyu Zhao, Yue Dong, Eric I Chang, and Yan Xu. Recursive cascaded networks for unsupervised medical image registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10600–10610, 2019.

A Supplementary Material

Refer to caption
(a) HD on test OASIS data.
Refer to caption
(b) 𝒟PCA2subscript𝒟PCA2\mathcal{D}_{\text{PCA2}}caligraphic_D start_POSTSUBSCRIPT PCA2 end_POSTSUBSCRIPT on test mSASHA data.
Figure 14: Supplementary results of Experiment 1.

We show the boxplots of Experiment 1 here in terms of HD and 𝒟PCA2subscript𝒟PCA2\mathcal{D}_{\text{PCA2}}caligraphic_D start_POSTSUBSCRIPT PCA2 end_POSTSUBSCRIPT on OASIS and mSASHA datasets respectively in Fig. 14.