Recurrent Inference Machine for Medical Image Registration

Yi Zhang (y.zhang-43@tudelft.nl)
Department of Imaging Physics, Delft University of Technology, Delft, The Netherlands

Yidong Zhao (y.zhao-8@tudelft.nl)
Department of Imaging Physics, Delft University of Technology, Delft, The Netherlands

Hui Xue (xueh@microsoft.com)
Health Futures, Microsoft Research, Redmond, Washington, USA

Peter Kellman (kellmanp@nhlbi.nih.gov)
National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA

Stefan Klein (s.klein@erasmusmc.nl)
Department of Radiology and Nuclear Medicine, Erasmus University Medical Center, Rotterdam, The Netherlands

Qian Tao (q.tao@tudelft.nl)
Department of Imaging Physics, Delft University of Technology, Delft, The Netherlands

Abstract

Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods have become competitive thanks to their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues by learning the update rule of optimization, with implicit regularization combined with explicit gradient input.

We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only $5\%$ of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.

Keywords: Deep Learning, Image Registration, Recurrent Inference Machine

1 Introduction

Medical image registration, the process of establishing anatomical correspondences between two or more medical images, finds wide applications in medical imaging research, including imaging feature fusion (Haskins et al., 2020; Oliveira and Tavares, 2014), treatment planning (Staring et al., 2009; King et al., 2010; Byrne et al., 2022), and longitudinal patient studies (Sotiras et al., 2013; Jin et al., 2021). Medical image registration is traditionally formulated as an optimization problem, which aims to solve a parameterized transformation in an iterative manner (Klein et al., 2007). Typically, the optimization objective consists of two parts: a similarity term that enforces the alignment between images, and a regularization term that imposes smoothness constraints. Due to the complexity of non-convex optimization, traditional methods often struggle with long run times, especially for large, high-resolution images. This hinders their practical use in clinical practice, e.g. surgery guidance (Sauer, 2006), where fast image registration is demanded (Avants et al., 2011; Balakrishnan et al., 2019).

With recent developments in machine learning, the data-driven deep-learning paradigm has gained popularity in medical image registration (Rueckert and Schnabel, 2019). Instead of iteratively updating the transformation parameters by a conventional optimization pipeline, deep learning-based methods make fast image-to-transformation predictions at inference time. Early works learned the transformation in a supervised manner (Miao et al., 2016; Yang et al., 2016), while unsupervised learning methods later became prevalent. They adopt similar loss functions as those in conventional methods but optimize them through amortized neural networks (Balakrishnan et al., 2019; De Vos et al., 2019). These works demonstrate the great potential of deep learning-based models for medical image registration. Nonetheless, one-step inference of image transformation is in principle a difficult problem, compared to the iterative approach, especially when the deformation field is large. In practice, the one-step inference requires a relatively large amount of data to train the deep learning network for consistent prediction, and may still lead to unexpected transformations at inference time (Fechter and Baltas, 2020; Hering et al., 2019; Zhao et al., 2019).

In contrast to one-step inference, recent studies revisited iterative registration, using multi-step inference processes (Fechter and Baltas, 2020; Kanter and Lellmann, 2022; Qiu et al., 2022; Sandkühler et al., 2019; Zhao et al., 2019). Some of these iterative methods (Kanter and Lellmann, 2022; Qiu et al., 2022) fall within the realm of meta-learning. Instead of learning the optimized parameters, meta-learning focuses on learning the optimization process itself. The use of meta-learning in optimization, as explored by Andrychowicz et al. (2016) and Finn et al. (2017) for image classification tasks, has led to enhanced generalization and faster convergence. For medical imaging applications, a prominent example is the recurrent inference machine (RIM) by Putzky and Welling (2017), originally proposed to solve inverse problems with explicit forward physics models. RIM has demonstrated excellent performance in fast MRI reconstruction (Lønning et al., 2019) and MR relaxometry (Sabidussi et al., 2021).

In this study, we propose a novel meta-learning medical image registration method, named Recurrent Inference Image Registration (RIIR). RIIR is inspired by RIM, but significantly extends its concept to solve more generic optimization problems: different from inverse problems, medical image registration presents a high-dimensional optimization challenge with no closed-form forward model. Below we provide a detailed review to motivate our work.

1.1 Related Works

In this section, we review deep learning-based medical image registration methods in more detail, categorizing them into one-step methods for direct image-to-transformation inference, and iterative methods for multi-step inference. Additionally, we provide a brief overview of meta-learning for medical imaging applications.

1.1.1 One-step Deep Learning-based Registration

Early attempts at utilizing convolutional neural networks (CNNs) for medical image registration supported confined transformations, such as SVF-Net (Rohé et al., 2017), Quicksilver (Yang et al., 2017), and the work of Cao et al. (2017), which are mostly trained in a supervised manner. With the introduction of the U-Net architecture (Ronneberger et al., 2015), which has excellent spatial expression capability thanks to its multi-resolution design and skip connections, Balakrishnan et al. (2019), Dalca et al. (2019), and Hoopes et al. (2021) proposed unsupervised deformable registration frameworks. In the work of De Vos et al. (2019), a combination of affine and deformable transformations was further considered. More recent methods extended the framework by different neural network backbones such as transformers (Zhang et al., 2021) or implicit neural representations (Wolterink et al., 2022; van Harten et al., 2023).

1.1.2 Iterative Deep Learning-based Registration

However, a one-step inference strategy may struggle when predicting large and complex transformations (Hering et al., 2019; Zhao et al., 2019). In contrast to one-step deep learning-based registration methods, recent work adopted iterative processes, reincarnating the conventional pipeline of optimization for medical image registration, either in terms of image resolution (Hering et al., 2019; Fechter and Baltas, 2020; Xu et al., 2021; Liu et al., 2021), multiple optimization steps (Zhao et al., 2019; Sandkühler et al., 2019; Kanter and Lellmann, 2022), or both combined (Qiu et al., 2022). In Sandkühler et al. (2019), the use of an RNN with gated recurrent units (GRU) (Chung et al., 2014) was considered, where each step progressively updates the transformation by adding an independent parameterized transformation. Another multi-step method proposed in Zhao et al. (2019) uses recursive cascaded networks to generate a sequence of transformations, which are then composed to obtain the final transformation. However, the method requires independent modules for each step, which can be memory-inefficient. Hering et al. (2019) proposed a variational method on different levels of resolution, where the final transformation is the composition of the transformations from coarse- to fine-grained. Fechter and Baltas (2020) address the importance of data efficiency of deep learning-based models by evaluating the model performance when data availability is limited and a large domain shift exists. A more recent work proposed in Qiu et al. (2022), Gradient Descent Network for Image Registration (GraDIRN), integrates multi-step and multi-resolution strategies for medical image registration. Specifically, the update rule follows the idea of conventional optimization by deriving the gradient of the similarity term w.r.t. the current transformation and using a CNN to estimate the gradient of the regularization term. Though the direct influence of the gradient term is shown to be minor compared to the CNN output (Qiu et al., 2022), the method bridges gradient-based optimization and deep learning-based methods. The method proposed in Kanter and Lellmann (2022) used individual long short-term memory (LSTM) modules for implementing recurrent refinement of the transformation. However, the scope of the work is limited to affine transformation, which only serves as an initialization for the conventional medical image registration pipeline.

1.1.3 Meta-Learning and Recurrent Inference Machine

Meta-learning, also described as “learning to learn”, is a subfield of machine learning. In this approach, an outer algorithm updates an inner learning algorithm, enabling the model to adapt and optimize its learning strategy to achieve a broader objective. For example, in a meta-learning scenario, a model could be trained on a variety of tasks, such as different types of image recognition, with the goal of quickly adapting to unseen similar tasks, like recognizing new kinds of objects not included in the original training set, using a few training samples (Hospedales et al., 2021). An early approach in meta-learning is designing an architecture of networks that can update their parameters according to different tasks and data inputs (Schmidhuber, 1993). The works of Cotter and Conwell (1990) and Younger et al. (1999) further show that a fixed-weight RNN demonstrates flexibility in learning multiple tasks. More recently, methods learning an optimization process with RNNs were developed and studied in Andrychowicz et al. (2016); Chen et al. (2017); Finn et al. (2017), demonstrating superior convergence speed and better generalization ability for unseen tasks.

In the spirit of meta-learning, RIM was developed by Putzky and Welling (2017) to solve inverse problems. RIM learns a single recurrent architecture that shares the parameters across all iterations, with internal states passing through iterations (Putzky and Welling, 2017). In the context of meta-learning, RIM distinguishes two tasks of different levels: the ‘inner task’, which focuses on solving a specific inverse problem (e.g., superresolution of an image), and the ‘outer task’, aimed at optimizing the optimization process itself. This setting enables RIM to efficiently learn and apply optimization strategies to complex problems. RIM has shown robust and competitive performance across different application domains, from cosmology (Morningstar et al., 2019; Modi et al., 2021) to medical imaging (Karkalousos et al., 2022; Lønning et al., 2019; Putzky et al., 2019; Sabidussi et al., 2021, 2023). They all aim to solve an inverse problem with a known differentiable forward model in closed form, such as Fourier transform with sensitivity map and sampling mask in MRI reconstruction (Lønning et al., 2019).

However, the definition of an explicit forward model does not exist for the medical image registration task. In this work, we sought to extend the framework of RIM, which demonstrated state-of-the-art performance in medical image reconstruction challenges (Muckley et al., 2021; Putzky et al., 2019; Zbontar et al., 2018), to the medical image registration problem. The same formulation can be generalized to other high-dimensional optimization problems where explicit forward models are absent but differentiable evaluation metrics are available.

1.2 Contributions

The main contributions of our work are three-fold:

1. We propose a novel meta-learning framework, RIIR, for medical image registration. RIIR learns the optimization process, in the absence of explicit forward models. RIIR is flexible w.r.t. the input modality while demonstrating competitive accuracy in different medical image registration applications.

2. Unlike existing iterative deep learning-based methods, our method integrates the gradient information of input images into the prediction of dense incremental transformations. As such, RIIR largely simplifies the learning task compared to one-step inference, significantly enhancing the overall data efficiency, as demonstrated by our experiments.

3. Through in-depth ablation experiments, we not only showed the flexibility of our proposed method with varying input choices but also investigated how different architectural choices within the RIM framework impact its performance. In particular, we showed the added value of hidden states in solving complex optimization problems in the context of medical imaging, which was under-explored in existing literature.

2 Methods

2.1 Deformable Image Registration

Deformable image registration aims to align a moving image $I_{\text{mov}}$ to a fixed image $I_{\text{fix}}$ by determining a transformation $\bm{\phi}$ acting on the shared coordinates $\bm{x}$ , such that the transformed image $I_{\text{mov}}\circ\bm{\phi}$ is similar enough to $I_{\text{fix}}$ . The similarity is often evaluated by a scalar-valued metric. In deformable image registration, $\bm{\phi}$ is considered to be a relatively small displacement added to the original coordinate $\bm{x}$ , expressed as $\bm{\phi}=\bm{x}+u(\bm{x})$ . Since the transformation $\bm{\phi}$ is calculated between the pair $(I_{\text{mov}},I_{\text{fix}})$ , the process is often referred to as pairwise registration (Balakrishnan et al., 2019). Finding such transformation $\bm{\phi}$ in pairwise registration can be viewed as the following optimization problem:

$$\hat{\bm{\phi}}=\underset{\bm{\phi}}{\operatorname{argmin}}\ \mathcal{L}_{\text{sim}}(I_{\text{mov}}\circ\bm{\phi},I_{\text{fix}})+\lambda\mathcal{L}_{\text{reg}}(\bm{\phi}), \tag{1}$$

where $\mathcal{L}_{\text{sim}}$ is a similarity term between the deformed image $I_{\text{mov}}\circ\bm{\phi}$ and fixed image $I_{\text{fix}}$ , $\mathcal{L}_{\text{reg}}$ is a regularization term constraining $\bm{\phi}$ , and $\lambda$ is a trade-off weight term.
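To make Eq. 1 concrete, the following minimal sketch evaluates the objective for a toy 1D signal pair. It is an illustration under stated assumptions rather than the implementation used in this work: MSE stands in for $\mathcal{L}_{\text{sim}}$ , a squared forward difference of the displacement stands in for $\mathcal{L}_{\text{reg}}$ , warping uses clamped linear interpolation, and the helper names (`warp_1d`, `mse`, `smoothness`, `objective`) are hypothetical.

```python
def warp_1d(signal, phi):
    """Sample `signal` at real-valued coordinates `phi` (clamped linear interpolation)."""
    last = len(signal) - 1
    out = []
    for p in phi:
        p = min(max(p, 0.0), float(last))  # clamp to the valid coordinate range
        i0 = min(int(p), last - 1)         # left neighbour index
        w = p - i0                         # interpolation weight of the right neighbour
        out.append((1.0 - w) * signal[i0] + w * signal[i0 + 1])
    return out


def mse(a, b):
    """Mean squared error, used here as the similarity term (assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def smoothness(phi):
    """Mean squared forward difference of the displacement u = phi - x (assumption)."""
    u = [p - k for k, p in enumerate(phi)]
    return sum((u[k + 1] - u[k]) ** 2 for k in range(len(u) - 1)) / (len(u) - 1)


def objective(i_mov, i_fix, phi, lam):
    """Similarity of the warped moving signal to the fixed one, plus weighted regularizer."""
    return mse(warp_1d(i_mov, phi), i_fix) + lam * smoothness(phi)


# Toy pair: the moving signal is the fixed signal shifted right by one sample.
i_fix = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
i_mov = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
identity = [float(k) for k in range(6)]
shifted = [k + 1.0 for k in range(6)]  # phi(x) = x + 1 undoes the shift

cost_identity = objective(i_mov, i_fix, identity, lam=0.1)
cost_shifted = objective(i_mov, i_fix, shifted, lam=0.1)
```

In this toy case the constant displacement aligns the two signals and has zero smoothness penalty, so `cost_shifted` is lower than `cost_identity` for any non-negative $\lambda$ .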

2.2 Recurrent Inference Machine (RIM)

Figure 1: Overview of RIIR framework. Here, an illustrative cardiac image pair is shown as an example. The hidden states $\bm{h}_{t}=[\bm{h}_{t}^{1},\bm{h}_{t}^{2}]$ are visualized in channel-wise fashion. The inner loss $\mathcal{L}_{\text{inner}}$ is calculated during each step of RIIR and thus changes dynamically. When $t=0$ , the deformation field $\bm{\phi}_{0}$ is initialized as an identity transformation. In RIIR Cell, the dimensions of the Conv and ConvGRU layers depend on the input (2D or 3D).

The idea of RIM originates from solving a closed-form inverse problem (Putzky and Welling, 2017):

$$\bm{y}=A\bm{x}+\bm{n}, \tag{2}$$

where $\bm{y}\in\mathbb{R}^{m}$ is a noisy measurement vector, $\bm{x}\in\mathbb{R}^{d}$ is the underlying noiseless signal, $A\in\mathbb{R}^{m\times d}$ is a measurement matrix, and $\bm{n}$ is a random noise vector. When $m\ll d$ , the inverse problem is ill-posed. Thus, to constrain the solution space of $\bm{x}$ , a common practice is to solve a maximum a posteriori (MAP) problem:

$$\max_{\bm{x}}\log\mathcal{L}_{\text{likelihood}}(\bm{y}|\bm{x})+\log p_{\text{prior}}(\bm{x}), \tag{3}$$

where $\mathcal{L}_{\text{likelihood}}(\bm{y}|\bm{x})$ is a likelihood term representing the noisy forward model, such as the Fourier transform with masks in MRI reconstruction (Putzky et al., 2019), and $p_{\text{prior}}$ is the prior distribution of the underlying signal $\bm{x}$ . A simple iterative scheme at step $t$ for solving Eq. 3 is via gradient ascent:

$$\bm{x}_{t+1}=\bm{x}_{t}+\gamma_{t}\nabla_{\bm{x}_{t}}\left(\log\mathcal{L}_{\text{likelihood}}(\bm{y}|\bm{x})+\log p_{\text{prior}}(\bm{x})\right), \tag{4}$$

where $\gamma_{t}$ denotes a scalable step length and $\nabla_{\bm{x}_{t}}$ denotes the gradient w.r.t. $\bm{x}$ , evaluated at $\bm{x}_{t}$ . Then, in RIM implementation, Eq. 4 is represented as:

$$\bm{x}_{t+1}=\bm{x}_{t}+g_{\theta}\left(\nabla_{\bm{x}_{t}}\left(\log\mathcal{L}_{\text{likelihood}}(\bm{y}|\bm{x})\right),\bm{x}_{t}\right), \tag{5}$$

where $g_{\theta}$ is a neural network parameterized by $\theta$ . In RIM, the prior distribution $p_{\text{prior}}(\bm{x})$ is implicitly integrated into the parameterized neural network $g_{\theta}$ which is trained with a weighted sum of the individual prediction losses between $\bm{x}$ and $\bm{x}_{t}$ (e.g., the mean squared loss) at each time step $t$ .
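For intuition, the hand-crafted update of Eq. 4, which RIM replaces with the learned network $g_{\theta}$ in Eq. 5, can be sketched on a tiny linear problem. The sketch below is illustrative only and rests on assumptions not made in the text: Gaussian noise with variance `sigma2`, a zero-mean Gaussian prior with variance `tau2`, a fixed step length `gamma`, and hypothetical helper names.

```python
def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def rmatvec(a, r):
    """Compute A^T r."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]


def log_posterior_grad(a, y, x, sigma2, tau2):
    """Gradient of log-likelihood + log-prior under the Gaussian assumptions above."""
    residual = [y_i - ax_i for y_i, ax_i in zip(y, matvec(a, x))]
    grad_likelihood = [g / sigma2 for g in rmatvec(a, residual)]
    grad_prior = [-x_j / tau2 for x_j in x]
    return [gl + gp for gl, gp in zip(grad_likelihood, grad_prior)]


def map_gradient_steps(a, y, steps=500, gamma=0.1, sigma2=1.0, tau2=1.0):
    """Iterate x_{t+1} = x_t + gamma * grad, starting from x_0 = 0."""
    x = [0.0] * len(a[0])
    for _ in range(steps):
        grad = log_posterior_grad(a, y, x, sigma2, tau2)
        x = [x_j + gamma * g_j for x_j, g_j in zip(x, grad)]
    return x


# One measurement of two unknowns (m < d): the prior selects a unique solution.
x_hat = map_gradient_steps([[1.0, 1.0]], [2.0])
```

With one measurement and two unknowns the likelihood alone does not determine $\bm{x}$ ; the prior term resolves the ambiguity, which is the role that RIM leaves to $g_{\theta}$ to learn implicitly.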

In the context of meta-learning, we regard the likelihood term $\mathcal{L}_{\text{likelihood}}$ guided by the forward model as the ‘inner loss’, denoted by $\mathcal{L}_{\text{inner}}$ , as its gradient serves as the input of the neural network $g_{\theta}$ . For example, given the Gaussian assumption of the noise $\bm{n}$ with a known variance of $\sigma^{2}$ and the linear forward model described in Eq. 2, the inner loss can be derived from the negative log-likelihood, up to constant factors:

$$\mathcal{L}_{\text{inner}}=\frac{1}{\sigma^{2}}\|\bm{y}-A\bm{x}\|_{2}^{2}. \tag{6}$$

In RIM, the gradient of $\mathcal{L}_{\text{inner}}$ is calculated explicitly with the (linear) forward operator $A$ , which is free of the forward pass of a neural network. That means $\mathcal{L}_{\text{inner}}$ does not directly contribute to the update of the network parameters $\theta$ . The weighted loss for training the neural network $g_{\theta}$ to efficiently solve the inverse problem can be regarded as the ‘outer loss’, denoted by $\mathcal{L}_{\text{outer}}$ . In the form of the inverse problem shown in Eq. 2, the outer loss to update the network parameter $\theta$ across $T$ time steps can be expressed as:

$$\mathcal{L}_{\text{outer}}(\theta)=\frac{1}{T}\sum_{t=1}^{T}\|\bm{x}-\bm{x}_{t}\|_{2}^{2}. \tag{7}$$

For clarity and consistency, these notations of $\mathcal{L}_{\text{inner}}$ and $\mathcal{L}_{\text{outer}}$ will be uniformly applied in the subsequent sections.
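As a worked step, the gradient of the inner loss in Eq. 6 w.r.t. $\bm{x}$ follows from the standard identity for the squared $\ell_{2}$ norm of a linear residual:

$$\nabla_{\bm{x}}\mathcal{L}_{\text{inner}}=-\frac{2}{\sigma^{2}}A^{\top}\left(\bm{y}-A\bm{x}\right).$$

Only the forward operator $A$ and its transpose appear in this expression, so it can be evaluated at $\bm{x}_{t}$ without a forward pass through $g_{\theta}$ .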

2.3 Recurrent Inference Image Registration Network (RIIR)

Inspired by the formulation of RIM and the optimization nature of medical image registration, we present a novel deep learning-based image registration framework, named the Recurrent Inference Image Registration Network (RIIR). The overview of our proposed framework can be found in Fig. 1. The proposed framework performs an end-to-end iterative prediction of a dense transformation $\bm{\phi}$ in $T$ steps for pairwise registration: given the input image pair $\left(I_{\text{mov}},I_{\text{fix}}\right)$ , the optimization problem in Eq. 1 can be solved by the iterative update of $\bm{\phi}$ . The update rule at step $t\in\{0,1,\ldots,T-1\}$ is:

$$\bm{\phi}_{t+1}=\bm{\phi}_{t}+\Delta\bm{\phi}_{t}, \tag{8}$$

where $\bm{\phi}_{0}$ is initialized as an identity mapping $\bm{\phi}_{0}(\bm{x})=\bm{x}$ . The update at step $t$ , $\Delta\bm{\phi}_{t}$ , is calculated by a recurrent update network $g_{\theta}$ by taking a channel-wise concatenation of

$$\{\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}\right)\} \tag{9}$$

as input, where $\nabla_{\bm{\phi}_{t}}$ denotes the gradient w.r.t. $\bm{\phi}$ evaluated for $\bm{\phi}=\bm{\phi}_{t}$ and $\mathcal{L}_{\text{inner}}$ denotes the inner loss.

Originally, RIM aimed to learn a recurrent solver for an inverse problem where the forward model from signal to measurement is known, such as quantitative mapping (Sabidussi et al., 2021) or MRI reconstruction (Lønning et al., 2019). However, image registration is optimized in an unsupervised manner, and the forward model from the measurement pair $(I_{\text{mov}},I_{\text{fix}})$ to the deformation field $\bm{\phi}$ is unknown. In this work, we extend the framework of RIM to be amenable to a broader range of tasks including image registration. We design the inner loss $\mathcal{L}_{\text{inner}}$ by adapting the optimization objective in Eq. 1. Specifically, we use the similarity part $\mathcal{L}_{\text{sim}}$ in Eq. 1 as the inner loss at time step $t$ :

$$\mathcal{L}_{\text{inner}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}\right)=\mathcal{L}_{\text{sim}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}\right). \tag{10}$$

The explicit modelling of $\mathcal{L}_{\text{inner}}$ also allows its gradient to be calculated by automatic differentiation, while the regularization part is learned implicitly by $g_{\theta}$ during training.

In the implementation of RIM, the iterative update Eq. 8 is achieved by a recurrent neural network (RNN) to generalize the update rule in Eq. 5 with hidden memory state variable $\bm{h}$ estimated for each time step $t$ . Unlike previous RIM-based works (Putzky et al., 2019; Sabidussi et al., 2021) which use two linear gated recurrent units (GRU) to calculate the hidden states $\bm{h}_{t}$ , in RIIR, two convolutional gated recurrent units (ConvGRU) (Shi et al., 2015) are used to better preserve spatial correlation in the image. We further investigate the necessity of including such two-level recurrent structures in our experiment, particularly considering potential complexities in constructing computation graphs for neural networks. The iterative update equations of RIIR at step $t$ have the following form, with the hidden memory states:

$$\begin{aligned}
\left\{\Delta\bm{\phi}_{t},\bm{h}_{t+1}\right\} &= g_{\theta}\left(\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}\left(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}\right),\bm{h}_{t}\right), && \text{(11)}\\
\bm{\phi}_{t+1} &= \bm{\phi}_{t}+\Delta\bm{\phi}_{t}, && \text{(12)}
\end{aligned}$$

where $\bm{h}_{t}=\{\bm{h}_{t}^{1},\bm{h}_{t}^{2}\}$ denotes the two-level hidden memory states at step $t$ . The size of $\bm{h}_{t}$ depends on the size of the input image pair $(I_{\text{mov}},I_{\text{fix}})$ , with multiple channels. For $t=0$ , $\bm{h}_{0}$ is initialized to zeros. We name our network $g_{\theta}$ the RIIR Cell, with its detailed architecture illustrated in Fig. 1. RIIR differs from the existing gradient-based iterative algorithm GraDIRN (Qiu et al., 2022), under the same definition of $\mathcal{L}_{\text{inner}}$ as in Eq. 10, in how the gradient is used: RIIR takes the gradient of the inner loss as the neural network input to calculate the incremental update. GraDIRN, on the other hand, takes the warped image pair $(I_{\text{mov}}\circ\bm{\phi},I_{\text{fix}})$ and the deformation field $\bm{\phi}$ , concatenated channel-wise, as the network input to output the regularization update in Eq. 1, while the gradient of $\mathcal{L}_{\text{inner}}$ is added to the update without any further processing. Since the network $g_{\theta}$ learns an incremental update with the gradient of the inner loss as compressed information, the training of RIIR can be more efficient, which will be discussed in the experiment section.
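The data flow of Eqs. 11 and 12 can be sketched as an unrolled loop on a toy 1D problem. This is a minimal sketch under loud assumptions, not the RIIR Cell: the learned ConvGRU network $g_{\theta}$ is replaced by a hand-written momentum rule (`toy_cell`) whose velocity plays the role of the hidden state, the inner loss is MSE, its gradient is taken by central finite differences instead of automatic differentiation, and all function names are hypothetical.

```python
def warp_1d(signal, phi):
    """Sample `signal` at coordinates `phi` (clamped linear interpolation)."""
    last = len(signal) - 1
    out = []
    for p in phi:
        p = min(max(p, 0.0), float(last))
        i0 = min(int(p), last - 1)
        w = p - i0
        out.append((1.0 - w) * signal[i0] + w * signal[i0 + 1])
    return out


def inner_loss(i_mov, i_fix, phi):
    """MSE between the warped moving signal and the fixed signal."""
    warped = warp_1d(i_mov, phi)
    return sum((a - b) ** 2 for a, b in zip(warped, i_fix)) / len(i_fix)


def inner_grad(i_mov, i_fix, phi, eps=1e-4):
    """Central finite-difference gradient of the inner loss w.r.t. phi."""
    grad = []
    for k in range(len(phi)):
        up, down = list(phi), list(phi)
        up[k] += eps
        down[k] -= eps
        diff = inner_loss(i_mov, i_fix, up) - inner_loss(i_mov, i_fix, down)
        grad.append(diff / (2.0 * eps))
    return grad


def toy_cell(phi, grad, h, step=2.0, beta=0.5):
    """Stand-in for g_theta: momentum update; the velocity acts as hidden state."""
    h_next = [beta * h_k - step * g_k for h_k, g_k in zip(h, grad)]
    return h_next, h_next  # (delta_phi_t, h_{t+1})


def unroll(i_mov, i_fix, cell, num_steps):
    """Apply the recurrent update for a fixed number of steps."""
    phi = [float(k) for k in range(len(i_mov))]  # phi_0: identity mapping
    h = [0.0] * len(phi)                         # initial hidden state: zeros
    losses = [inner_loss(i_mov, i_fix, phi)]
    for _ in range(num_steps):
        grad = inner_grad(i_mov, i_fix, phi)
        delta_phi, h = cell(phi, grad, h)              # Eq. 11
        phi = [p + d for p, d in zip(phi, delta_phi)]  # Eq. 12
        losses.append(inner_loss(i_mov, i_fix, phi))
    return phi, losses


i_mov = [float(k) for k in range(8)]           # ramp signal
i_fix = [min(k + 1.0, 7.0) for k in range(8)]  # same ramp, shifted by one sample
phi_final, losses = unroll(i_mov, i_fix, toy_cell, num_steps=40)
```

The loop only consumes $\bm{\phi}_{t}$ , the inner-loss gradient and $\bm{h}_{t}$ ; swapping `toy_cell` for a trained network would leave the rest of the unrolling unchanged, which is the sense in which the update rule itself is what gets learned.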

Unlike previous deep learning-based iterative deformable image registration methods, which do not incorporate internal hidden states (Zhao et al., 2019; Fechter and Baltas, 2020; Qiu et al., 2022), we propose to combine the gradient information and hidden states as the network input. Using $\bm{h}_{t}$ also suggests an analogy with gradient-based optimization methods such as the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS), which track and memorize progression (Putzky and Welling, 2017). To substantiate this design, the input selections of RIIR are further examined in ablation studies and discussed in our experiments.

Since the ground-truth deformation field is not known in deformable image registration, we use the optimization objective Eq. 1 as the proposed outer loss to optimize the parameters $\theta$ of RIIR Cell $g_{\theta}$ . We incorporate a weighted sum of losses for the outer loss $\mathcal{L}_{\text{outer}}$ to ensure that each step contributes to the final prediction:

$$\mathcal{L}_{\text{outer}}(\theta)=\sum_{t=1}^{T}w_{t}\left(\mathcal{L}_{\text{sim}}(I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}})+\lambda\mathcal{L}_{\text{reg}}(\bm{\phi}_{t})\right), \tag{13}$$

where $w_{t}$ is a (positive) scalar indicating the weight of step $t$ . Both uniform ( $w_{t}=\frac{1}{T}$ ) and exponential weights ( $w_{t}=10^{\frac{t-1}{T-1}}$ ) are considered and compared in our experiments. Notably, using a (weighted) average of step-wise losses also distinguishes our proposed RIIR from other iterative deep learning-based methods (Qiu et al., 2022; Zhao et al., 2019), and addresses the fact that early steps in the prediction process were previously neglected.
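The two weighting schemes and the weighted sum of Eq. 13 can be written down directly. The sketch below is illustrative: the function names are hypothetical, the per-step losses are made-up numbers, and the handling of $T=1$ for the exponential scheme (where the exponent is undefined) is an assumption.

```python
def step_weights(num_steps, scheme="uniform"):
    """Return [w_1, ..., w_T] for the uniform or exponential scheme."""
    if scheme == "uniform":
        return [1.0 / num_steps] * num_steps
    if scheme == "exponential":
        if num_steps == 1:
            return [1.0]  # assumption: the formula is undefined for T = 1
        return [10.0 ** ((t - 1) / (num_steps - 1)) for t in range(1, num_steps + 1)]
    raise ValueError(f"unknown scheme: {scheme}")


def outer_loss(step_losses, weights):
    """Weighted sum of the per-step objectives (similarity + lambda * regularization)."""
    return sum(w * loss for w, loss in zip(weights, step_losses))


uniform = step_weights(6, "uniform")
exponential = step_weights(6, "exponential")
example = outer_loss([4.0, 2.0], step_weights(2, "uniform"))  # made-up step losses
```

Under the exponential scheme the weights grow from 1 at the first step to 10 at the last, so later predictions dominate the loss while early steps still receive a training signal.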

2.4 Metrics

Similarity Functions for Inner Loss $\mathcal{L}_{\text{inner}}$ : In the context of image registration, unlike inverse problems with straightforward forward models, the problem is addressed as a broader optimization challenge. Therefore, a (differentiable) function must be chosen to act as the inner loss that iteratively evaluates the quality of the estimate $\bm{\phi}_{t}$ in RIIR. Furthermore, the gradient of $\mathcal{L}_{\text{inner}}$ as an input of a convolutional recurrent neural network has not been studied before for deformable image registration. These motivate the study of different choices of $\mathcal{L}_{\text{inner}}$ under a fixed choice of the outer loss $\mathcal{L}_{\text{outer}}$ . In this work, we evaluate three similarity functions: mean squared error (MSE), normalized cross-correlation (NCC) (Avants et al., 2008), and normalized mutual information (NMI) (Studholme et al., 1999).

The MSE between two 3D images $I_{1},I_{2}\in\mathbb{R}^{d_{x}\times d_{y}\times d_{z}}$ is defined as follows:

$$\operatorname{MSE}\left(I_{1},I_{2}\right)=\frac{1}{d_{x}d_{y}d_{z}}{\left\|I_{1}-I_{2}\right\|}_{2}^{2}. \tag{14}$$

The MSE metric is minimized when pixels of $I_{1}$ and $I_{2}$ have the same intensities. Therefore, it is sensitive to the contrast change. In comparison, the NCC metric measures the difference between images with the image intensity normalized. The NCC difference between $I_{1}$ and $I_{2}$ is given by:

$$\operatorname{NCC}(I_{1},I_{2})=\frac{1}{d_{x}d_{y}d_{z}}\sum_{\bm{x}\in\Omega_{I_{1}}}\frac{\sum_{\bm{x}^{\prime}\in\Omega_{\bm{x}}}(I_{1}(\bm{x}^{\prime})-\bar{I}_{1}(\bm{x}))(I_{2}(\bm{x}^{\prime})-\bar{I}_{2}(\bm{x}))}{\sqrt{\hat{I}_{1}(\bm{x})\hat{I}_{2}(\bm{x})}}, \tag{15}$$

where $\Omega_{I_{1}}$ denotes all possible coordinates in $I_{1}$ , $\Omega_{\bm{x}}$ represents a neighborhood of voxels around coordinate position $\bm{x}$ and $\bar{I}(\bm{x})$ and $\hat{I}(\bm{x})$ denote the (local) mean and variance in $\Omega_{\bm{x}}$ . In this study, we consider the local NCC with a window size of 9.
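A 1D version of the local NCC in Eq. 15 can be sketched as follows. Several details are assumptions made for illustration: windows are truncated at the signal boundary, the local variance is taken as the unnormalized sum of squared deviations so that each local term lies in $[-1,1]$ , a small `eps` guards against division by zero, and the function name is hypothetical.

```python
import math


def local_ncc_1d(i1, i2, window=9, eps=1e-8):
    """Average of windowed normalized cross-correlations over all positions."""
    n = len(i1)
    half = window // 2
    total = 0.0
    for x in range(n):
        lo, hi = max(0, x - half), min(n, x + half + 1)  # truncated neighbourhood
        w1, w2 = i1[lo:hi], i2[lo:hi]
        m1, m2 = sum(w1) / len(w1), sum(w2) / len(w2)    # local means
        cross = sum((a - m1) * (b - m2) for a, b in zip(w1, w2))
        var1 = sum((a - m1) ** 2 for a in w1)            # unnormalized local variance
        var2 = sum((b - m2) ** 2 for b in w2)
        total += cross / (math.sqrt(var1 * var2) + eps)
    return total / n


signal = [float(k * k) for k in range(12)]
rescaled = [2.0 * v + 3.0 for v in signal]  # affine intensity change
negated = [-v for v in signal]              # inverted contrast

ncc_rescaled = local_ncc_1d(signal, rescaled)
ncc_negated = local_ncc_1d(signal, negated)
```

The rescaled signal yields a value close to 1 because local normalization removes affine intensity changes, whereas MSE between the same two signals would be large.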

Compared to MSE and NCC, NMI is shown to be more robust when the linear relation of signal intensities between two images does not hold (Studholme et al., 1999; de Vos et al., 2020), which is often the case in quantitative MRI as the signal models are mostly exponential (Messroghli et al., 2004; Chow et al., 2022). The NMI between two images can be written as:

$$\operatorname{NMI}(I_{1},I_{2})=\frac{H(I_{1})+H(I_{2})}{H(I_{1},I_{2})}, \tag{16}$$

where $H(I_{1})$ and $H(I_{2})$ are marginal entropies of $I_{1}$ and $I_{2}$ , respectively, and $H(I_{1},I_{2})$ denotes the joint entropy of the two images. Since gradients are necessary for both $\mathcal{L}_{\text{inner}}$ and $\mathcal{L}_{\text{outer}}$ , we adopt a differentiable approximation of the joint distribution proposed in Qiu et al. (2021), based on Parzen windows with Gaussian distributions (Thévenaz and Unser, 2000).
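To illustrate Eq. 16, the sketch below computes NMI from hard histograms. This is deliberately simpler than the differentiable Parzen-window approximation adopted in this work and is not suitable for gradient computation; the bin count, the quantization rule, and the function names are assumptions.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (natural log) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def quantize(values, bins):
    """Map intensities to integer bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [min(int((v - lo) / scale * bins), bins - 1) for v in values]


def nmi(i1, i2, bins=8):
    """Normalized mutual information (H(I1) + H(I2)) / H(I1, I2)."""
    q1, q2 = quantize(i1, bins), quantize(i2, bins)
    n = len(q1)
    h1 = entropy(Counter(q1).values(), n)
    h2 = entropy(Counter(q2).values(), n)
    h12 = entropy(Counter(zip(q1, q2)).values(), n)
    if h12 == 0.0:
        raise ValueError("NMI is undefined when the joint entropy is zero")
    return (h1 + h2) / h12


image = [float(v) for v in range(8)]
inverted = [7.0 - v for v in image]  # inverted contrast, same structure

nmi_same = nmi(image, image)
nmi_inverted = nmi(image, inverted)
nmi_independent = nmi([0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0], bins=2)
```

The inverted signal reaches the same maximal value as the identical pair, since NMI depends only on the statistical dependence between intensities and not on a linear relation between them.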

Regularization Metrics: To ensure a smooth and reasonable deformation field, we use a diffusion regularization loss which penalizes large displacements in $\bm{\phi}$ acting on $I\in\mathbb{R}^{d_{x}\times d_{y}\times d_{z}}$ (Balakrishnan et al., 2019):

$$\mathcal{L}_{\text{reg}}=\frac{1}{d_{x}d_{y}d_{z}}\sum_{\bm{x}\in\Omega_{I}}\|\nabla\bm{\phi}(\bm{x})\|_{2}^{2}, \tag{17}$$

where $\nabla\bm{\phi}(\bm{x})$ denotes the Jacobian of $\bm{\phi}$ at coordinate $\bm{x}$ . Note that Eq. 17 and its gradient are not evaluated in each RIIR inference step, as indicated in Eq. 10; instead, the outer loss $\mathcal{L}_{\text{outer}}$ and the data-driven training process guide the RIIR Cell $g_{\theta}$ to learn the regularization implicitly.

3 Experiments

3.1 Dataset

We evaluated our proposed RIIR framework on two separate datasets: 1) a 3D brain MRI dataset with an inter-subject registration setup, OASIS (Marcus et al., 2007), with pre-processing from Hoopes et al. (2021), denoted as OASIS; 2) a 2D quantitative cardiac MRI dataset based on multiparametric SAturation-recovery single-SHot Acquisition (mSASHA) image time series (Chow et al., 2022), denoted as mSASHA. These datasets serve our interests in inter-subject tissue alignment and in respiratory motion correction with contrast variation, respectively.

OASIS: The dataset contains 414 subjects, where for each subject, the normalized $T_{1}$ -weighted scan was acquired. The subjects are split into train/validation/test with counts of $[300,30,84]$ . For training, images are randomly paired using an on-the-fly data loader, while in the validation and test sets, all images are paired with the next image in a fixed order. The dataset was preprocessed with FreeSurfer and SAMSEG by Hoopes et al. (2021), resulting in skull-stripped and bias-corrected 3D volumes with a size of $160\times 192\times 224$ . We further resampled the images into a size of $128\times 128\times 128$ with intensity clipping between $(1\%,99\%)$ percentiles. Fig. 2 illustrates an example pair of OASIS images, showcasing the consistency in signal intensity and contrast.
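The percentile-based intensity clipping mentioned above can be sketched as follows. The exact percentile definition used in pre-processing is not stated, so linear interpolation between order statistics is an assumption here, as are the function names; the sketch operates on a flat list of intensities.

```python
def percentile(sorted_values, q):
    """Percentile q in [0, 100] with linear interpolation between order statistics."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def clip_intensity(values, low_q=1.0, high_q=99.0):
    """Clip intensities to the [low_q, high_q] percentile range of the input."""
    ordered = sorted(values)
    lo, hi = percentile(ordered, low_q), percentile(ordered, high_q)
    return [min(max(v, lo), hi) for v in values]


intensities = [float(v) for v in range(101)]  # toy intensities 0, 1, ..., 100
clipped = clip_intensity(intensities)
```

Passing `high_q=95.0` would give the clipping range described for the mSASHA images.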

mSASHA: During a free-breathing mSASHA examination, a time series of $N=30$ real-valued 2D images, denoted by $I=\{I_{n}|\ n=1,2,\ldots,N\}$ , is acquired for the same subject. In the setting of quantitative MRI, we aim to spatially align the $N$ images in a single sequence $I$ to a common fixed template image $I_{\text{fix}}$ , by individually performing $N$ pairwise registration processes over $(I_{n},I_{\text{fix}})$ where $n=1,2,\ldots,N$ .

The mSASHA acquisition technique (Chow et al., 2022) follows a voxel-wise 3-parameter signal model based on the joint cardiac $T_{1}$ - $T_{2}$ signal model:

$$\mathcal{S}\left(T_{1},T_{2},A\right)=A\left\{1-\left[1-\left(1-e^{-TS/T_{1}}\right)e^{-TE/T_{2}}\right]e^{-TD/T_{1}}\right\}, \tag{18}$$

where $(TS,TE,TD)$ denotes the set of three acquisition variables, and $(T_{1},T_{2},A)$ is the set of parameters to be estimated for each voxel coordinate of the image series. We encourage interested readers to refer to Chow et al. (2022) for a more detailed explanation.
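To make Eq. 18 concrete, the following minimal Python sketch evaluates the signal model for a single voxel or, through broadcasting, for arrays of voxels. The function name `msasha_signal` and the numerical values in the usage line are illustrative assumptions and are not taken from Chow et al. (2022); the only requirement is that timings and relaxation times share the same unit.

```python
import numpy as np


def msasha_signal(t1, t2, amplitude, ts, te, td):
    """Evaluate Eq. 18 for tissue parameters (T1, T2, A) and timings (TS, TE, TD)."""
    saturation_recovery = 1.0 - np.exp(-ts / t1)   # (1 - e^{-TS/T1})
    t2_weighting = np.exp(-te / t2)                # e^{-TE/T2}
    inner = 1.0 - saturation_recovery * t2_weighting
    return amplitude * (1.0 - inner * np.exp(-td / t1))


# Hypothetical values in milliseconds, chosen only for illustration.
example = msasha_signal(t1=1200.0, t2=45.0, amplitude=1.0, ts=600.0, te=55.0, td=10.0)
```

As a quick sanity check on the expression, the signal vanishes for $TS=TE=TD=0$ and recovers $A$ for $TS\to\infty$ with $TE=TD=0$.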

In our experiment, an in-house mSASHA dataset was used. This fully anonymized raw dataset was provided by NIH, and was considered “non-human subject data research” by the NIH Office of Human Subjects Research. The dataset comprises 120 subjects, with each subject having 3 slice positions, resulting in a total of 360 slices. Each mSASHA time series consists of a fixed length of $N=30$ images. We split mSASHA into train/validation/test with counts of $[84,12,24]$ by subject to avoid data leakage across the three slices. Given variations in image sizes due to different acquisition conditions, we first center-cropped the images to the same size of $144\times 144$. Subsequently, we applied intensity clipping between the $(1\%,95\%)$ percentiles to mitigate extreme signal intensities from the chest wall region. We selected the last image in the series as the template $I_{\text{fix}}$, which is a $T_{2}$-weighted image with the greatest contrast between the myocardium and adjacent blood pool. An illustrative example of mSASHA images can be found in Fig. 3, showing varying contrasts and non-rigid motion across frames.
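The two pre-processing steps described above (center-cropping followed by percentile clipping) can be sketched as follows. This is a minimal illustration rather than the pipeline used in our experiments: the helper names are hypothetical, and computing the percentiles per frame (instead of per series or per subject) is an assumption, since the text does not specify it.

```python
import numpy as np


def center_crop(image, size=144):
    """Crop a 2D image to (size, size) around its center; assumes both sides >= size."""
    rows, cols = image.shape
    top = (rows - size) // 2
    left = (cols - size) // 2
    return image[top:top + size, left:left + size]


def clip_percentiles(image, low=1.0, high=95.0):
    """Clip intensities to the [low, high] percentile range of this image."""
    lower, upper = np.percentile(image, [low, high])
    return np.clip(image, lower, upper)


def preprocess_frame(image, size=144, low=1.0, high=95.0):
    """Apply the crop first and the clipping second, in the order given in the text."""
    return clip_percentiles(center_crop(image, size), low, high)
```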

3.2 Evaluation Metrics

Conventionally, the root mean squared error (RMSE) is evaluated as an overall estimate of the agreement between the image pair $(I_{\text{mov}}\circ\bm{\phi},I_{\text{fix}})$. Meanwhile, the spatial smoothness of the transformation is evaluated by the standard deviation of $\log\left|J\right|=\log\left|\nabla\bm{\phi}(\bm{x})\right|$, where $\left|\cdot\right|$ denotes the determinant of a matrix. However, these conventional image- and displacement-based metrics are often closely tied to the training objectives $\mathcal{L}_{\text{sim}}$ and $\mathcal{L}_{\text{reg}}$, and are thereby affected by the trade-off parameter $\lambda$. To mitigate the potential impact of the choice of $\lambda$, in this work we use the same value of $\lambda$ across all baseline models and our proposed RIIR model for a given dataset and $\mathcal{L}_{\text{sim}}$.
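As an illustration of the smoothness metric, the sketch below computes the standard deviation of $\log|J|$ for a 2D transformation. It rests on several assumptions that are not stated in this section: the transformation is parameterized as $\bm{\phi}(\bm{x})=\bm{x}+\bm{u}(\bm{x})$ with a displacement field $\bm{u}$ in pixel units, derivatives are approximated by finite differences, and non-positive determinants (folding) are clipped to a small `eps` so that the logarithm stays defined. The function name is hypothetical.

```python
import numpy as np


def std_log_jacobian(displacement, eps=1e-9):
    """Standard deviation of log|J| for phi(x) = x + u(x).

    displacement: array of shape (2, H, W) with the (row, column) components of u.
    """
    # np.gradient returns the derivatives along axis 0 and axis 1 of a 2D array.
    du0_d0, du0_d1 = np.gradient(displacement[0])
    du1_d0, du1_d1 = np.gradient(displacement[1])
    # Determinant of J = I + grad(u) at every pixel.
    det = (1.0 + du0_d0) * (1.0 + du1_d1) - du0_d1 * du1_d0
    # Assumption: folded pixels (det <= 0) are clipped instead of being excluded.
    return float(np.std(np.log(np.clip(det, eps, None))))
```

Under these assumptions, the identity transformation and any uniform scaling yield a value of zero, since $|J|$ is then constant over the image.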

For inter-subject brain MRI, two metrics, the Dice score and the Hausdorff distance (HD), are considered to evaluate the segmentation quality after registration. Given two sets $X\subset M$ and $Y\subset M$, the Dice score is defined to measure the overlap of $X$ and $Y$:

	$\displaystyle\text{Dice}(X,Y)=\frac{2|X\cap Y|}{|X|+|Y|}.$		(19)

Similarly, the Hausdorff distance of the two sets $X$ and $Y$ is given by:

	$\displaystyle\text{HD}(X,Y):=\max\left\{\sup_{x\in X}d(x,Y),\sup_{y\in Y}d(X,y)\right\},$		(20)

where $d(\cdot,\cdot)$ is a metric (2-norm in this work) on $M$ and $d(x,Y):=\inf_{y\in Y}d(x,y)$ . As a remark, in this work, we consider the average across all segmentation labels to calculate the Dice score and HD in OASIS instead of only considering major regions.
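Eq. 19 and Eq. 20 can be sketched for binary segmentation masks as follows. This is a brute-force illustration and not the evaluation code used in our experiments: the function names are hypothetical, distances are measured in voxel units (an assumption), both masks are assumed non-empty for the HD, and the pairwise distance matrix limits the approach to small label regions.

```python
import numpy as np


def dice_score(mask_x, mask_y):
    """Eq. 19 for two boolean masks of identical shape."""
    x = np.asarray(mask_x, dtype=bool)
    y = np.asarray(mask_y, dtype=bool)
    denominator = x.sum() + y.sum()
    if denominator == 0:
        return 1.0  # assumption: two empty masks count as perfect overlap
    return float(2.0 * np.logical_and(x, y).sum() / denominator)


def hausdorff_distance(mask_x, mask_y):
    """Eq. 20 with the 2-norm, evaluated on the voxel coordinates of each mask."""
    points_x = np.argwhere(mask_x)
    points_y = np.argwhere(mask_y)
    # Pairwise Euclidean distances, shape (|X|, |Y|).
    distances = np.linalg.norm(points_x[:, None, :] - points_y[None, :, :], axis=-1)
    x_to_y = distances.min(axis=1).max()  # sup over x of d(x, Y)
    y_to_x = distances.min(axis=0).max()  # sup over y of d(X, y)
    return float(max(x_to_y, y_to_x))
```

Averaging over all segmentation labels, as done for OASIS, then amounts to calling both functions once per label and taking the mean.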

Furthermore, for the mSASHA dataset we also evaluate two additional metrics proposed by Huizinga et al. (2016), which are not used during training. The metrics are based on the principal component analysis (PCA) of the images. Assume $M\in\mathbb{R}^{d_{x}d_{y}\times N}$ is the matrix representation of $I$, where each row of $M$ corresponds to a coordinate in the image space. The correlation matrix of $M$ is then calculated by:

	$\displaystyle K=\frac{1}{d_{x}d_{y}-1}\Sigma^{-1}(M-\overline{M})^{\intercal}(M-\overline{M})\Sigma^{-1},$		(21)

where $\Sigma$ is a diagonal matrix containing the standard deviation of each column of $M$, and $\overline{M}$ is the matrix whose entries are the corresponding column-wise means. Since an ideal qMRI model assumes a voxel-wise tissue alignment, the actual underlying dimension of $K$ can be characterized by a low-dimensional (linear) subspace driven by the signal model. In the mSASHA signal model, the dimension of such a subspace is assumed to be four according to Eq. 18, determined by the number of parameters to be estimated. Since the trace of $K$, $\text{tr}(K)$, is constant, two PCA-based metrics were proposed as follows:

	$\displaystyle\mathcal{D}_{\text{PCA1}}$	$\displaystyle=\sum_{i=1}^{N}\lambda_{i}-\sum_{j=1}^{L}\lambda_{j}=\text{tr}(K)-\sum_{j=1}^{L}\lambda_{j},$		(22)
	$\displaystyle\mathcal{D}_{\text{PCA2}}$	$\displaystyle=\sum_{j=1}^{N}j\lambda_{j}.$		(23)

Both metrics were designed to penalize a long-tailed spectrum of $K$. Here $\lambda_{j}$ denotes the $j$-th largest eigenvalue of $K$ (not to be confused with the trade-off parameter $\lambda$), and $L$ is a hyperparameter set according to the number of parameters of the signal model. For $\mathcal{D}_{\text{PCA1}}$, the ideal scenario, in which all images are perfectly aligned with the tissue anatomy and follow the signal model, results in a value of $0$. Meanwhile, $\mathcal{D}_{\text{PCA2}}$ weights each eigenvalue by its index and thus further emphasizes the tail of the spectrum, enlarging the gaps across experiments.
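The computation of Eq. 21 to Eq. 23 can be sketched in a few lines. This is a minimal illustration and not the reference implementation of Huizinga et al. (2016): the function name is hypothetical, `np.corrcoef` is used as a shortcut for Eq. 21 (the normalization constant cancels in the correlation matrix), and frames with constant intensity would need special handling because their standard deviation is zero.

```python
import numpy as np


def pca_metrics(series, num_model_components):
    """Return (D_PCA1, D_PCA2) for an image series of shape (N, H, W)."""
    n = series.shape[0]
    # Matrix representation M: one row per pixel coordinate, one column per frame.
    m = series.reshape(n, -1).T.astype(float)
    # Correlation matrix K between the N frames, Eq. 21.
    k = np.corrcoef(m, rowvar=False)
    # Eigenvalues sorted from largest to smallest.
    eigenvalues = np.sort(np.linalg.eigvalsh(k))[::-1]
    d_pca1 = eigenvalues.sum() - eigenvalues[:num_model_components].sum()  # Eq. 22
    d_pca2 = np.sum(np.arange(1, n + 1) * eigenvalues)                      # Eq. 23
    return float(d_pca1), float(d_pca2)
```

For a series in which every frame is a positive multiple of the same image, $K$ has a single non-zero eigenvalue equal to $N$, so that $\mathcal{D}_{\text{PCA1}}=0$ for $L=1$ and $\mathcal{D}_{\text{PCA2}}=N$; misalignment or noise spreads energy into the tail and increases both values.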

To focus the analysis on the heart region, the resulting images are cropped to this area before the metrics are computed. This ensures that the evaluation concentrates on the relevant anatomical structures.

3.3 Experimental Settings

We here summarize the main experiments for evaluation and further ablation experiments for RIIR. For all experiments, the main workflow is to register the image series $I$ of length $N$ in a pairwise manner: that is, we first choose a template $I_{\text{fix}}$ , and then perform $N$ registrations. When $N=2$ , the registration process simplifies to straightforward pairwise registration, as is the situation for OASIS.

Experiment 1: Comparison Study with Varying Data Availability

We introduce five data-availability scenarios on both datasets to evaluate the robustness of the models when training data are limited, a situation that often arises in both research settings and clinical practice because the number of subjects is heavily limited. The training data availability was set to $[5\%,10\%,25\%,50\%,100\%]$ for both OASIS and mSASHA. It is worth noting that, in the limited data availability scenarios, the training data remained the same for all models under consideration, and the held-out test split remained unchanged across all scenarios.
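One simple way to guarantee that every model sees the same training subjects in each scenario is to derive all subsets from a single seeded shuffle, as in the sketch below. The function name, the seed, the rounding rule, and the nesting of the subsets (each smaller subset being contained in the larger ones) are assumptions made for illustration; the text above only states that the subsets were identical across models.

```python
import random


def training_subsets(subject_ids, fractions=(0.05, 0.10, 0.25, 0.50, 1.00), seed=0):
    """Return a dict mapping each fraction to a fixed list of training subjects."""
    shuffled = list(subject_ids)
    # A dedicated, seeded generator makes the selection reproducible across models.
    random.Random(seed).shuffle(shuffled)
    subsets = {}
    for fraction in fractions:
        count = max(1, round(fraction * len(shuffled)))
        subsets[fraction] = shuffled[:count]  # prefixes of one shuffle are nested
    return subsets


# For the 300 OASIS training subjects this yields 15, 30, 75, 150, and 300 subjects.
oasis_subsets = training_subsets(range(300))
```

Selecting by subject rather than by slice or image pair keeps the mSASHA subsets consistent with the subject-level split described in Section 3.1.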

Experiment 2: Inclusion of Hidden States

Unlike most related works utilizing the original RIM framework (Lønning et al., 2019; Sabidussi et al., 2021), where two levels of hidden states are considered, we explored the impact of modifying or even turning off the hidden states. In our implementation of the convolutional GRU, at most two levels of hidden states, $\bm{h}_{t}^{1}$ and $\bm{h}_{t}^{2}$, are considered, following most recent works using RIM (Lønning et al., 2019; Sabidussi et al., 2021, 2023). Each hidden state has the same number of channels as its corresponding convolutional GRU; in this study, the channels were set to $[16,16]$ for 3D OASIS and $[32,16]$ for mSASHA after taking the VRAM consumption into account. From this experiment onward, all experiments were evaluated on the validation split of both datasets.

Experiment 3: Choice of Inner Loss Functions $\mathcal{L}_{\text{inner}}$

As detailed in the Methods section, the purpose of $\mathcal{L}_{\text{inner}}$ differs from that in previous iterative registration methods, where $\mathcal{L}_{\text{inner}}$ was not proposed as an input to a convolutional recurrent neural network. From this experiment onward, the ablation experiments were conducted on mSASHA, which involves multi-image registration with both motion and contrast variation. Since it has been proposed previously that NMI is a more suitable choice of $\mathcal{L}_{\text{sim}}$ for $\mathcal{L}_{\text{outer}}$ in qMRI registration (de Vos et al., 2020) than NCC and MSE, we focused on the choice of $\mathcal{L}_{\text{inner}}$ to investigate the registration performance and the information propagated through RIIR. That is, in this experiment, we evaluated MSE, NCC, and NMI as choices of $\mathcal{L}_{\text{inner}}$, with $\mathcal{L}_{\text{sim}}$ in $\mathcal{L}_{\text{outer}}$ fixed as NMI.

Experiment 4: Inclusion of Gradient of Inner Loss $\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}$ as RIIR Input

We performed an experimental study on the input composition for RIIR. As shown in Eq. 12, the goal was to study the data efficiency and the registration performance gained by incorporating the gradient of $\mathcal{L}_{\text{inner}}$ in RIIR. The ablation was achieved by changing the input of $g_{\theta}$. We compared the gradient-based input for $g_{\theta}$ against other input modelling strategies seen in Qiu et al. (2022). Depending on whether the moving image is deformed (explicit) or not (implicit), we ended up with three input compositions:

1. Implicit Input without $\nabla\mathcal{L}_{\text{inner}}$: $[\bm{\phi}_{t},I_{\text{mov}},I_{\text{fix}}]$;
2. Explicit Input without $\nabla\mathcal{L}_{\text{inner}}$: $[\bm{\phi}_{t},I_{\text{mov}}\circ\bm{\phi}_{t},I_{\text{fix}}]$;
3. RIIR Input: $[\bm{\phi}_{t},\nabla_{\bm{\phi}_{t}}\mathcal{L}_{\text{inner}}]$.

This study aimed to provide insights into the impact of different input compositions on the efficiency of RIIR when data availability varies. We conducted the experiment with low data availability choices ( $[5\%,10\%]$ ) to examine the data efficiency induced by the gradient input.

Experiment 5: RIIR Architecture Ablation

Since RIIR is the first attempt to formulate and implement the RIM framework for medical image registration, we performed an extensive ablation study on the network architecture of RIIR, including: 1. the number of evaluation steps $T$; 2. the choice of loss weights $w_{t}$.

3.4 Implementation Details and Baseline Methods

We implemented RIIR with the following settings for Experiment 1: we used $T=6$ inference steps and exponential weighting $w_{t}=10^{\frac{t-1}{T-1}}$. We compared our proposed RIIR with two representative methods in the deep learning paradigm: VoxelMorph (Balakrishnan et al., 2019) and a modification of GraDIRN (Qiu et al., 2022). We conducted in-depth hyperparameter tuning for the comparison between the existing methods and RIIR. For VoxelMorph, we followed a standard setting: a U-Net (Ronneberger et al., 2015) with 5 down-sampling and 5 up-sampling blocks as the backbone. The channels used in the down-sampling blocks are $\left[16,16,32,32,32\right]$. The implementation of GraDIRN followed the authors’ settings suggested in Qiu et al. (2022), with 2 resolutions, 3 iterations per resolution, and the explicit warped images as input. The numbers of parameters for RIIR, VoxelMorph, and GraDIRN are 61K, 254K, and 173K, respectively. The trade-off parameter $\lambda$ for all models was set to $0.05$ when using MSE as $\mathcal{L}_{\text{sim}}$ (for OASIS) and $0.1$ when using NMI as $\mathcal{L}_{\text{sim}}$ (for mSASHA). All methods used the same optimizer, Adam (Kingma and Ba, 2015), with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The initial learning rate was set to $8\times 10^{-4}$ for all models. The experiments were performed on an NVIDIA RTX 4090 GPU with 24 GB of VRAM. We will make our code for RIIR publicly available after acceptance, as well as our implementation of the baseline models.
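The exponential weighting $w_{t}=10^{\frac{t-1}{T-1}}$ can be sketched as follows. The weights grow from $1$ at the first step to $10$ at the last step, so later inference steps contribute more to the training loss. The function names are hypothetical, and combining the per-step losses as a plain weighted sum is an assumption made for illustration; the exact form is given by Eq. 13 in the Methods section.

```python
def step_weights(num_steps):
    """Exponential weights w_t = 10^((t-1)/(T-1)) for t = 1, ..., T."""
    if num_steps == 1:
        return [1.0]  # assumption: a single step gets unit weight
    return [10.0 ** ((t - 1) / (num_steps - 1)) for t in range(1, num_steps + 1)]


def weighted_loss(step_losses, weights):
    """Assumed combination rule: weighted sum of the per-step losses."""
    return sum(w * loss for w, loss in zip(weights, step_losses))


weights = step_weights(6)  # T = 6 as in Experiment 1
```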

4 Results

4.1 Experiment 1: Comparison Study with Varying Data Availability

The results for OASIS are depicted in Fig. 4(a), and an illustrative visualization of RIIR inference on an example test pair is shown in Fig. 5. RIIR outperforms the baselines when data availability is severely limited and maintains consistent performance across the data availability scenarios, showcasing its data efficiency and accuracy. For a more detailed comparison, the statistics at full data availability are presented in Tab. 1 for OASIS.

Table 1: Results of Experiment 1 on OASIS with $100\%$ data availability. The metrics are reported in the format mean(standard deviation). A two-sided Wilcoxon test is calculated for the Dice score, and $^{*}$ denotes a significant difference compared with the baseline VoxelMorph in terms of Dice score ($p<0.05$).

Model	Dice $\uparrow$	HD $\downarrow$	$\text{std}(\log|J|)$ $\downarrow$
VoxelMorph	0.694(0.035)	3.78(0.54)	0.26(0.11)
GraDIRN	0.697(0.034)	3.76(0.54)	0.31(0.14)
RIIR$^{*}$	0.724(0.031)	3.71(0.53)	0.32(0.12)

The results of this experiment on mSASHA are shown in Fig. 4(b) using a composition of boxplots. A step-wise visualization of RIIR inference can be found in Fig. 6. The corresponding boxplots for $\mathcal{D}_{\text{PCA2}}$ can be found in the supplementary material. Similarly, the statistics for mSASHA at full data availability are shown in Tab. 2.

Table 2: Results of Experiment 1 on mSASHA with $100\%$ data availability. The metrics are reported in the format mean(standard deviation). A two-sided Wilcoxon test is calculated for $\mathcal{D}_{\text{PCA1}}$, and $^{*}$ denotes a significant difference compared with the baseline VoxelMorph ($p<0.05$).

Model	$\mathcal{D}_{\text{PCA1}}$ $\downarrow$	$\mathcal{D}_{\text{PCA2}}$ $\downarrow$	$\text{std}(\log|J|)$ $\downarrow$
VoxelMorph	0.425(0.126)	35.88(20.13)	0.33(0.25)
GraDIRN$^{*}$	0.343(0.078)	34.89(12.69)	1.56(0.78)
RIIR$^{*}$	0.324(0.072)	34.49(11.99)	2.18(0.79)

Overall, both iterative methods (GraDIRN and RIIR) performed better than the one-step VoxelMorph across most data availability scenarios on mSASHA. As data availability approached saturation, the performances of the two iterative methods converged. A further qualitative comparison among the three models is shown in Fig. 7.

4.2 Experiment 2: Inclusion of Hidden States

The architectural settings were kept the same as in the aforementioned experiments, and the results are shown in Fig. 12. In addition to better performance on the two datasets, we also empirically noticed that the training was more stable with both hidden states activated.

4.3 Experiment 3: Choice of Inner Loss Functions $\mathcal{L}_{\text{inner}}$

The quantitative results, measured by $\mathcal{D}_{\text{PCA1}}$, are shown in Fig. 10. It is noteworthy that the performance when using NMI as $\mathcal{L}_{\text{inner}}$ did not align with the performance of the other two scenarios. In response, we conducted a further investigation, presenting a qualitative visualization of $\nabla\mathcal{L}_{\text{inner}}$ in Fig. 11 to shed light on this discrepancy. In our implementation, autograd (Paszke et al., 2017) was used to calculate $\nabla\mathcal{L}_{\text{inner}}$ for NCC and NMI, while we used the analytical gradient of MSE due to its simplicity. However, the Gaussian approximation of NMI involves flattening $I_{\text{fix}}$ and $I_{\text{mov}}\circ\bm{\phi}$ to compute the joint probability, potentially resulting in the scattering pattern shown in Fig. 11 and influencing performance, as $\nabla\mathcal{L}_{\text{inner}}$ is the sole image information passed into the RIIR cell.
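To illustrate what an analytical gradient of MSE with respect to the deformation looks like, the sketch below works through a deliberately simplified 1D case. It is not the implementation used in our experiments, which operates on 2D and 3D images: it assumes a displacement parameterization $\phi(x)=x+u(x)$, linear interpolation, and samples clamped to the signal border, and all function names are hypothetical. By the chain rule, the gradient at each sample is the residual multiplied by the local slope of the moving signal.

```python
import numpy as np


def mse_and_grad(moving, fixed, disp):
    """MSE between the warped 1D signal and the fixed one, plus d(MSE)/d(disp)."""
    n = moving.size
    grid = np.arange(n, dtype=float)
    raw = grid + disp                      # sampling positions phi(x) = x + u(x)
    pos = np.clip(raw, 0.0, n - 1.0)       # clamp samples to the signal support
    warped = np.interp(pos, grid, moving)  # linear interpolation
    residual = warped - fixed
    # Slope of the linear interpolant on the segment containing each sample.
    left = np.clip(np.floor(pos).astype(int), 0, n - 2)
    inside = (raw >= 0.0) & (raw <= n - 1.0)  # clamped samples have zero slope
    slope = (moving[left + 1] - moving[left]) * inside
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * residual * slope / n      # chain rule, one entry per sample
    return loss, grad
```

Because each gradient entry depends only on the local residual and the local intensity slope, the resulting map stays spatially coherent, which is the property that the flattening step in the NMI approximation may disturb.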

4.4 Experiment 4: Inclusion of Inner Loss Gradient as RIIR Input

The results are shown in Fig. 9. It can be observed that the network struggled when the warped image $I_{\text{mov}}\circ\bm{\phi}$ was only implicitly fed to the RIIR cell, i.e., there is an additional cost to learn the transformation. During training, we also noticed that when trained with explicit warped images, the network tends to produce a larger deformation magnitude, since the images are more detailed than the gradient input (as shown in Fig. 11). Based on the results, we conjecture that the explicit input often offers more details to guide the optimization, which benefits the training. However, it can also be redundant when data availability is limited, thus potentially leading to overfitting.

4.5 Experiment 5: RIIR Architecture Ablation

We here demonstrate the model architecture ablation by showing the corresponding boxplots in Fig. 13. Apart from the main experiments shown previously, these experiments can be regarded as minor ablation studies, aiming to strike a balance between computational precision and inference speed.

5 Discussion

In this study, we introduced RIIR, a deep learning-based medical image registration method that leverages recurrent inferences as a meta-learning strategy. We extended Recurrent Inference Machines (RIMs) to the image registration problem, which has no explicit forward models. RIIR was extensively evaluated on public brain MR and in-house quantitative cardiac MR datasets, and demonstrated consistently improved performance over established deep learning models, in both one-step and iterative settings. Additionally, our ablation study confirmed the importance of incorporating hidden states within the RIM-based framework.

The performance improvement in registration is especially pronounced in scenarios with limited training data, as demonstrated in Fig. 4. Notably, RIIR achieved superior average evaluation metrics on segmentation accuracy and groupwise alignment, with significantly lower variance. This advantage was pronounced in the mSASHA experiments with very limited training data, in stark contrast to the baseline models, which required much more training data to reach similar performance. Though both GraDIRN and RIIR are iterative methods, GraDIRN isolates the update of the explicit $\mathcal{L}_{\text{sim}}$ from the deep learning-based $\mathcal{L}_{\text{reg}}$ with no internal states, thus potentially resulting in worse generalization and slower convergence compared with RIIR. Moreover, GraDIRN initializes the deformation field randomly by default, which could lead to optimization difficulties when training data are extremely limited. In the mSASHA dataset, the qualitative visual comparison in Fig. 7 highlights that incorporating gradient information in RIIR can better preserve the original structure of the myocardium during registration. Furthermore, the step-wise loss function in Eq. 13 also prevents exaggerated registration, which can compromise accuracy and affect the resulting quantitative maps, as observed for the baseline methods.

Although hidden states were used in the original RIM and later works (Putzky and Welling, 2017; Putzky et al., 2019; Sabidussi et al., 2021, 2023), their impact on the optimization of RIM-based methods has not been investigated in detail. Our second experiment investigates the impact of these hidden states within RIIR. Our findings reveal that the presence of hidden states, as proposed in the original RIM work (Putzky and Welling, 2017), contributes positively to the performance of our model, as shown by the quantitative results in Fig. 9 and Fig. 8. However, in the 3D OASIS dataset, this improvement was less pronounced than in the mSASHA dataset, where multiple images in a time series are evaluated together.

We further investigated the influence of inner loss selection (Experiment 3), which was not studied before in iterative deep learning-based registration (Qiu et al., 2022). In mSASHA, with $\mathcal{L}_{\text{sim}}$ in $\mathcal{L}_{\text{outer}}$ fixed as NMI, choosing NMI as $\mathcal{L}_{\text{inner}}$ gave less satisfactory quantitative results than the other choices (MSE, NCC). This is probably due to the implementation of the NMI operation, which may disrupt the neighborhood structure of the original image in the input to the RIIR cell, as shown in Fig. 11. Nevertheless, this discrepancy suggests that the gradient information of $\mathcal{L}_{\text{inner}}$ can be considered a form of compressed image data that is processed by the convolutional operations in the RIIR cell. The ablation study on input combinations (Experiment 4) indicates that, in scenarios with limited data, the inclusion of gradient input effectively prevents model overfitting, thus leading to better generalization performance, as shown in Fig. 12.

RIIR’s superior registration performance and data efficiency suggest its potential for applications in medical image registration. However, it is necessary to acknowledge the current limitations in order to further enhance the framework in future work. First, the performance of RIIR could be bottlenecked by the simplistic convolutional GRU unit; potential improvements include dilated convolutional kernels, which provide larger receptive fields. Second, the expanding computation graph of RIM-based models, with increasing inference steps, leads to heavy GPU memory consumption.

6 Conclusion

In conclusion, we present RIIR, a novel recurrent deep-learning framework for medical image registration. RIIR significantly extends the concept of recurrent inference machines for inverse problem solving, to high-dimensional optimization challenges with no closed-form forward models. Meanwhile, RIIR distinguishes itself from previous iterative methods by integrating implicit regularization with explicit loss gradients. Our experiments across diverse medical image datasets demonstrated RIIR’s superior accuracy and data efficiency. We also empirically demonstrated the effectiveness of its architectural design and the value of hidden states, significantly enhancing both registration accuracy and data efficiency. RIIR is shown to be an effective and generalizable tool for medical image registration, and potentially extends to other high-dimensional optimization problems.

Acknowledgments

This work was partly supported by the TU Delft AI Initiative, Amazon Research Awards, and the National Heart, Lung, and Blood Institute, National Institutes of Health by the Division of Intramural Research (Z1A-HL006214).

Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

Conflicts of Interest

We declare that we have no conflicts of interest.

References

Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, 29, 2016.
Avants et al. (2008) Brian B Avants, Charles L Epstein, Murray Grossman, and James C Gee. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical Image Analysis, 12(1):26–41, 2008.
Avants et al. (2011) Brian B Avants, Nicholas J Tustison, Gang Song, Philip A Cook, Arno Klein, and James C Gee. A reproducible evaluation of ants similarity metric performance in brain image registration. NeuroImage, 54(3):2033–2044, 2011.
Balakrishnan et al. (2019) Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Voxelmorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 38(8):1788–1800, 2019.
Byrne et al. (2022) Mikel Byrne, Ben Archibald-Heeren, Yunfei Hu, Amy Teh, Rhea Beserminji, Emma Cai, Guilin Liu, Angela Yates, James Rijken, Nick Collett, and Trent Aland. Varian ethos online adaptive radiotherapy for prostate cancer: Early results of contouring accuracy, treatment plan quality, and treatment time. Journal of Applied Clinical Medical Physics, 23(1):e13479, 2022.
Cao et al. (2017) Xiaohuan Cao, Jianhua Yang, Jun Zhang, Dong Nie, Minjeong Kim, Qian Wang, and Dinggang Shen. Deformable image registration based on similarity-steered cnn regression. In Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part I 20, pages 300–308. Springer, 2017.
Chen et al. (2017) Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando Freitas. Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning, pages 748–756. PMLR, 2017.
Chow et al. (2022) Kelvin Chow, Genevieve Hayes, Jacqueline A Flewitt, Patricia Feuchter, Carmen Lydell, Andrew Howarth, Joseph J Pagano, Richard B Thompson, Peter Kellman, and James A White. Improved accuracy and precision with three-parameter simultaneous myocardial t1 and t2 mapping using multiparametric sasha. Magnetic Resonance in Medicine, 87(6):2775–2791, 2022.
Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Cotter and Conwell (1990) Neil E Cotter and Peter R Conwell. Fixed-weight networks can learn. In International Joint Conference on Neural Networks, pages 553–559. IEEE, 1990.
Dalca et al. (2019) Adrian V Dalca, Guha Balakrishnan, John Guttag, and Mert R Sabuncu. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis, 57:226–236, 2019.
De Vos et al. (2019) Bob D De Vos, Floris F Berendsen, Max A Viergever, Hessam Sokooti, Marius Staring, and Ivana Išgum. A deep learning framework for unsupervised affine and deformable image registration. Medical Image Analysis, 52:128–143, 2019.
de Vos et al. (2020) Bob D de Vos, Bas HM van der Velden, Jörg Sander, Kenneth GA Gilhuijs, Marius Staring, and Ivana Išgum. Mutual information for unsupervised deep learning image registration. In Medical Imaging 2020: Image Processing, volume 11313, pages 155–161. SPIE, 2020.
Fechter and Baltas (2020) Tobias Fechter and Dimos Baltas. One-shot learning for deformable medical image registration and periodic motion tracking. IEEE Transactions on Medical Imaging, 39(7):2506–2517, 2020.
Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
Haskins et al. (2020) Grant Haskins, Uwe Kruger, and Pingkun Yan. Deep learning in medical image registration: a survey. Machine Vision and Applications, 31:1–18, 2020.
Hering et al. (2019) Alessa Hering, Bram van Ginneken, and Stefan Heldmann. mlvirnet: Multilevel variational image registration network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, pages 257–265. Springer, 2019.
Hoopes et al. (2021) Andrew Hoopes, Malte Hoffmann, Bruce Fischl, John Guttag, and Adrian V Dalca. Hypermorph: Amortized hyperparameter learning for image registration. In Information Processing in Medical Imaging: 27th International Conference, IPMI 2021, Virtual Event, June 28–June 30, 2021, Proceedings 27, pages 3–17. Springer, 2021.
Hospedales et al. (2021) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2021.
Huizinga et al. (2016) W. Huizinga, D.H.J. Poot, J.-M. Guyader, R. Klaassen, B.F. Coolen, M. van Kranenburg, R.J.M. van Geuns, A. Uitterdijk, M. Polfliet, J. Vandemeulebroucke, A. Leemans, W.J. Niessen, and S. Klein. Pca-based groupwise image registration for quantitative mri. Medical Image Analysis, 29:65–78, 2016. ISSN 1361-8415. URL https://www.sciencedirect.com/science/article/pii/S1361841515001851.
Jin et al. (2021) Cheng Jin, Heng Yu, Jia Ke, Peirong Ding, Yongju Yi, Xiaofeng Jiang, Xin Duan, Jinghua Tang, Daniel T Chang, Xiaojian Wu, Feng Gao, and Ruijiang Li. Predicting treatment response from longitudinal images using multi-task deep learning. Nature Communications, 12(1):1851, 2021.
Kanter and Lellmann (2022) Frederic Kanter and Jan Lellmann. A flexible meta learning model for image registration. In Ender Konukoglu, Bjoern Menze, Archana Venkataraman, Christian Baumgartner, Qi Dou, and Shadi Albarqouni, editors, Proceedings of The 5th Medical Imaging with Deep Learning, volume 172 of Proceedings of Machine Learning Research, pages 638–652. PMLR, 06–08 Jul 2022. URL https://proceedings.mlr.press/v172/kanter22a.html.
Karkalousos et al. (2022) Dimitrios Karkalousos, Samantha Noteboom, Hanneke E Hulst, Franciscus M Vos, and Matthan WA Caan. Assessment of data consistency through cascades of independently recurrent inference machines for fast and robust accelerated mri reconstruction. Physics in Medicine & Biology, 67(12):124001, 2022.
King et al. (2010) Andrew P King, Kawal S Rhode, Y Ma, Cheng Yao, Christian Jansen, Reza Razavi, and Graeme P Penney. Registering preprocedure volumetric images with intraprocedure 3-d ultrasound using an ultrasound imaging model. IEEE Transactions on Medical Imaging, 29(3):924–937, 2010.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
Klein et al. (2007) Stefan Klein, Marius Staring, and Josien PW Pluim. Evaluation of optimization methods for nonrigid medical image registration using mutual information and b-splines. IEEE Transactions on Image Processing, 16(12):2879–2890, 2007.
Liu et al. (2021) Risheng Liu, Zi Li, Xin Fan, Chenying Zhao, Hao Huang, and Zhongxuan Luo. Learning deformable image registration from optimization: perspective, modules, bilevel training and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7688–7704, 2021.
Lønning et al. (2019) Kai Lønning, Patrick Putzky, Jan-Jakob Sonke, Liesbeth Reneman, Matthan WA Caan, and Max Welling. Recurrent inference machines for reconstructing heterogeneous mri data. Medical Image Analysis, 53:64–78, 2019.
Marcus et al. (2007) Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience, 19(9):1498–1507, 2007.
Messroghli et al. (2004) Daniel R Messroghli, Aleksandra Radjenovic, Sebastian Kozerke, David M Higgins, Mohan U Sivananthan, and John P Ridgway. Modified look-locker inversion recovery (molli) for high-resolution t1 mapping of the heart. Magnetic Resonance in Medicine, 52(1):141–146, 2004.
Miao et al. (2016) Shun Miao, Z Jane Wang, and Rui Liao. A cnn regression approach for real-time 2d/3d registration. IEEE Transactions on Medical Imaging, 35(5):1352–1363, 2016.
Modi et al. (2021) Chirag Modi, François Lanusse, Uroš Seljak, David N Spergel, and Laurence Perreault-Levasseur. Cosmicrim: reconstructing early universe by combining differentiable simulations with recurrent inference machines. arXiv preprint arXiv:2104.12864, 2021.
Morningstar et al. (2019) Warren R Morningstar, Laurence Perreault Levasseur, Yashar D Hezaveh, Roger Blandford, Phil Marshall, Patrick Putzky, Thomas D Rueter, Risa Wechsler, and Max Welling. Data-driven reconstruction of gravitationally lensed galaxies using recurrent inference machines. The Astrophysical Journal, 883(1):14, 2019.
Muckley et al. (2021) Matthew J. Muckley, Bruno Riemenschneider, Alireza Radmanesh, Sunwoo Kim, Geunu Jeong, Jingyu Ko, Yohan Jun, Hyungseob Shin, Dosik Hwang, Mahmoud Mostapha, Simon Arberet, Dominik Nickel, Zaccharie Ramzi, Philippe Ciuciu, Jean-Luc Starck, Jonas Teuwen, Dimitrios Karkalousos, Chaoping Zhang, Anuroop Sriram, Zhengnan Huang, Nafissa Yakubova, Yvonne W. Lui, and Florian Knoll. Results of the 2020 fastmri challenge for machine learning mr image reconstruction. IEEE Transactions on Medical Imaging, 40(9):2306–2317, 2021.
Oliveira and Tavares (2014) Francisco PM Oliveira and Joao Manuel RS Tavares. Medical image registration: a review. Computer Methods in Biomechanics and Biomedical Engineering, 17(2):73–93, 2014.
Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
Putzky and Welling (2017) Patrick Putzky and Max Welling. Recurrent inference machines for solving inverse problems. arXiv preprint arXiv:1706.04008, 2017.
Putzky et al. (2019) Patrick Putzky, Dimitrios Karkalousos, Jonas Teuwen, Nikita Miriakov, Bart Bakker, Matthan Caan, and Max Welling. i-rim applied to the fastmri challenge. arXiv preprint arXiv:1910.08952, 2019.
Qiu et al. (2021) Huaqi Qiu, Chen Qin, Andreas Schuh, Kerstin Hammernik, and Daniel Rueckert. Learning diffeomorphic and modality-invariant registration using b-splines. In Medical Imaging with Deep Learning, 2021.
Qiu et al. (2022) Huaqi Qiu, Kerstin Hammernik, Chen Qin, Chen Chen, and Daniel Rueckert. Embedding gradient-based optimization in image registration networks. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2022, pages 56–65. Springer, 2022.
Rohé et al. (2017) Marc-Michel Rohé, Manasi Datar, Tobias Heimann, Maxime Sermesant, and Xavier Pennec. Svf-net: learning deformable image registration using shape matching. In Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part I 20, pages 266–274. Springer, 2017.
Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
Rueckert and Schnabel (2019) Daniel Rueckert and Julia A Schnabel. Model-based and data-driven strategies in medical image computing. Proceedings of the IEEE, 108(1):110–124, 2019.
Sabidussi et al. (2021) Emanoel R Sabidussi, Stefan Klein, Matthan WA Caan, Shabab Bazrafkan, Arnold J den Dekker, Jan Sijbers, Wiro J Niessen, and Dirk HJ Poot. Recurrent inference machines as inverse problem solvers for mr relaxometry. Medical Image Analysis, 74:102220, 2021.
Sabidussi et al. (2023) ER Sabidussi, S Klein, B Jeurissen, and DHJ Poot. dtiRIM: A generalisable deep learning method for diffusion tensor imaging. NeuroImage, 269:119900, 2023.
Sandkühler et al. (2019) Robin Sandkühler, Simon Andermatt, Grzegorz Bauman, Sylvia Nyilas, Christoph Jud, and Philippe C Cattin. Recurrent registration neural networks for deformable image registration. Advances in Neural Information Processing Systems, 32, 2019.
Sauer (2006) Frank Sauer. Image registration: enabling technology for image guided surgery and therapy. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, pages 7242–7245. IEEE, 2006.
Schmidhuber (1993) Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, pages 407–412. IEEE, 1993.
Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 2015.
Sotiras et al. (2013) Aristeidis Sotiras, Christos Davatzikos, and Nikos Paragios. Deformable medical image registration: A survey. IEEE Transactions on Medical Imaging, 32(7):1153–1190, 2013.
Staring et al. (2009) Marius Staring, Uulke A van der Heide, Stefan Klein, Max A Viergever, and Josien PW Pluim. Registration of cervical mri using multifeature mutual information. IEEE Transactions on Medical Imaging, 28(9):1412–1421, 2009.
Studholme et al. (1999) Colin Studholme, Derek LG Hill, and David J Hawkes. An overlap invariant entropy measure of 3d medical image alignment. Pattern Recognition, 32(1):71–86, 1999.
Thévenaz and Unser (2000) Philippe Thévenaz and Michael Unser. Optimization of mutual information for multiresolution image registration. IEEE Transactions on Image Processing, 9(12):2083–2099, 2000.
van Harten et al. (2023) Louis van Harten, Rudolf Leonardus Mirjam Van Herten, Jaap Stoker, and Ivana Isgum. Deformable image registration with geometry-informed implicit neural representations. In Medical Imaging with Deep Learning, 2023.
Wolterink et al. (2022) Jelmer M Wolterink, Jesse C Zwienenberg, and Christoph Brune. Implicit neural representations for deformable image registration. In Medical Imaging with Deep Learning, pages 1349–1359. PMLR, 2022.
Xu et al. (2021) Junshen Xu, Eric Z Chen, Xiao Chen, Terrence Chen, and Shanhui Sun. Multi-scale neural odes for 3d medical image registration. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pages 213–223. Springer, 2021.
Yang et al. (2016) Xiao Yang, Roland Kwitt, and Marc Niethammer. Fast predictive image registration. In Deep Learning and Data Labeling for Medical Applications: First International Workshop, LABELS 2016, and Second International Workshop, DLMIA 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, October 21, 2016, Proceedings 1, pages 48–57. Springer, 2016.
Yang et al. (2017) Xiao Yang, Roland Kwitt, Martin Styner, and Marc Niethammer. Quicksilver: Fast predictive image registration – a deep learning approach. NeuroImage, 158:378–396, 2017. ISSN 1053-8119.
Younger et al. (1999) A Steven Younger, Peter R Conwell, and Neil E Cotter. Fixed-weight on-line learning. IEEE Transactions on Neural Networks, 10(2):272–283, 1999.
Zbontar et al. (2018) Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael G. Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. Sodickson, and Yvonne W. Lui. fastmri: An open dataset and benchmarks for accelerated mri. arXiv preprint arXiv:1811.08839, 2018.
Zhang et al. (2021) Yungeng Zhang, Yuru Pei, and Hongbin Zha. Learning dual transformer network for diffeomorphic registration. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pages 129–138. Springer, 2021.
Zhao et al. (2019) Shengyu Zhao, Yue Dong, Eric I Chang, and Yan Xu. Recursive cascaded networks for unsupervised medical image registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10600–10610, 2019.

A Supplementary Material

Fig. 14 shows the boxplots for Experiment 1, in terms of HD on the OASIS dataset and $\mathcal{D}_{\text{PCA2}}$ on the mSASHA dataset.