Modeling Foreground Spatial Variations in 21 cm Gaussian Process Component Separation

Kangning Diao Department of Astronomy, Tsinghua University, Beijing 100084, China Richard D.P. Grumitt Department of Astronomy, Tsinghua University, Beijing 100084, China Yi Mao Department of Astronomy, Tsinghua University, Beijing 100084, China Richard D.P. Grumitt, Yi Mao rgrumitt@tsinghua.edu.cn (RDPG), ymao@tsinghua.edu.cn (YM)

(Received XXX; Revised YYY; Accepted ZZZ)

Abstract

Gaussian processes (GPs) have been extensively utilized as nonparametric models for component separation in 21 cm data analyses. This exploits the distinct spectral behavior of the cosmological and foreground signals, which are modeled through the GP covariance kernel. Previous approaches have employed a global GP kernel along all lines of sight (LoS). In this work, we study Bayesian approaches that allow for spatial variations in foreground kernel parameters, testing them against simulated H I intensity mapping observations. We consider a no-pooling (NP) model, which treats each LoS independently by fitting for separate covariance kernels, and a hierarchical Gaussian Process (HGP) model that allows for variation in kernel parameters between different LoS, regularized through a global hyperprior. We find that accounting for spatial variations in the GP kernel parameters results in a significant improvement in cosmological signal recovery, achieving up to a 30% reduction in the standard deviation of the residual distribution and improved model predictive performance. Allowing for spatial variations in GP kernel parameters also improves the recovery of the H I power spectra and wavelet scattering transform coefficients. Whilst the NP model achieves superior recovery as measured by the residual distribution, it demands extensive computational resources, faces formidable convergence challenges, and is prone to overfitting. Conversely, the HGP model strikes a balance between the accuracy and robustness of the signal recovery. Further improvements to the HGP model will require more physically motivated modeling of foreground spatial variations. Our code can be found in this repo.

Cosmology(343) — Large-scale structure of the universe(902) — 21 cm lines(690) — Gaussian Processes regression(1930)

^†^†journal: ApJ^†^†software: Kymatio (Andreux et al., 2018), numpyro (Phan et al., 2019; Bingham et al., 2019), JAX (Bradbury et al., 2018), GPR4IM (Soares et al., 2022), ArviZ (Kumar et al., 2019)

1 Introduction

Measurements of the 21 cm signal from neutral hydrogen provide a powerful cosmological probe. At high redshifts ( $z\gtrsim 6$ ), these observations are a key window into the astrophysics of the Epoch of Reionization (EoR) and Cosmic Dawn (CD), as well as allowing for independent constraints on cosmological parameters (Pritchard & Loeb, 2012; Furlanetto, 2016). At lower redshifts, the neutral hydrogen (H I) intensity mapping can be used as a tracer for the large-scale structure of the Universe (Bharadwaj et al., 2001; Battye et al., 2004; Chang et al., 2008). However, the accurate detection and characterization of such cosmological signals face a severe challenge in foreground removal. Any 21 cm signal will be orders of magnitude weaker than the foreground emission, consisting of extragalactic emission along with diffuse and partially polarized Galactic emission (Di Matteo et al., 2002; Oh & Mack, 2003; Wang et al., 2006; Jelić et al., 2008; Bernardi et al., 2009, 2010; Liu & Tegmark, 2011, 2012; Wolz et al., 2014; Alonso et al., 2015; Ghosh et al., 2020; Liu & Shaw, 2020; Cunnington et al., 2021).

Foreground removal methods typically proceed by exploiting the distinct spectral behaviors of the foreground and 21 cm components. The foreground emission here is usually assumed to vary much more smoothly with frequency than the 21 cm signal (Liu & Tegmark, 2011; Alonso et al., 2015; Shaw et al., 2015; Mertens et al., 2018). This can come in the form of parametric modeling where, e.g., synchrotron emission is modeled as following a power law with frequency (Eriksen et al., 2008; Bigot-Sazy et al., 2015; Cunnington et al., 2021; Hibbard et al., 2023). Parametric modeling in this form has advantages insofar as the parameters can be easily interpreted, allowing us to assign physically motivated priors to spectral parameters. However, the spectral energy distributions (SEDs) of the foreground components are more complicated than such simple parametric models. Systematics in such a modeling has the potential to induce significant biases in the final signal extraction when the signals of interest are orders of magnitude weaker than the foreground components (Remazeilles et al., 2016, 2018). It is worth noting that these problems arise even if the underlying emission mechanism follows an exact power law with frequency. Given multiple bodies along the line of sight (LoS) emitting as power laws, averaging effects along the LoS, along with beam and pixel averaging will induce spectral distortions which have the potential to bias simple parametric fits (Chluba et al., 2017).

As an alternative, one can adopt non-parametric methods where one does not assume an explicit form for the emission component SEDs. In this vein, Gaussian processes (GPs) offer an attractive solution and have seen previous applications to 21 cm component separation both on simulations and real data (Mertens et al., 2018, 2020, 2024; Gehlot et al., 2019; Offringa et al., 2019; Ghosh et al., 2020; Hothi et al., 2021; Kern & Liu, 2021; Soares et al., 2022). GPs belong to a class of non-parametric Bayesian models, where one effectively assigns a prior over the space of functions that describe the emission component SEDs (Rasmussen & Williams, 2006). The distinct spectral behaviors of the emission components can be accounted for through the GP covariance kernel, enabling a flexible component separation that can account for departures from simple parametric models. One can also fully marginalize the GP model parameters, allowing for a full propagation of uncertainty through the component separation process (Flaxman et al., 2015; Lalchand & Rasmussen, 2020).

In this work, we consider adapting the GP component separation framework to allow spatial variations in the foreground kernel parameters, so that we can better model spatial variations in the foreground spectral properties (Planck Collaboration et al., 2016a; Remazeilles et al., 2018; Jew & Grumitt, 2020). In previous work, the foreground GP kernel parameters have been assumed to be identical along every LoS, i.e. a “complete pooling” model (e.g., Soares et al. 2022; hereafter S22). We study the impact of allowing spatial variations through a “no pooling” model, where the foreground kernel parameters are allowed to vary independently along every LoS, and a hierarchical model where the kernel parameters are allowed to vary but are assumed to be drawn from some underlying hyper-distribution, whose parameters are jointly fit during the inference process. The hyper-distribution acts to regularize pixel-to-pixel variations in GP kernel parameters and allows for the sharing of information between different LoS (Gelman, 2006; Gelman & Hill, 2006). Hierarchical foreground modeling has previously been used in the context of parametric cosmic microwave background (CMB) component separation in Grumitt et al. (2020).

The structure of the rest of this paper is as follows. In Section 2 we give an overview of GP models, their specific application to 21 cm component separation, and the form of the various GP models we consider in this work. In Section 3 we describe the simulated H I intensity mapping observations, against which we test our component separation method, consisting of a mock MeerKLASS-type survey (Santos et al., 2016), and the computational setup used to perform inference over our models. In Section 4 we describe the results from these simulations, including the uncompressed pixel-level recovery and predictive performance, along with the recovery of H I field summary statistics. In Section 5 we discuss notable modeling and recovery details, covering the effect of bias correction on power spectrum recovery, the modeling choices used to define the hierarchical GP models, and the GP inference approach used in this work. Finally, in Section 6 we summarize our method and results and discuss improvements that can be made in the modeling of foreground spatial variations for component separation.

2 Gaussian Processes

2.1 Overview

Refer to caption — Figure 1: Samples from different GP kernels with different length scales $l$ . It can be seen that decreasing the length scale results in more rapidly varying GP realizations. Functions allowed by the RBF kernel are infinitely differentiable, which manifests in the smooth GP realizations seen in the left panels, irrespective of the length scale. The functions allowed by the exponential kernel are non-differentiable, which manifests itself in the spiky function realizations seen in the right panels.

GP models are a class of Bayesian non-parametric models that assign a prior distribution over the space of functions (Gelman et al., 2013; Rasmussen & Williams, 2006). Broadly, a GP prior consists of an infinite set of random variables for each of the possible inputs of the function. A GP prior may be written as

f\sim\mathcal{GP}(m,K),

(1)

where $m$ is the mean function and $K$ is the GP covariance kernel. The mean function defines the function around which realizations from the GP will be distributed, whilst the covariance kernel describes how GP realizations vary around the mean function. The mean function can be decoupled from the GP, writing $f=m+g$ where $g\sim\mathcal{GP}(0,K)$ , and is commonly chosen to be the zero-function. However, informed choice of the mean function, e.g., using some parametric model around which the GP models deviations, can significantly improve regression performance (Shafieloo et al., 2012; Ruiz-Zapatero et al., 2022). The choice of covariance kernel is key to assigning a prior over the expected behavior of the regression function. For example, when one expects smooth functions, a kernel that allows for infinitely differentiable functions, such as the radial basis function (RBF) kernel, can be selected as follows

K(x,x^{\prime})=\sigma^{2}\exp\left(-\frac{\lVert x-x^{\prime}\rVert^{2}}{2l^{% 2}}\right).

(2)

The GP variance $\sigma^{2}$ determines the expected amplitude over which the GP realizations vary, whilst the length-scale parameter $l$ determines the typical length-scale over which a GP realization varies with $x$ . In settings where the function is expected to vary in a more spiky fashion, a kernel from the Matern class may be selected,

	$\displaystyle K(x,x^{\prime})$	$\displaystyle=\sigma^{2}\frac{2^{1-\rho}}{\Gamma(\rho)}\left(\frac{\sqrt{2\rho% }\lVert x-x^{\prime}\rVert}{l}\right)^{\rho}$		(3)
		$\displaystyle\times K_{\rho}\left(\frac{\sqrt{2\rho}\lVert x-x^{\prime}\rVert}% {l}\right),$		(3)

where $\sigma^{2}$ and $l$ have analogous interpretations as for the squared-exponential kernel, and $K_{\rho}(\cdot)$ is a modified Bessel function of the second kind. The parameter $\rho>0$ controls the smoothness of the functions allowed by the Matern kernel, with GPs described by Matern kernels being $\lceil\rho\rceil-1$ times differentiable, where $\lceil\rho\rceil$ is the ceiling of $\rho$ . In the limit $\rho\rightarrow\infty$ , the Matern kernel converges to the RBF kernel.

In the setting where we have some finite set of observed data points $y_{i},\,i\in\{1,\ldots,N\}$ , corresponding to covariates $x_{i},\,i\in\{1,\ldots,N\}$ , GP regression is made tractable by the fact that the joint distribution over the GP realizations at these points is a multivariate Gaussian with mean $m_{i}=m(x_{i}),\,i\in\{1,\ldots,N\}$ and covariance $K_{ij}=K(x_{i},x_{j}),\,i,j\in\{1,\ldots,N\}$ . We may write a generic GP regression model (assuming zero mean function) as,

$\displaystyle\Phi$	$\displaystyle\sim\pi_{\Phi}(\Phi),$	(4)
$\displaystyle f(x)$	$\displaystyle\sim\mathcal{GP}(0,K(x,x^{\prime}\|\Phi)),$
$\displaystyle\sigma_{\eta}$	$\displaystyle\sim\pi_{\sigma_{\eta}}(\sigma_{\eta}),$
$\displaystyle y$	$\displaystyle\sim\pi_{y}(y\|f(x),\sigma_{\eta}),$

where $\Phi$ are the GP covariance parameters with corresponding prior $\pi_{\Phi}(\Phi)$ . The regression function $f(x)$ is assigned a GP prior, $\mathcal{GP}(0,K(x,x^{\prime}|\Phi))$ . The parameter $\sigma_{\eta}$ is chosen to describe the observation noise process and is assigned some prior $\pi_{\sigma_{\eta}}(\sigma_{\eta})$ . The data is distributed as $y\sim\pi_{y}(y|f(x),\sigma_{\eta})$ , depending on the GP function realizations and the observation noise process. In the case where the likelihood is Gaussian, the GP prior is conjugate to the likelihood, and the GP function realizations can be analytically marginalized. Taking $\sigma_{\eta}$ to be the Gaussian noise standard deviation, the analytic marginalization yields the simpler model,

$\displaystyle\Phi$	$\displaystyle\sim\pi_{\Phi}(\Phi),$	(5)
$\displaystyle\sigma_{\eta}$	$\displaystyle\sim\pi_{\sigma_{\eta}}(\sigma_{\eta}),$
$\displaystyle y$	$\displaystyle\sim\mathcal{N}(y\|0,K(x,x\|\Phi)+\sigma_{\eta}^{2}I_{N}),$

where $\mathcal{N}$ is a multivariate Gaussian distribution and $I_{N}$ is the $N\times N$ identity matrix. This amounts to a significant dimensionality reduction, with posterior inference only needing to be performed over the GP covariance and noise parameters.

A typical approach at this point is to optimize the log-marginal posterior with respect to $\Phi$ and $\sigma$ , then fixing their values at the maximum a-posteriori (MAP) values for the downstream inference of $f(x)$ (Rasmussen & Williams, 2006). This can be viewed under the empirical Bayes framework, where the data are used to set the prior parameters, which are fixed for inference over the remaining model parameters. This approach has the advantage of being fast compared to full integration over the marginal posterior. However, it comes with two significant drawbacks. By fixing $\Phi$ and $\sigma$ at their MAP values we do not fully propagate the posterior uncertainty on them to our inferences over $f(x)$ . In addition, the MAP estimate for parameter values is generally not a good estimator. For many models, using the MAP value as an estimator can lead to poor predictive performance and can result in significantly biased parameter inferences (MacKay, 2003; Betancourt, 2017). This has previously been noted in GP models, where MAP estimation has been observed to tend to result in overfitting (Lalchand & Rasmussen, 2020). Problems with MAP estimation are particularly apparent for hierarchical Bayesian models, which we consider in this work, where correlations between model hyperparameters and latent variables result in posterior geometries where the MAP is very far from the typical set, i.e., where the probability mass of the posterior is concentrated (Betancourt & Girolami, 2015; Betancourt, 2017; Hoffman et al., 2019). Given these problems with MAP estimation, we consider fully Bayesian GP models throughout this work, and demonstrate the degradation in our cosmological signal recovery using MAP estimation in Section 5.3.

For fixed $\Phi$ and $\sigma_{\eta}$ , the expectation and covariance of the GP function realizations at some points $x_{*}$ may be obtained through the standard prediction formulae,

$\displaystyle\mathbb{E}\left[f(x_{*})\|\Phi,\sigma_{\eta}\right]$	$\displaystyle=K(x_{*},x\|\Phi)\left(K(x,x\|\Phi)+\sigma_{\eta}^{2}I_{N}\right)^{% -1}y,$	(6)
$\displaystyle\mathrm{cov}\left[f(x_{*})\|\Phi,\sigma_{\eta}\right]$	$\displaystyle=K(x_{},x_{}\|\Phi)$
$\displaystyle-K(x_{*},x\|\Phi)$	$\displaystyle\left(K(x,x\|\Phi)+\sigma_{\eta}^{2}I_{N}\right)^{-1}K(x,x_{*}\|% \Phi).$

Posterior samples for $\Phi$ and $\sigma_{\eta}$ obtained from the marginalized model in Equation (5) can be used to construct an ensemble estimate for the posterior predictive distribution over $f(x_{*})$ (Lalchand & Rasmussen, 2020). The expected signal is obtained by calculating the ensemble average,

\mathbb{E}\left[f(x_{*})\right]=\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}\left[f(x_{% *})|\Phi^{j},\sigma_{\eta}^{j}\right],

(7)

where the sum is over the $M$ posterior samples, indexed by $j$ . The covariance of the posterior predictive samples can be evaluated by drawing samples from the posterior predictive corresponding to each of the posterior samples, $f_{pp}^{j}(x_{*})\sim\mathcal{N}(\mathbb{E}\left[f(x_{*})|\Phi^{j},\sigma_{% \eta}^{j}\right],\mathrm{cov}\left[f(x_{*})|\Phi^{j},\sigma_{\eta}^{j}\right])$ , and evaluating the ensemble covariance,

	$\displaystyle\mathrm{cov}\left[f(x_{})\right]=\frac{1}{M-1}\sum_{j=1}^{M}% \left\{\left(f_{pp}^{j}(x_{})-\mathbb{E}[f(x_{*})]\right)\right.$		(8)
	$\displaystyle\left.\otimes\left(f_{pp}^{j}(x_{})-\mathbb{E}[f(x_{})]\right)% \right\}.$		(8)

2.2 Gaussian Processes for Component Separation

GP regression can be utilized in component separation. In this paper, we focus on 21 cm component separation, and as such consider a sky signal consisting of astrophysical foreground emission, polarization leakage, and the cosmological H I signal. These will be described in detail in Section 3. The core idea behind using GP models for component separation is to model the total sky signal as an additive GP realization, i.e., a GP realization that is the sum of foreground, polarization leakage, and H I terms which are each assigned their own GP prior. Each emission component can then be separated by exploiting the fact that they each display different behavior as a function of frequency, which can be modeled through their GP covariance kernels.

For our sky model, we follow S22, decomposing the sky emission into a cosmological H I component, $f_{\rm HI}$ and a foreground component, $f_{\rm fg}$ . The observed sky signal along a given LoS may then be written as

d_{\rm obs}(\nu)=f_{\rm HI}+f_{\rm fg}+\eta_{\nu},

(9)

where $\eta_{\nu}\sim\mathcal{N}(0,\sigma_{\nu}^{2})$ is additive Gaussian noise at the observing frequency $\nu$ . The foreground signal is further decomposed as

f_{\rm fg}=f_{\rm sm}+f_{\rm pol},

(10)

where $f_{\rm sm}$ is the signal due to astrophysical emission that varies smoothly with frequency, and $f_{\rm pol}$ is the signal due to polarization leakage. The total GP covariance kernel is given by

K=K_{\rm HI}+K_{\rm fg}=K_{\rm HI}+K_{\rm sm}+K_{\rm pol},

(11)

where $K_{\rm HI}$ , $K_{\rm sm}$ , and $K_{\rm pol}$ are the covariance kernels for the HI, smooth astrophysical foreground, and polarization leakage terms respectively.

For the Gaussian observation model defined through Equation (9), and assuming a fixed noise variance for all frequencies $\sigma_{\nu}^{2}=\sigma_{\eta}^{2}$ , we can write down the prediction formulae for the expectation and covariance of the foreground signal for the fixed GP covariance kernel parameters,

	$\displaystyle\mathbb{E}\left[f_{\rm fg}\|K,\sigma_{\eta}\right]$	$\displaystyle=K_{\rm fg}(K_{\rm fg}+K_{\rm HI}+\sigma_{\eta}^{2}I_{N})^{-1}d_{% \rm obs},$		(12)
	$\displaystyle\mathrm{cov}\left[f_{\rm fg}\|K,\sigma_{\eta}\right]$	$\displaystyle=K_{\rm fg}-K_{\rm fg}\left(K_{\rm fg}+K_{\rm HI}+\sigma_{\eta}^{% 2}I_{N}\right)^{-1}K_{\rm fg},$		(12)

where $N$ denotes the number of observed frequencies. Given predictions for the foreground signal, $\mathbb{E}[f_{\rm fg}|K,\sigma_{\eta}]$ , one can calculate the residual $f_{\rm res}$ as our estimate of the 21 cm signal,

f_{\rm 21cm,est}=f_{\rm res}=d_{\rm obs}-\mathbb{E}[f_{\rm fg}|K,\sigma_{\eta}].

(13)

The final expected foreground and 21 cm signals, along with their corresponding uncertainties, can be obtained by following the procedure outlined in Section 2.1 for evaluating the uncertainty over the GP posterior predictive samples, using the posterior samples for the GP kernel and noise parameters.

We follow S22 in selecting the kernels for our emission components. Astrophysical foreground emission is expected to vary smoothly as a function of frequency. As such, astrophysical foregrounds are modeled through an RBF kernel,

K_{\rm sm}(\nu,\nu^{\prime})=\sigma^{2}_{\rm sm}\exp{\left(-\frac{\lVert\nu-% \nu^{\prime}\lVert^{2}}{2l_{\rm sm}^{2}}\right)},

(14)

where $\sigma_{\rm sm}$ is a hyperparameter that controls the amplitude of the spectrum, whilst $l_{\rm sm}$ determines the correlation between frequency channels. Larger values of $l_{\rm sm}$ indicate data in all frequency channels are highly correlated, resulting in a smoother spectrum. The left panel of Figure 1 shows samples from a GP with an RBF kernel, with kernel variance $\sigma^{2}=1$ and length scales $l=0.5$ and $l=0.05$ . It can be seen that the GP samples vary smoothly for both length scales, which is to be expected given that the functions allowed by the RBF kernel are infinitely differentiable.

For the 21 cm signal, we apply the exponential kernel,

K_{\rm HI}(\nu,\nu^{\prime})=\sigma^{2}_{\rm HI}\exp{\left(-\frac{\lVert\nu-% \nu^{\prime}\rVert}{2l_{\rm HI}}\right)}.

(15)

The exponential kernel corresponds to the Matern kernel with $\rho=1/2$ . The functions allowed by this kernel are non-differentiable, and therefore vary in a very spiky fashion with frequency, as demonstrated in the right panel of Figure 1. The $\sigma_{\rm HI}$ and $l_{\rm HI}$ parameters control the amplitude and correlation length scale of the H I signal.

Alongside astrophysical foreground emission, we also consider the effects of polarization leakage in our simulations. This is also expected to vary smoothly with frequency and is therefore modeled with an RBF kernel,

K_{\rm pol}(\nu,\nu^{\prime})=\sigma^{2}_{\rm pol}\exp{\left(-\frac{\lVert\nu-% \nu^{\prime}\rVert^{2}}{2l_{\rm pol}^{2}}\right)},

(16)

where $\sigma_{\rm pol}$ and $l_{\rm pol}$ control the amplitude and correlation length scale of the polarization leakage respectively. In this work we will consider a three-kernel model, with the total GP kernel being given by Equation (11), and also a two-kernel model where we model the emission due to astrophysical foregrounds and polarization leakage through a single RBF kernel. For the two-kernel model, the total GP kernel can be written as $K=K_{\rm HI}+K_{\rm sm}$ .

2.3 Gaussian Process Model Variants

In this section, we summarize the different GP models that we study in this work, which account for spatial variations in foreground emission in different ways. A graphical illustration of the models is shown in Figure 2.

2.3.1 Complete Pooling (CP) model

The standard GP model used for component separation (e.g., S22) uses the same set of kernels to describe the emission spectrum along every LoS, i.e., the kernel parameters for every LoS are pooled to a single value. In this work, we refer to this as the complete pooling (CP) model, which is used as our baseline throughout.

For the GP kernel parameters, we use a set of weakly informative priors, chosen to place the probability mass over the expected scales of the parameters. In general, when information on the expected scale of a parameter is available, it is appropriate to select a prior distribution that concentrates its probability mass over those scales. For example, for a strictly positive parameter $\theta$ that is expected to have a value $\sim 1$ , one might select the half-normal prior $\theta\sim\rm{HalfNormal}(\sigma=1)$ . This has the benefit of helping to regularize our inferences using prior information and avoids the pitfalls associated with common choices for so-called uninformative prior distributions. For example, selecting a very broad uniform distribution as a default prior concentrates most of the prior probability mass at extreme values within the support of the distribution, which can bias the subsequent posterior inference. From an algorithmic perspective, this can also cause problems for Markov Chain Monte Carlo (MCMC) algorithms. Without regularization from a weakly informative prior, the posterior geometry can be such that it severely frustrates convergence, in some cases jeopardizing the geometric ergodicity of the MCMC algorithm. These issues appear frequently in the fully Bayesian GP and hierarchical models we consider in this work. For a detailed discussion of prior choice and the impacts on inference and sampling, we refer the reader to (Gelman et al., 2013; Stan Development Team, 2012; Betancourt & Girolami, 2015; Gelman & Hennig, 2017; Gelman et al., 2017; Gabry et al., 2019).

The weakly informative priors for the GP kernel parameters in the CP model are given by

$\displaystyle\sigma_{\rm sm}^{2}/10^{-2}$	$\displaystyle\sim\rm{LogNormal}(0,4),$	(17)
$\displaystyle l_{\rm sm}$	$\displaystyle\sim\rm{InverseGamma}(2,1),$
$\displaystyle\sigma_{\rm pol}^{2}/10^{-6}$	$\displaystyle\sim\rm{LogNormal}(0,4),$
$\displaystyle l_{\rm pol}$	$\displaystyle\sim\rm{InverseGamma}(5,1),$
$\displaystyle\sigma_{\rm HI}^{2}/10^{-8}$	$\displaystyle\sim\rm{HalfNormal}(1),$
$\displaystyle l_{\rm HI}$	$\displaystyle\sim\rm{HalfNormal}(0.02).$

Here $\rm{LogNormal}(\mu,\varsigma)$ corresponds to a log-normal distribution with mean parameter $\mu$ and scale parameter $\varsigma$ , $\rm{InverseGamma}(\alpha,\beta)$ is the inverse-gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$ , and $\rm{HalfNormal}(\varsigma)$ is the half-normal distribution with scale parameter $\varsigma$ .

The astrophysical foreground and polarization leakage components are expected to be orders of magnitude brighter than the H I signal. For the GP variance parameters of these components, we select zero-avoiding log-normal priors that concentrate over the expected scales of the two emission components. For the length-scale parameters of the foreground components, we use inverse-gamma priors. The inverse-gamma distribution is a common choice for GP length scale priors. The distribution places negligible probability masses on very small length scales. For length scales below the minimum covariate spacing, the GP likelihood will plateau, putting considerable probability mass at these small scales. This can result in overfitting and also induce convergence problems for the MCMC algorithms (Stan Development Team, 2012). For the smoothly varying foreground emission components we do not expect such high-frequency signals, and therefore use the inverse-gamma prior to remove these modes. The remainder of the prior concentrates around the expected length scales of the foreground components, whilst having a heavy right tail to allow for low-frequency modes.

For the H I signal, we choose a half-normal prior for the GP kernel variance. This allows for potentially zero cosmological signal whilst down-weighting very large amplitude scales. For the length scale parameter, in this case, we also use a half normal prior, allowing the H I signal to be potentially completely uncorrelated in frequency, and marginalizing over any very high-frequency modes. We find that using a half-normal prior for the faint H I component does not frustrate the convergence of our sampling algorithm.

We account for noise in our GP model by fitting for a single noise standard deviation $\sigma_{\eta}$ , to which we assign the prior

\sigma_{\eta}/10^{-7}\sim\rm{HalfNormal}(10).

(18)

In real experiments, the true noise covariance can be challenging to estimate. This can be accounted for by marginalizing over the noise process, with some prior selected to concentrate over the expected noise scales. In more realistic setups, we may expect more complicated noise contributions e.g., heteroscedastic noise, where it would be appropriate to use a more sophisticated noise model than the uniform white noise we consider here. We leave more realistic noise modeling to future work, focusing here on spatial variations in foreground emission. We use the same noise model for all the GP models considered in this work.

2.3.2 No-Pooling (NP) model

The CP model assumes that the GP kernel parameters for the foreground emission are identical for every LoS. However, the spectral properties of diffuse Galactic emission are known to vary significantly across the sky (Planck Collaboration et al., 2016a; Remazeilles et al., 2018; Jew & Grumitt, 2020). Using a single set of GP kernel parameters for every LoS may fail to capture these variations, potentially resulting in mismodeling and a failure to accurately recover the underlying sky emission. At the opposite extreme from the CP model, we may consider a no-pooling (NP) model, where the foreground GP kernel parameters are assumed to vary completely independently along every LoS. We do not expect such variations in the H I GP prior along each LoS, and as such we retain the global kernel for the H I signal used in the CP model. The priors for the GP kernel parameters take the same form as in Equation (17), with the difference that now we have separate $\{\sigma_{\rm{sm},i},l_{\rm{sm},i},\sigma_{\rm{pol},i},l_{\rm{pol},i}\}$ for each LoS $i$ .

2.3.3 Hierarchical Gaussian Process (HGP) model

Sitting between the CP and NP models described above, we may consider a class of hierarchical Gaussian process (HGP) models. In this setting, we divide the whole data set into a set of superpixels containing multiple LoS, where the superpixel size is smaller than the full data set over the spatial dimensions. Within each superpixel, the smooth foreground and polarization leakage kernel parameters are assumed to be constant. Ideally, we would consider separate kernel parameters for each LoS. However, the memory requirements for such a setup render such a model intractable given our computational resources. We discuss the details of the HGP model setup in Section 3.2.

The priors for the foreground kernel parameters in a given superpixel $p$ are given by,

$\displaystyle\sigma_{\mathrm{sm},p}^{2}/10^{-2}$	$\displaystyle\sim\rm{HalfNormal}(\varsigma_{\rm sm}),$	(19)
$\displaystyle l_{\mathrm{sm},p}$	$\displaystyle\sim\rm{InverseGamma}(\alpha_{\rm sm},\beta_{\rm sm}),$
$\displaystyle\sigma_{\mathrm{pol},p}^{2}/10^{-6}$	$\displaystyle\sim\rm{HalfNormal}(\varsigma_{\rm pol}),$
$\displaystyle l_{\mathrm{pol},p}$	$\displaystyle\sim\rm{InverseGamma}(\alpha_{\rm pol},\beta_{\rm pol}),$

where $\{\varsigma_{\rm sm},\alpha_{\rm sm},\beta_{\rm sm},\varsigma_{\rm pol},\alpha% _{\rm pol},\beta_{\rm pol}\}$ are a set of global hyper-parameters that define the prior distribution over the foreground kernel parameters in each superpixel.

To these global hyper-parameters, we assign a set of weakly informative hyper-priors

		$\displaystyle\varsigma_{\rm sm}\sim\rm{LogNormal}(0,4),$		(20)
		$\displaystyle\alpha_{\rm sm}\sim\rm{LogNormal}(1,4),$
		$\displaystyle\beta_{\rm sm}\sim\rm{LogNormal}(0,4),$
		$\displaystyle\varsigma_{\rm pol}\sim\rm{LogNormal}(0,4),$
		$\displaystyle\alpha_{\rm pol}\sim\rm{LogNormal}(1,4),$
		$\displaystyle\beta_{\rm pol}\sim\rm{LogNormal}(0,4).$

These hyper-priors are chosen such that the potential priors over the foreground kernel parameters for each superpixel correspond broadly to the expected scales of those foreground parameters. A key feature of the hierarchical approach is that we jointly fit for these hyper-parameters, allowing us to learn the effective prior over the superpixel parameters and share information between superpixels. This allows for spatial variations in the kernel parameters, whilst regularizing those variations through the hyper-prior. This can help to reduce the tendency for NP models to overfit, particularly with very noisy data where the assumption of total independence means that there is no regularizing effect from the global hyperprior. As in the NP model, we assume a global H I kernel.

3 The 21 cm component separation simulations and computation

In this section, we describe the simulated data set used to test our GP models for 21 cm component separation, and the computational methods and setup we use for inference.

3.1 Mock data set

Our simulated sky signal consists of foreground emission, instrumental effects, and a cosmological 21 cm signal. We adopted the data set¹¹1https://www.dropbox.com/sh/9zftczeypu7xgt3/AABiiBw_0SBPrLgSHsjiISz8a?dl=0 provided by S22, which was originally used in Cunnington et al. (2021). We cut a box from the S22 data set on a grid with $(N_{x},N_{y},N_{z})=(256,256,256)$ . The corresponding box size is $(L_{x},L_{y},L_{z})=(1436,1436,1193)\>\rm Mpc$ , with a frequency range along the $z$ -axis of $899\>\rm{MHz}<\nu<1155\>\rm{MHz}$ , corresponding to a redshift range of $0.23<z<0.58$ . The frequency resolution of the simulated data set is $\delta\nu=1\>\rm MHz$ .

The simulated data serve as a conceptual model for future H I intensity mapping (IM) surveys, akin to the MeerKLASS survey (Santos et al., 2016). A flat-sky approximation is applied here, with curved-sky effects over the small survey area being negligible. In the following we summarize the simulation procedure for each of the sky components, referring the readers to Appendix A in S22 and Cunnington et al. (2021) for further details.

3.1.1 Foreground Model

The total foreground emission, $\delta T_{\rm FG}$ consists of four components,

\delta T_{\rm FG}=\delta T_{\rm sync}+\delta T_{\rm free}+\delta T_{\rm point}% +\delta T_{\rm pol},

(21)

where $\delta T_{\rm sync}$ is due to Galactic synchrotron emission, $\delta T_{\rm free}$ is due to Galactic free-free emission, $\delta T_{\rm point}$ is due to extragalactic point sources, and $\delta T_{\rm pol}$ is due to polarization leakage.

Galactic synchrotron emission is caused by cosmic-ray electrons spiraling in the Galactic magnetic field. The Planck Legacy Archive²²2http://pla.esac.esa.int/pla/ FFP10 simulations were used to generate synchrotron emission maps at 217 and 353 GHz. These simulations used the 408 MHz all-sky map with $56\ \rm arcmin$ resolution as a synchrotron emission template, enhanced to a Healpix (Górski et al., 2005) resolution of $N_{\rm side}=2048$ using Gaussian random field realizations (Haslam et al., 1981, 1982; Remazeilles et al., 2015). A spectral index map can be derived from the simulated 217 and 353 GHz maps, assuming a power-law emission model, which can then be used to extrapolate our simulated map over the observational frequency range.

Galactic free-free emission is caused by free electrons scattering off ions in an ionized gas. The simulated data set uses the 217 GHz FFP10 free-free simulation at $N_{\rm side}=2048$ . This free-free map was generated using templates from Dickinson et al. (2003), and WMAP Maximum Entropy Method (MEM) derived free-free maps. The simulated map is extrapolated over the observed frequency range using a power-law emission model with a constant spectral index over all pixels.

Extragalactic point sources, such as Active Galactic Nuclei, are included by applying a polynomial fit to 1.4 GHz radio sources as in Battye et al. (2013), and scaling to the target observational frequencies using the method described in Olivari et al. (2018).

Polarization leakage is caused by Faraday rotation of polarized synchrotron emission in the Galactic interstellar medium (Alonso et al., 2014; Cunnington et al., 2021). Some fraction of the synchrotron Stokes $Q$ and $U$ will leak into Stokes $I$ . The degree of Faraday rotation is frequency dependent, which induces additional structure on the synchrotron emission spectrum, complicating GP-based component separation methods that exploit the spectral smoothness of the foreground emission. Polarization leakage was simulated using the CRIME³³3http://intensitymapping.physics.ox.ac.uk/CRIME.html software package. This was used to produce Stokes $Q$ emission maps over the observational frequency range, subsequently assuming a polarization leakage level of 0.5% from Stokes $Q$ .

3.1.2 Instrument and Noise Model

In S22 a MeerKLASS-like (Santos et al., 2016) survey is simulated. The corresponding observational noise is assumed to be uncorrelated in frequency, and drawn from a Gaussian distribution with standard deviation at each frequency given by

\sigma_{\eta}(\nu)=T_{\rm sys}(\nu)\left(\delta_{\nu}t_{\rm tot}\frac{\Omega_{% p}}{\Omega_{a}N_{\rm dish}}\right)^{-1/2},

(22)

where $T_{\rm sys}(\nu)$ is the frequency-dependent system temperature, $\delta_{\nu}=1\,\rm{MHz}$ is the frequency resolution of the observations, $t_{\rm tot}=1000\>\rm{hours}$ is the total observing time, $\Omega_{p}$ is the pixel solid angle, $\Omega_{a}$ is the survey solid angle and $N_{\rm dish}=64$ is the number of dishes (identical to the MeerKAT survey). The noise root-mean-square (RMS) in the simulation ranges from $0.032$ to $0.040\>\rm mK$ over the observed frequency range.

The total sky signal is smoothed with a constant Gaussian beam corresponding the the angular resolution of the instrument. The beam full width at half maximum (FWHM) is given by

\theta_{\rm FWHM}=\frac{1.22c}{\nu D_{\rm dish}},

(23)

where $c$ is the speed of light, $\nu$ is the observed frequency, and $D_{\rm dish}$ is the telescope dish diameter. The telescope dish diameter is set to 15 m, consistent with MeerKAT and SKA1-MID dishes (Dewdney et al., 2019).

3.1.3 Cosmological Signal

The cosmological signal used here was obtained from the MultiDarkPlanck (MDPL2) N-body simulation (Klypin et al., 2016), which simulates $3840^{3}$ dark matter particles within a cubic volume with a size of $1\>{\rm Gpc}/h$ on each side, using the cosmological parameters of Planck 2015 (Planck Collaboration et al., 2016b). The MultiDark-SAGE catalog derived from this simulation, available on the Skies & Universe⁴⁴4skiesanduniverses.org website, was selected for our analysis. This simulation provides redshift snapshots, each representing the state of the cosmological density field and galaxies at different redshifts. For the simulated data set used in this work, the snapshot $z=0.39$ was used, assuming a redshift range of $z=0.2$ to $z=0.58$ . Galaxies in this snapshot were organized into voxels using Nearest Grid Point (NGP) assignment.

Following Cunnington et al. (2021), the H I mass was calculated from the cold gas mass of each galaxy in the catalog and combined within each voxel to form an H I intensity map. However, this approach excludes halos lighter than $10^{10}h^{-1}M_{\odot}$ . To compensate, the mean H I temperature was adjusted based on measurements from the GBTWiggleZ cross-correlation analysis (Masui et al., 2013) and the fitting in Villaescusa-Navarro et al. (2018). Each redshift slice in the H I intensity mapping simulation was then converted into an overtemperature field by subtracting the mean temperature, with the overtemperature field directly tracing the dark-matter overdensity field.

3.2 GP Component Separation Setup

GP model	Model description
abbreviation
CP2	Complete pooling model
	with 2 kernels
CP3	Complete pooling model
	with 3 kernels
HGP2	Hierarchical model
	with 2 kernels
HGP3	Hierarchical model
	with 3 kernels
NP2	No pooling model
	with 2 kernels
NP3	No pooling model
	with 3 kernels

Table 1: The model name abbreviations used throughout this work. Models with two kernels include an H I kernel and a single foreground kernel. Models with three kernels include an H I kernel and two foreground kernels, with one foreground kernel modeling smooth astrophysical emission, and the other foreground kernel modeling polarization leakage.

In Table 1 we list the model name abbreviations used in this work. As discussed in Section 2.1, we consider fully Bayesian GP models throughout this work. This requires us to perform sampling over the full GP hyperparameter posterior. In the case of the NP and HGP models, this is a very high-dimensional sampling problem, which is made tractable by using the No-U-Turn Sampler (NUTS) algorithm (Hoffman et al., 2014). The NUTS algorithm is an adaptive variant of the Hamiltonian Monte Carlo (HMC) algorithm. HMC exploits gradient information of the target posterior to generate efficient sampling proposals by following Hamiltonian trajectories through the target phase space and has excellent scaling to high dimensions. Our code can be found in this repo.

For each GP model, we use NUTS to sample the parameters for 3000 iterations and discard the first 1000 iterations as burn-in. Convergence is checked by analyzing the Gelman-Rubin $\hat{R}$ statistic (Gelman & Rubin, 1992), which compares the between-chain and within-chain variance. Values of $\hat{R}\approx 1$ are indicative of convergence, and in this work a strict requirement of $\hat{R}\leq 1.01$ is used. In addition to this, we also monitor for divergences during sampling. Divergences occur when the value of the Hamiltonian diverges whilst generating trajectories through the target phase space. This can be indicative of regions of the posterior where the curvature is such that it cannot be resolved by the sampler, and it can imply failures in geometric ergodicity (Betancourt & Girolami, 2015; Livingstone et al., 2016). Such issues are common in fully Bayesian GP and hierarchical models. We therefore ensure that zero divergences occur during sampling. Our framework is built with numpyro⁵⁵5https://github.com/pyro-ppl/numpyro (Phan et al., 2019; Bingham et al., 2019), a probabilistic programming language built on JAX⁶⁶6https://jax.readthedocs.io. After we obtain the desired hyperparameter samples, we thin the chain to 50 samples. For each sample, we recover the expected H I signal using Equation (13). Our final estimated H I signal cube is calculated from the ensemble mean of the 50 recovered H I cubes.

For the CP model, we follow the setup described in S22. For the NP model, although the foreground GP kernel parameters should be independent for every LoS, the assumption of a global H I kernel requires the full data cube to be fit simultaneously. The memory requirements of the NP model in this case render it intractable, demanding approximately 500 GB in memory. Therefore, to reduce the memory requirements of the NP model, we split the whole data set into 64 subsets of the size $(32,32,256)$ . We run 64 NP models on each subset separately, with the H I kernel being shared across every $32{\rm pix}\times 32{\rm pix}$ LoS in each subset. For each data subset, we have $\sim 4\times 10^{3}$ hyperparameters to sample for the NP3 model, a task that is rendered tractable through the use of the NUTS algorithm. Whilst we would ideally allow the foreground GP kernel parameters to vary along every LoS in the HGP models, the memory requirements of this approach also render such a model intractable using the exact GPs we consider in this work. For the HGP models, we therefore grid the data set into a set of superpixels along the LoS axis. The foreground GP kernel parameters are assumed to be constant for every LoS within a superpixel, with the H I kernel treated as global in the entire datacube. The default size of a single superpixel is set to be $16\times 16$ normal pixels, a trade-off between the accuracy and computational cost of the model (in both memory and clock time) in our experiments. The performance of HGP models with different superpixel sizes is compared in Section 5.2.

4 Results

In this section, we show the H I signal recovery results, using the various GP models discussed in Section 2.3. In Section 4.1 we discuss the recovery of the H I signal cubes in the pixel domain and the predictive performance of the models in this domain. In Section 4.2 we compare the recovery of the power spectrum and scattering transform coefficient summary statistics.

4.1 Recovered H I Image Cubes

GP model	Normalized result
CP3	1.0000
HGP3	1.0027
HGP2	1.0015
NP3^^Only 1% of the Pareto- $\hat{k}$ diagnostics are good ( $\hat{k}\leq 0.5$ ), suggesting this value is unreliable and the NP model is misspecified.	1.0146

Table 2: Leave-One-Out Cross-Validation (LOO-CV) Results: This table presents the ratio of estimated the log pointwise predictive density (

\mathrm{elpd}_{\mathrm{loo}}

) between different GP models and CP3 results. These results provide insights into each model’s predictive performance, with higher values indicating better predictive performance. The standard error on the

\mathrm{elpd}_{\mathrm{loo}}

estimates was

\sim 1\%

for all the models.

We first consider our benchmark CP models. In the second column of Figure 3 we show the recovered H I brightness temperature field at a frequency of $\nu=1019\,\rm{MHz}$ . For the CP3 model, we can see good recovery over much of the observed field. However, there is a region of visible undercleaning in the upper right corner of the field. This is further illustrated in Figure 4, where we show the recovered H I signal along a given LoS. In this case, it can be seen that the CP3 model results in a significant excess H I signal, caused by residual foregrounds after cleaning. We also show the recovered H I field for the CP2 model. In this case, we see a poor recovery of much of the sky, which is consistent with previous analyses in S22. Given the poor performance of the CP2 model, we do not consider it further in this work.

The recovered H I brightness temperature fields are shown for the NP and HGP models in the third and fourth columns of Figure 3 respectively. For the NP3 and HGP3 models we again obtain good recovery over much of the sky, albeit with superpixel boundary artifacts due to the division of the datacube. This is particularly apparent for the NP3 model, where the H I GP kernel had to be divided over the data subsets used for inference. Whilst we still obtain improved residuals and recovery of summary statistics compared to the CP3 model, these artifacts could be reduced or removed by considering memory-efficient GP approximations that would allow us to avoid the need to use data subsets or superpixels (Solin & Särkkä, 2020; Riutort-Mayol et al., 2023). We defer the analysis of potential GP approximations to future work. The visible undercleaning seen in the upper right of the H I field for the CP3 model is still apparent with the NP3 model. In contrast, the HGP3 model eliminates this region of undercleaning. This can be seen clearly in Figure 4, where the CP3 and NP3 signals are significantly undercleaned for the relevant LoS, whilst the HGP3 signal much more closely matches the true H I signal from the simulations. Whilst both the NP3 and HGP3 models allow for spatial variation in foreground kernel parameters, the NP3 model lacks any regularization from the global hyper-prior in the HGP3 model, and does not share information between different LoS when fitting the foreground signal. The extra freedom of the NP model and lack of regularization can result in overfitting in reaction to noisy and/or anomalous data.

The HGP2 model is able to recover the H I signal better than the CP2 model, although with a degraded performance compared to the HGP3 model. The HGP2 model has more freedom than the CP2 model, by allowing foreground kernel parameters to vary spatially, whilst regularizing that variation through the global hyper-prior, resulting in a model that is more robust to misspecification in this instance. We were unable to reach convergence with the NP2 model after extensive runs with the NUTS algorithm. This is not unexpected, given that the NP model does not share information across all LoS, which can result in convergence challenges for the misspecified NP2 model. We therefore do not consider the NP2 results further in this work.

In Figure 5 we show the pixel-level residual distributions for the CP3 model, the NP3 model, and the HGP2 and HGP3 models. We find that the NP3 model has the narrowest residual distribution, followed by the HGP3 model, the CP3 model, and the HGP2 model respectively. However, the total residual distribution alone is not wholly indicative of the model performance. Indeed, it was clear that there were regions of the sky where both the NP3 and CP3 models struggled to recover the correct H I signal, whilst the HGP models could. These features are hidden in the residual distribution for all pixels in the data set. We also see that the HGP2 residual distribution is wider than that of the CP3 model, demonstrating the importance of including all relevant kernel contributions in foreground modeling. In Figure 6 we also show the recovered $1\sigma$ uncertainty on the recovered H I signal for the HGP3 and NP3 models at a given frequency slice, estimated by evaluating the ensemble variance of the posterior predictive H I signal samples. The typical uncertainty on the recovered NP3 signal is about ten times larger than the uncertainty on the signal recovered by the HGP3 model. This is to be expected, given that the NP3 model has a larger number of free parameters that are not regularized through a global hyper-prior, resulting in a noisier signal recovery. Indeed, such behavior is typical of hierarchical models compared to no-pooling models (Gelman & Hill, 2006).

To make a more rigorous comparison of the model performance, we consider the leave-one-out cross-validation (LOO-CV) statistics (Vehtari et al., 2017). LOO-CV is a widely used method in statistical modeling and machine learning for evaluating the predictive performance of models. In addition to model performance, LOO-CV can also provide insights into the stability and robustness of a model. Given some data set $y=\{y_{1},\ldots,y_{N}\}$ , and a model with parameters $\theta$ , LOO-CV seeks to estimate the leave-one-out log predictive density of the model,

\mathrm{elpd}_{\mathrm{loo}}=\sum_{i=1}^{N}\log p(y_{i}|y_{-i}),

(24)

where $y_{-i}$ is the data set left after removing the datapoint $y_{i}$ . The pointwise predictive density is given by

p(y_{i}|y_{-i})=\int p(y_{i}|\theta)p(\theta|y_{-i})\mathrm{d}\theta.

(25)

An estimator for $\rm{elpd}_{\rm loo}$ can be constructed using posterior samples and Pareto smoothed importance sampling (Vehtari et al., 2015). In essence, $\rm{elpd}_{\rm loo}$ evaluates the performance of the model in predicting held out data.

In Table 2 we show the $\rm{elpd}_{\rm loo}$ values for the CP3, NP3, HGP3 and HGP2 models, all normalized by the $\rm{elpd}_{\rm loo}$ value for the CP3 model. The NP3, HGP3 and HGP2 models all have higher $\rm{elpd}_{\rm loo}$ values than the CP3 model. Whilst the NP3 model has a higher value for $\rm{elpd}_{\rm loo}$ than the HGP3 and HGP2 models, this estimate should be treated with caution. Only $\sim 1\%$ of the Pareto- $\hat{k}$ statistics are less than 0.5 for the NP3 model. The Pareto- $\hat{k}$ statistic can be used as a diagnostic for the variance of the importance sampling estimator used to evaluate $\rm{elpd}_{\rm loo}$ , with values of $\hat{k}<0.5$ being necessary to ensure finite variance of the importance ratios and that the central limit theorem holds for estimating $\rm{elpd}_{\rm loo}$ (Vehtari et al., 2015). Therefore, the $\rm{elpd}_{\rm loo}$ values for the NP model should be viewed as unreliable. This behavior commonly manifests as a result of strong model misspecification. This could be a result of assumptions such as the total independence of foreground emission along every LoS in the NP model. In reality, we know foreground emission will have spatial correlations. Furthermore, the division of the full data set into subsets due to the memory limitations means that the H I kernel was treated independently for each data subset, which does not conform to our expectations for the cosmological signal.

4.2 Summary Statistics Recovery

In this section, we consider the recovery of summary statistics from the foreground-cleaned H I maps. In Section 4.2.1 we study the power spectrum (PS) recovery, and in Section 4.2.2 we examine the recovery of scattering transform coefficients.

4.2.1 The 21 cm Power Spectrum

The 21 cm PS is perhaps the most well-studied summary statistic, capturing rich cosmological and astrophysical information (e.g., Zaldarriaga et al., 2004; Greig & Mesinger, 2017). Here we investigated the PS recovery from the foreground cleaned H I maps, showing the recovered PS in Figure 7. For the spherically averaged PS, the smallest relative error is obtained with the HGP2 model. The NP3 model achieves better recovery than the HGP3 model, which obtains results comparable to the CP3 model.

For the 1D PS along LoS, all four models attain comparable recovery on small scales, with differences only being apparent on large scales. The NP3 model achieves the lowest relative error in this case, and the CP3 and HGP3 models have comparable relative errors on large scales. The HGP2 model significantly overestimates the PS on large scales, indicating significant uncleaned foreground components in the recovered H I signal. Using one foreground kernel often fails to describe the polarization leakage effects, which have a different frequency behavior compared to the smooth astrophysical foregrounds in total intensity.

For the 2D transverse PS, all methods underestimated the power on scales $k_{\perp}\lesssim 0.1h/\rm Mpc$ , indicating mismodeling of foreground spatial variations. The HGP2 model gives the best recovery on the scales $k_{\perp}\lesssim 0.1h/\rm Mpc$ , indicating that it was better able to preserve Gaussian spatial fluctuations on these scales. The HGP3 and CP3 models attain comparable relative error in this case, with the NP3 model attaining a smaller relative error. This would suggest that the regularization through the hyper-prior with the HGP3 model, and the complete pooling of all foreground kernel parameters for the CP3 model, act to reduce spatial fluctuations compared to the NP3 model. On smaller scales, the HGP3 and CP3 models attain smaller relative errors compared to the HGP2 and NP3 models. However, on these small scales, the instrumental beam suppresses the PS close to zero, with the limitations of floating point accuracy making comparisons through the relative error unreliable.

Whilst the PS recovery seen here would indicate good performance for the HGP2 and NP3 models, this should be treated with some caution, given that these PS estimates have not undergone any bias correction. This issue is considered in more detail in Section 5.1. However, it is clear from the recovered 2D transverse PS that all methods fail to properly describe spatial variations in the foreground signal. Whilst the NP3 model allows for greater freedom, and hence spatial variations, this method still mismodels the true foreground emission by assuming total independence between every LoS, as discussed previously in the context of the LOO-CV results. Whilst the HGP3 model allows for spatial variations in the foreground emission kernels, it is assumed that these kernel parameters are all drawn from some global hyper-prior. This is an overly simplistic model. In reality, we would expect foreground kernel parameters to display local spatial correlations. An improved spatial prior should therefore account for these local correlations. Given that modeling local spatial correlations introduces significant additional computational challenges, we leave an exploration of such models to future work.

4.2.2 Wavelet Scattering Transform Coefficients

The PS characterizes Gaussian information in the relevant field. However, the full H I field is non-Gaussian, with any non-Gaussian information beyond the PS summary. Recently, the wavelet scattering transformation (WST) has emerged as a new approach to extract information from cosmological fields (e.g. Cheng et al., 2020; Greig et al., 2023; Dai & Seljak, 2023) and is more informative than the poly-spectrum in Fourier space (Sui et al., 2023; Cheng & Ménard, 2021). To examine the recovery of non-Gaussian summary statistics, we calculate the three-dimensional solid harmonic WST coefficients (e.g. Zhao et al., 2023; Eickenberg et al., 2018) of the recovered H I signal cubes.

We briefly summarize the solid harmonic WST here. The solid harmonic WST convolves the original field $\mathbf{d}(\mathbf{x})$ with a cascade of solid harmonic wavelets defined as

		$\displaystyle\psi_{\ell}^{m}(\mathbf{x})=\frac{1}{(\sqrt{2\pi})^{3}}e^{-\|x\|^{2% }/2}\|\mathbf{x}\|^{\ell}Y_{\ell}^{m}\left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right),$		(26)
		$\displaystyle\psi_{j,\ell}^{m}(\mathbf{x})=2^{-3j}\psi_{\ell}^{m}\left(2^{-j}% \mathbf{x}\right),$		(26)

where $\mathbf{x}$ is the spatial coordinates, $Y_{\ell}^{m}$ is the three-dimensional spherical harmonic function and $\psi_{j,\ell}^{m}(\mathbf{x})$ is the convolution kernel parameterized by $j,\ell$ and $m$ . Here, $j$ gives the spatial scale of the kernel as $2^{j}$ , whilst $\ell$ and $m$ characterize the angular scale. Nonlinear moduli are applied to the convolved field as

U[j,\ell]\mathbf{d}(\mathbf{x})=\left(\sum_{m=-\ell}^{\ell}\left|\mathbf{d}*% \psi_{j,\ell}^{m}(\mathbf{x})\right|^{2}\right)^{1/2}\,,

(27)

where the field $\mathbf{d}(\mathbf{x})$ is convolved (denoted by “ $*$ ”) with $\psi_{j,\ell}^{m}(\mathbf{x})$ . The first-order solid harmonic WST coefficients (hereafter first-order ST coefficients) are obtained by integrating the field as

S_{1}\left[\mathbf{d};j,\ell,q\right]=\int_{\mathbb{R}^{3}}|U[j,\ell]\mathbf{d% }(\mathbf{x})|^{q}\mathrm{d}^{3}\mathbf{x}.

(28)

In order to capture information across multiple scales, the first-order solid harmonic WST field $U[j,\ell]\mathbf{d}(\mathbf{x})$ is convolved again with a kernel with different $j^{\prime}>j$ but with the same $\ell$ . The second-order ST coefficients are then given by

S_{2}\left[\mathbf{d};j,j^{\prime},\ell,q\right]=\int_{\mathbb{R}^{3}}\left|U[% j^{\prime},\ell]U[j,\ell]\mathbf{d}(\mathbf{x})\right|^{q}\mathrm{d}^{3}% \mathbf{x},\quad j<j^{\prime}.

(29)

In our analysis, we choose $j_{\max}=6$ to explore features on small to large scales, select $\ell_{\max}=6$ to explore different angular directions, and set $q=1$ in order to characterize the convolved and modulated field without any amplification. The ST coefficients are calculated using Kymatio⁷⁷7https://www.kymat.io/ (Andreux et al., 2018). The ST coefficients are averaged over the angular scales $\ell$ , as the averaged ST coefficients are informative (Zhao et al., 2023).

The recovered ST coefficients obtained with the different GP models are shown in Figure 8. For the first-order ST coefficients $S_{1}$ , the HGP2 model obtains the smallest relative error on small scales when $j\leq 4$ . The relative errors obtained with the NP3 and HGP3 models are worse than the HGP2 model, albeit better than those obtained with the CP3 model. For $j=5$ , the NP3 model achieves the smallest relative error, with the CP3 model having the worst relative error. For the largest scale $j=6$ , the HGP2 and NP3 models have almost identical recovery, attaining the smallest relative error, with the HGP3 model still having a smaller relative error than the CP3 model.

For the second-order ST coefficients $S_{2}$ , the results follow a similar pattern. For small-scale first-order maps with $j\leq 2$ , the HGP2 first-order maps are almost identical to the true-field first-order map. For $j\geq 4$ the NP3 model has the smallest relative error, with the worst relative error obtained by the CP3 model for all scales.

As with the PS estimates presented in Section 4.2.1, the ST coefficients show good results for the HGP2 model and NP3 model, with the HGP3 model also having better recovery of non-Gaussian summary statistics than the CP3 model. In making complete comparisons of the model performance in recovering the full non-Gaussian field, the LOO-CV summaries in Section 4.1 should be considered with the discussion therein. However, the recovered ST coefficients do provide a useful check on the non-Gaussian information recovery. As discussed with the PS recovery, the performance of the HGP3 model would be improved by using a more realistic spatial prior, with the global hyperprior used in this work being an incomplete model of foreground spatial correlations.

5 Discussions

In this section, we consider issues related to the cosmological signal recovery along with modeling and computational choices. In Section 5.1, we study the effect of bias corrections on power spectrum recovery. In Section 5.2, we discuss the choice of superpixel size used for the HGP2 and HGP3 models. In Section 5.3, we compare the H I signal recovery when using MAP-based inference of the kernel hyperparameters compared to the fully Bayesian approach we use throughout this work.

5.1 Bias Correction

As discussed in Section 2, the 21 cm signal is recovered by subtracting $\mathbb{E}[f_{\rm fg}]$ from the total observed signal, whilst the foreground covariance $\mathrm{cov}[f_{\rm{fg}}]$ is ignored. However, Mertens et al. (2020) shows that this can result in a bias in the recovery as

	$\displaystyle\left<f_{\rm 21cm,\,est}f_{\rm 21cm,\,est}^{T}\right>$	$\displaystyle=$		(30)
	$\displaystyle\left<f_{\rm 21cm,\,est}\right.$	$\displaystyle\left.f_{\rm 21cm,\,est}^{T}\right>_{\rm unbiasd}-{\rm cov}[f_{% \rm fg}].$		(30)

Mertens et al. (2020) shows that to compensate, an additive bias correction for the radial PS can be performed with the Monte Carlo procedure as follows.

1.

Generate several random samples from a multivariate Gaussian distribution, with zero mean and a covariance given by ${\rm cov}[f_{\rm fg}]$ . Each sample has the same size as our datacube.
2.

Calculate and average the radial PS over these realizations.
3.

Add the averaged PS to the residual PS obtained with GP foreground subtraction.

The procedure for the spherically averaged PS is similar, except that in the second step the calculated radial PS is binned into the $k$ -bins. This ignores the $k_{\perp}$ modes when estimating the bias for the spherically averaged PS. For the modeling considered in this paper, the covariance matrix $\mathrm{cov}[f_{\rm{fg}}]$ does not contain information regarding the foreground signal covariance in the transverse direction (S22).

In this work, we adopt the method described in Mertens et al. (2020) to examine the effects of bias correction. However, Kern & Liu (2021) highlighted the potential failure of this correction in accurately retrieving the true 21 cm PS from the EoR due to misestimation of the EoR signal covariance, and instead proposed a multiplicative correction that cannot under-predict the EoR power spectrum, but this approach requires perfect knowledge of the foreground covariance. However, in this work our methods are based on the idealized simulations used in S22, which justifies our estimation of the data covariance. We defer a detailed exploration of the corrections proposed by Kern & Liu (2021) to future work.

For the HGP and NP models, we adopted multiple covariance matrices to model the foreground emission. We should therefore calculate the bias correction for superpixels along each LoS (for HGP) or for pixels (for NP) independently. However, in practice small values of ${\rm cov}[f_{\rm fg}]$ require better computational precision than double precision. This significantly slows the calculation down, making a full bias correction for the NP model intractable with our available computational resources. For simplicity, we calculate a global ${\rm cov}[f_{\rm fg}]$ with averaged hyperparameters as an approximation for the NP3 model and leave the analysis of full corrections to future work.

The results of the PS bias correction are shown in Figure 9. On scales $k\gtrsim 0.05\,h/\rm Mpc$ , the spherically averaged PS for all models are in agreement, and the effect of bias correction is negligible. However, the bias correction results in significant overestimation on larger scales. This correction is most prominent for the CP3 model and least significant for the HGP3 model. The impact of bias correction can be seen more directly in the radial PS, where bias correction results in overestimation of the radial PS with the HGP2, NP3, and CP3 models on all scales. This is not the case for the HGP3 model, where the bias correction is again small, and we do not observe an overestimation of the radial PS on large scales. The overcorrection through the $k_{\parallel}$ modes applied to the HGP2, NP3, and CP3 models manifests itself as an apparently improved recovery of the spherically averaged PS on intermediate scales. However, in this case, the over-correction applied to the $k_{\parallel}$ modes acts against the poor recovery of the $k_{\perp}$ modes for all methods on these scales. As discussed previously, further improvements in PS recovery, particularly for transverse PS, can be made by considering more physically motivated spatial priors.

5.2 Choice of Superpixel Size

To study the impact of choosing different sizes of a single superpixel for the HGP models, we ran the HGP3 with a single superpixel that contains from $8\times 8$ normal pixels to $64\times 64$ normal pixels. The configuration of a superpixel with $8\times 8$ normal pixels corresponds to the minimum superpixel size that fits our memory constraint because a smaller superpixel size means a larger number of superpixels, each of which has a set of free GP kernel parameters. In Figure 10 we show the residual distributions obtained with various superpixel sizes. The residual distributions are broadly the same for all superpixel sizes we test here. In this work, we therefore select a superpixel size of $16\times 16$ normal pixels, a trade-off between the computational demands of the model and allowing the model to account for more detailed spatial variations in the foreground emission. In future work, it will be interesting to consider GP approximations that would allow us to remove the need for superpixels, and hence remove any superpixel boundary effects (Solin & Särkkä, 2020; Riutort-Mayol et al., 2023).

5.3 Maximum-A-Posteriori inference vs. sampling

Throughout this work we used a fully Bayesian approach, sampling the GP posterior and calculating posterior expectations. Whilst this is computationally challenging, it has been found to result in improved inference in GP models, and in Bayesian models more generally (MacKay, 2003; Betancourt, 2017; Lalchand & Rasmussen, 2020). To compare the impact of Maximum-A-Posteriori (MAP) inference with full posterior sampling in our case, we consider performing MAP optimization of the GP kernel parameters for the HGP3 model, and compare the recovered H I signal cubes with those obtained through sampling. The residual distributions for the two approaches are shown in Figure 11, alongside the residual distribution for the CP3 model obtained with full sampling. We find that using MAP optimization to obtain a point estimate of the GP kernel parameters results in a residual distribution that is even broader than the CP3 residual distribution. This is consistent with the known properties of the MAP estimators. For the complex, high-dimensional geometry of the HGP3 model, we expect that the MAP estimate of the kernel parameters is far from the typical set, where the posterior probability mass is concentrated (Betancourt, 2017).

6 Conclusions

In this work, we consider the impact of allowing for spatial variations in GP kernel parameters when modeling foreground emission in 21 cm component separation analyses. For this purpose, we consider a range of natural modeling variations that allow for different levels of spatial variation. As our baseline, we consider the CP approach adopted in previous analyses, where the foreground GP kernel parameters are assumed to be identical for every LoS. At the next level of spatial variation, we consider the HGP models, where foreground kernel parameters were allowed to vary over a set of superpixels, each with a size of $16\times 16$ normal pixels, with the kernel parameters in turn being regularized through a global hyper-prior. Finally, we consider the NP models, where foreground kernel parameters are assumed to be completely independent along every LoS, with no regularization from any hyper-prior. In all cases, we consider models with two GP kernels, abbreviated as CP2, HGP2 and NP2, and models with three GP kernels, abbreviated as CP3, HGP3 and NP3, respectively. The two-kernel models assume one GP kernel for the H I emission and one GP kernel for the foreground emission. The three-kernel models have an additional foreground kernel to model polarization leakage.

We test the performance of these models against the simulated MeerKLASS-like observations, with a focus on the recovery of the H I signal cubes. At the level of pixels, we find that both HGP3 and NP3 models have smaller residual distributions than the CP3 model, e.g. the standard deviation of the NP3 residual distribution is approximately $30\%$ smaller than for the CP3 residuals. The CP2 model fails to accurately recover the H I signal over the sky, and the NP2 model fails to converge. In these cases, the misspecification of the two-kernel models leads to failures in foreground modeling. The HGP2 model can converge, with a comparable residual distribution to the CP3 model. In this case, the hierarchical model is robust to model misspecification.

Whilst the NP3 model has the sharpest residual distribution, there are still areas of the sky where the foreground spectrum is incorrectly modeled, similar to the CP3 model. These areas of poor signal recovery are eliminated by the HGP3 model. A more rigorous model comparison is made by performing a LOO-CV analysis and calculating the expected log predictive density of held-out data, $\mathrm{elpd}_{\mathrm{loo}}$ . The NP3 model has the largest value of $\mathrm{elpd}_{\mathrm{loo}}$ , followed by the HGP3, HGP2, and CP3 models. Whilst this seems to indicate that the NP3 model has the best predictive performance, the Pareto- $\hat{k}$ statistics indicate that the importance weighting procedure used to estimate $\mathrm{elpd}_{\mathrm{loo}}$ for the NP3 model is unreliable. This is often caused by serious model misspecification. In the case of the NP3 model, the assumption of complete independence of foreground kernel parameters along every LoS is unrealistic. Moreover, memory constraint means that the full datacube has to be divided into data subsets, with the H I kernel treated independently in each subset. The lack of regularization from a hyper-prior also means that the NP3 model is prone to overfitting. The Pareto- $\hat{k}$ statistics for the remaining models are good, with the best predictive performance obtained with the HGP3 model.

In addition to the LOO-CV analysis, we also examine the recovery of summary statistics for the H I field. For the PS recovery, the results for all methods agree on small scales ( $k\gtrsim 0.1\,h/\rm{Mpc}$ ) for spherically averaged PS and radial PS, with percentage level accuracy. At large scales, the relative error increases to about $\mathcal{O}(10\%)$ . We also investigate the effect of bias correction applied through the $k_{\parallel}$ modes. Its impact is apparent in the bias-corrected radial PS, where all models except the HGP3 model overestimate the radial PS on all scales. The transverse PS is underestimated by all methods on scales $k_{\perp}\lesssim 0.1\,h/\rm{Mpc}$ , which is caused by simplified modeling of foreground kernel spatial variations. The NP3, HGP3, and HGP2 models all attain better recovery of ST coefficients compared to the CP3 model, which indicates better recovery of non-Gaussian features in the H I field.

All these results demonstrate the importance of accounting for spatial variations in foreground emission. Indeed, the HGP2 model outperforms the CP3 model in predictive performance and recovery of H I field summary statistics, despite the kernel model being misspecified. However, both NP and HGP models use highly simplified models for these spatial variations, assuming either complete independence or variation through a global hyper-prior. Improving the signal recovery requires more physically motivated spatial priors that can account for local correlations in the foreground emission. This will enable improved recovery of the transverse PS and ST coefficients. However, improved spatial modeling will likely result in significant computational demands in both memory and clock time. To address this, it will be interesting to consider memory-efficient alternatives to exact GPs (Solin & Särkkä, 2020; Riutort-Mayol et al., 2023), and also more efficient inference algorithms to integrate over the high-dimensional target distribution (Robnik et al., 2022). In addition to using better motivated spatial priors, performance improvements can also be obtained through improved foreground frequency covariance modeling (Mertens et al., 2024). Alongside these modeling considerations, in future work we will test our method against more realistic mock observations e.g., considering improved noise modeling and complications from radio frequency interference.

Acknowledgments

This work is supported by the National SKA Program of China (grant No. 2020SKA0110401) and NSFC (grant No. 12250410240 and 11821303). We thank Zhaoting Chen, Meng Zhou, Ce Sui, Siyi Zhao and Yidong Xu for their useful discussions and help. We acknowledge the Tsinghua Astrophysics High-Performance Computing platform at Tsinghua University for providing computational and data storage resources that have contributed to the research results reported within this paper.

References

Alonso et al. (2015) Alonso, D., Bull, P., Ferreira, P. G., & Santos, M. G. 2015, MNRAS, 447, 400, doi: 10.1093/mnras/stu2474
Alonso et al. (2014) Alonso, D., Ferreira, P. G., & Santos, M. G. 2014, MNRAS, 444, 3183, doi: 10.1093/mnras/stu1666
Andreux et al. (2018) Andreux, M., Angles, T., Exarchakis, G., et al. 2018, arXiv e-prints, arXiv:1812.11214, doi: 10.48550/arXiv.1812.11214
Battye et al. (2013) Battye, R. A., Browne, I. W. A., Dickinson, C., et al. 2013, MNRAS, 434, 1239, doi: 10.1093/mnras/stt1082
Battye et al. (2004) Battye, R. A., Davies, R. D., & Weller, J. 2004, MNRAS, 355, 1339, doi: 10.1111/j.1365-2966.2004.08416.x
Bernardi et al. (2009) Bernardi, G., de Bruyn, A. G., Brentjens, M. A., et al. 2009, A&A, 500, 965, doi: 10.1051/0004-6361/200911627
Bernardi et al. (2010) Bernardi, G., de Bruyn, A. G., Harker, G., et al. 2010, A&A, 522, A67, doi: 10.1051/0004-6361/200913420
Betancourt (2017) Betancourt, M. 2017, arXiv preprint arXiv:1701.02434
Betancourt & Girolami (2015) Betancourt, M., & Girolami, M. 2015, Current trends in Bayesian methodology with applications, 79, 2
Bharadwaj et al. (2001) Bharadwaj, S., Nath, B. B., & Sethi, S. K. 2001, Journal of Astrophysics and Astronomy, 22, 21, doi: 10.1007/BF02933588
Bigot-Sazy et al. (2015) Bigot-Sazy, M. A., Dickinson, C., Battye, R. A., et al. 2015, MNRAS, 454, 3240, doi: 10.1093/mnras/stv2153
Bingham et al. (2019) Bingham, E., Chen, J. P., Jankowiak, M., et al. 2019, J. Mach. Learn. Res., 20, 28:1. http://jmlr.org/papers/v20/18-403.html
Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., et al. 2018, JAX: composable transformations of Python+NumPy programs, 0.3.13. http://github.com/google/jax
Chang et al. (2008) Chang, T.-C., Pen, U.-L., Peterson, J. B., & McDonald, P. 2008, Phys. Rev. Lett., 100, 091303, doi: 10.1103/PhysRevLett.100.091303
Cheng & Ménard (2021) Cheng, S., & Ménard, B. 2021, arXiv e-prints, arXiv:2112.01288, doi: 10.48550/arXiv.2112.01288
Cheng et al. (2020) Cheng, S., Ting, Y.-S., Ménard, B., & Bruna, J. 2020, MNRAS, 499, 5902, doi: 10.1093/mnras/staa3165
Chluba et al. (2017) Chluba, J., Hill, J. C., & Abitbol, M. H. 2017, MNRAS, 472, 1195, doi: 10.1093/mnras/stx1982
Cunnington et al. (2021) Cunnington, S., Irfan, M. O., Carucci, I. P., Pourtsidou, A., & Bobin, J. 2021, MNRAS, 504, 208, doi: 10.1093/mnras/stab856
Dai & Seljak (2023) Dai, B., & Seljak, U. 2023, arXiv e-prints, arXiv:2306.04689, doi: 10.48550/arXiv.2306.04689
Dewdney et al. (2019) Dewdney, P., et al. 2019, SKA1 design baseline description, Tech. rep., SKA-TEL-SKO-0001075, internal SKA document
Di Matteo et al. (2002) Di Matteo, T., Perna, R., Abel, T., & Rees, M. J. 2002, ApJ, 564, 576, doi: 10.1086/324293
Dickinson et al. (2003) Dickinson, C., Davies, R. D., & Davis, R. J. 2003, MNRAS, 341, 369, doi: 10.1046/j.1365-8711.2003.06439.x
Eickenberg et al. (2018) Eickenberg, M., Exarchakis, G., Hirn, M., Mallat, S., & Thiry, L. 2018, J. Chem. Phys., 148, 241732, doi: 10.1063/1.5023798
Eriksen et al. (2008) Eriksen, H. K., Jewell, J. B., Dickinson, C., et al. 2008, ApJ, 676, 10, doi: 10.1086/525277
Flaxman et al. (2015) Flaxman, S., Gelman, A., Neill, D., et al. 2015, Manuscript in preparation
Furlanetto (2016) Furlanetto, S. R. 2016, The 21-cm Line as a Probe of Reionization, ed. A. Mesinger (Cham: Springer International Publishing), 247–280, doi: 10.1007/978-3-319-21957-8_9
Gabry et al. (2019) Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., & Gelman, A. 2019, Journal of the Royal Statistical Society Series A: Statistics in Society, 182, 389
Gehlot et al. (2019) Gehlot, B. K., Mertens, F. G., Koopmans, L. V. E., et al. 2019, MNRAS, 488, 4271, doi: 10.1093/mnras/stz1937
Gelman (2006) Gelman, A. 2006, Technometrics, 48, 432, doi: 10.1198/004017005000000661
Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., et al. 2013, Bayesian data analysis, 3rd edn., Chapman & Hall/CRC Texts in Statistical Science Series (Boca Raton, Florida: CRC). https://www.worldcat.org/title/bayesian-data-analysis/oclc/966614951?referer=br&ht=edition
Gelman & Hennig (2017) Gelman, A., & Hennig, C. 2017, Journal of the Royal Statistical Society Series A: Statistics in Society, 180, 967
Gelman & Hill (2006) Gelman, A., & Hill, J. 2006, Data analysis using regression and multilevel/hierarchical models (Cambridge university press)
Gelman & Rubin (1992) Gelman, A., & Rubin, D. B. 1992, Statistical Science, 7, 457, doi: 10.1214/ss/1177011136
Gelman et al. (2017) Gelman, A., Simpson, D., & Betancourt, M. 2017, Entropy, 19, 555, doi: 10.3390/e19100555
Ghosh et al. (2020) Ghosh, A., Mertens, F., Bernardi, G., et al. 2020, MNRAS, 495, 2813, doi: 10.1093/mnras/staa1331
Górski et al. (2005) Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759, doi: 10.1086/427976
Greig & Mesinger (2017) Greig, B., & Mesinger, A. 2017, MNRAS, 472, 2651, doi: 10.1093/mnras/stx2118
Greig et al. (2023) Greig, B., Ting, Y.-S., & Kaurov, A. A. 2023, MNRAS, 519, 5288, doi: 10.1093/mnras/stac3822
Grumitt et al. (2020) Grumitt, R. D. P., Jew, L. R. P., & Dickinson, C. 2020, MNRAS, 496, 4383, doi: 10.1093/mnras/staa1857
Haslam et al. (1981) Haslam, C. G. T., Klein, U., Salter, C. J., et al. 1981, A&A, 100, 209
Haslam et al. (1982) Haslam, C. G. T., Salter, C. J., Stoffel, H., & Wilson, W. E. 1982, A&AS, 47, 1
Hibbard et al. (2023) Hibbard, J. J., Rapetti, D., Burns, J. O., Mahesh, N., & Bassett, N. 2023, arXiv e-prints, arXiv:2304.09959, doi: 10.48550/arXiv.2304.09959
Hoffman et al. (2019) Hoffman, M., Sountsov, P., Dillon, J. V., et al. 2019, arXiv preprint arXiv:1903.03704
Hoffman et al. (2014) Hoffman, M. D., Gelman, A., et al. 2014, J. Mach. Learn. Res., 15, 1593
Hothi et al. (2021) Hothi, I., Chapman, E., Pritchard, J. R., et al. 2021, MNRAS, 500, 2264, doi: 10.1093/mnras/staa3446
Jelić et al. (2008) Jelić, V., Zaroubi, S., Labropoulos, P., et al. 2008, MNRAS, 389, 1319, doi: 10.1111/j.1365-2966.2008.13634.x
Jew & Grumitt (2020) Jew, L., & Grumitt, R. D. P. 2020, MNRAS, 495, 578, doi: 10.1093/mnras/staa1233
Kern & Liu (2021) Kern, N. S., & Liu, A. 2021, MNRAS, 501, 1463, doi: 10.1093/mnras/staa3736
Klypin et al. (2016) Klypin, A., Yepes, G., Gottlöber, S., Prada, F., & Heß, S. 2016, MNRAS, 457, 4340
Kumar et al. (2019) Kumar, R., Carroll, C., Hartikainen, A., & Martin, O. 2019, Journal of Open Source Software, 4, 1143, doi: 10.21105/joss.01143
Lalchand & Rasmussen (2020) Lalchand, V., & Rasmussen, C. E. 2020, in Symposium on Advances in Approximate Bayesian Inference, PMLR, 1–12
Liu & Shaw (2020) Liu, A., & Shaw, J. R. 2020, PASP, 132, 062001, doi: 10.1088/1538-3873/ab5bfd
Liu & Tegmark (2011) Liu, A., & Tegmark, M. 2011, Phys. Rev. D, 83, 103006, doi: 10.1103/PhysRevD.83.103006
Liu & Tegmark (2012) —. 2012, MNRAS, 419, 3491, doi: 10.1111/j.1365-2966.2011.19989.x
Livingstone et al. (2016) Livingstone, S., Betancourt, M., Byrne, S., & Girolami, M. 2016, Bernoulli, 25, doi: 10.3150/18-BEJ1083
MacKay (2003) MacKay, D. 2003, Information Theory, Inference and Learning Algorithms (Cambridge University Press). https://books.google.co.uk/books?id=AKuMj4PN_EMC
Masui et al. (2013) Masui, K. W., Switzer, E. R., Banavar, N., et al. 2013, ApJ, 763, L20, doi: 10.1088/2041-8205/763/1/L20
Mertens et al. (2024) Mertens, F. G., Bobin, J., & Carucci, I. P. 2024, MNRAS, 527, 3517, doi: 10.1093/mnras/stad3430
Mertens et al. (2018) Mertens, F. G., Ghosh, A., & Koopmans, L. V. E. 2018, MNRAS, 478, 3640, doi: 10.1093/mnras/sty1207
Mertens et al. (2020) Mertens, F. G., Mevius, M., Koopmans, L. V. E., et al. 2020, MNRAS, 493, 1662, doi: 10.1093/mnras/staa327
Offringa et al. (2019) Offringa, A. R., Mertens, F., & Koopmans, L. V. E. 2019, MNRAS, 484, 2866, doi: 10.1093/mnras/stz175
Oh & Mack (2003) Oh, S. P., & Mack, K. J. 2003, MNRAS, 346, 871, doi: 10.1111/j.1365-2966.2003.07133.x
Olivari et al. (2018) Olivari, L. C., Dickinson, C., Battye, R. A., et al. 2018, MNRAS, 473, 4242, doi: 10.1093/mnras/stx2621
Phan et al. (2019) Phan, D., Pradhan, N., & Jankowiak, M. 2019, arXiv preprint arXiv:1912.11554
Planck Collaboration et al. (2016a) Planck Collaboration, Adam, R., Ade, P. A. R., et al. 2016a, A&A, 594, A10, doi: 10.1051/0004-6361/201525967
Planck Collaboration et al. (2016b) Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 2016b, A&A, 594, A13, doi: 10.1051/0004-6361/201525830
Pritchard & Loeb (2012) Pritchard, J. R., & Loeb, A. 2012, Reports on Progress in Physics, 75, 086901, doi: 10.1088/0034-4885/75/8/086901
Rasmussen & Williams (2006) Rasmussen, C. E., & Williams, C. K. I. 2006, Gaussian processes for machine learning., Adaptive computation and machine learning (MIT Press), I–XVIII, 1–248
Remazeilles et al. (2015) Remazeilles, M., Dickinson, C., Banday, A. J., Bigot-Sazy, M. A., & Ghosh, T. 2015, MNRAS, 451, 4311, doi: 10.1093/mnras/stv1274
Remazeilles et al. (2016) Remazeilles, M., Dickinson, C., Eriksen, H. K. K., & Wehus, I. K. 2016, MNRAS, 458, 2032, doi: 10.1093/mnras/stw441
Remazeilles et al. (2018) Remazeilles, M., Banday, A. J., Baccigalupi, C., et al. 2018, J. Cosmology Astropart. Phys, 2018, 023, doi: 10.1088/1475-7516/2018/04/023
Riutort-Mayol et al. (2023) Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A., & Vehtari, A. 2023, Statistics and Computing, 33, 17
Robnik et al. (2022) Robnik, J., De Luca, G. B., Silverstein, E., & Seljak, U. 2022, arXiv e-prints, arXiv:2212.08549, doi: 10.48550/arXiv.2212.08549
Ruiz-Zapatero et al. (2022) Ruiz-Zapatero, J., García-García, C., Alonso, D., Ferreira, P. G., & Grumitt, R. D. P. 2022, MNRAS, 512, 1967, doi: 10.1093/mnras/stac431
Santos et al. (2016) Santos, M., Bull, P., Camera, S., et al. 2016, in MeerKAT Science: On the Pathway to the SKA, 32, doi: 10.22323/1.277.0032
Shafieloo et al. (2012) Shafieloo, A., Kim, A. G., & Linder, E. V. 2012, Phys. Rev. D, 85, 123530, doi: 10.1103/PhysRevD.85.123530
Shaw et al. (2015) Shaw, J. R., Sigurdson, K., Sitwell, M., Stebbins, A., & Pen, U.-L. 2015, Phys. Rev. D, 91, 083514, doi: 10.1103/PhysRevD.91.083514
Soares et al. (2022) Soares, P. S., Watkinson, C. A., Cunnington, S., & Pourtsidou, A. 2022, MNRAS, 510, 5872, doi: 10.1093/mnras/stab2594
Solin & Särkkä (2020) Solin, A., & Särkkä, S. 2020, Statistics and Computing, 30, 419
Stan Development Team (2012) Stan Development Team. 2012, Stan Modeling Language User’s Guide and Reference Manual, Version 1.0. http://mc-stan.org/
Sui et al. (2023) Sui, C., Zhao, X., Jing, T., & Mao, Y. 2023, arXiv e-prints, arXiv:2307.04994, doi: 10.48550/arXiv.2307.04994
Vehtari et al. (2017) Vehtari, A., Gelman, A., & Gabry, J. 2017, Statistics and computing, 27, 1413
Vehtari et al. (2015) Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. 2015, arXiv e-prints, arXiv:1507.02646, doi: 10.48550/arXiv.1507.02646
Villaescusa-Navarro et al. (2018) Villaescusa-Navarro, F., Genel, S., Castorina, E., et al. 2018, ApJ, 866, 135, doi: 10.3847/1538-4357/aadba0
Wang et al. (2006) Wang, X., Tegmark, M., Santos, M. G., & Knox, L. 2006, ApJ, 650, 529, doi: 10.1086/506597
Wolz et al. (2014) Wolz, L., Abdalla, F. B., Blake, C., et al. 2014, MNRAS, 441, 3271, doi: 10.1093/mnras/stu792
Zaldarriaga et al. (2004) Zaldarriaga, M., Furlanetto, S. R., & Hernquist, L. 2004, ApJ, 608, 622, doi: 10.1086/386327
Zhao et al. (2023) Zhao, X., Mao, Y., Zuo, S., & Wandelt, B. D. 2023, arXiv e-prints, arXiv:2310.17602, doi: 10.48550/arXiv.2310.17602

$\displaystyle\mathbb{E}\left[f(x_{*})\|\Phi,\sigma_{\eta}\right]$	$\displaystyle=K(x_{*},x\|\Phi)\left(K(x,x\|\Phi)+\sigma_{\eta}^{2}I_{N}\right)^{% -1}y,$	(6)
$\displaystyle\mathrm{cov}\left[f(x_{*})\|\Phi,\sigma_{\eta}\right]$	$\displaystyle=K(x_{},x_{}\|\Phi)$
$\displaystyle-K(x_{*},x\|\Phi)$	$\displaystyle\left(K(x,x\|\Phi)+\sigma_{\eta}^{2}I_{N}\right)^{-1}K(x,x_{*}\|% \Phi).$