Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Modeling Foreground Spatial Variations in 21 cm Gaussian Process Component Separation

Kangning Diao Department of Astronomy, Tsinghua University, Beijing 100084, China Richard D.P. Grumitt Department of Astronomy, Tsinghua University, Beijing 100084, China Yi Mao Department of Astronomy, Tsinghua University, Beijing 100084, China Richard D.P. Grumitt, Yi Mao rgrumitt@tsinghua.edu.cn (RDPG), ymao@tsinghua.edu.cn (YM)
(Received XXX; Revised YYY; Accepted ZZZ)
Abstract

Gaussian processes (GPs) have been extensively utilized as nonparametric models for component separation in 21 cm data analyses. This exploits the distinct spectral behavior of the cosmological and foreground signals, which are modeled through the GP covariance kernel. Previous approaches have employed a global GP kernel along all lines of sight (LoS). In this work, we study Bayesian approaches that allow for spatial variations in foreground kernel parameters, testing them against simulated H I intensity mapping observations. We consider a no-pooling (NP) model, which treats each LoS independently by fitting for separate covariance kernels, and a hierarchical Gaussian Process (HGP) model that allows for variation in kernel parameters between different LoS, regularized through a global hyperprior. We find that accounting for spatial variations in the GP kernel parameters results in a significant improvement in cosmological signal recovery, achieving up to a 30% reduction in the standard deviation of the residual distribution and improved model predictive performance. Allowing for spatial variations in GP kernel parameters also improves the recovery of the H I power spectra and wavelet scattering transform coefficients. Whilst the NP model achieves superior recovery as measured by the residual distribution, it demands extensive computational resources, faces formidable convergence challenges, and is prone to overfitting. Conversely, the HGP model strikes a balance between the accuracy and robustness of the signal recovery. Further improvements to the HGP model will require more physically motivated modeling of foreground spatial variations. Our code can be found in this repo.

Cosmology(343) — Large-scale structure of the universe(902) — 21 cm lines(690) — Gaussian Processes regression(1930)
journal: ApJsoftware: Kymatio (Andreux et al., 2018), numpyro (Phan et al., 2019; Bingham et al., 2019), JAX (Bradbury et al., 2018), GPR4IM (Soares et al., 2022), ArviZ (Kumar et al., 2019)

1 Introduction

Measurements of the 21 cm signal from neutral hydrogen provide a powerful cosmological probe. At high redshifts (z6greater-than-or-equivalent-to𝑧6z\gtrsim 6italic_z ≳ 6), these observations are a key window into the astrophysics of the Epoch of Reionization (EoR) and Cosmic Dawn (CD), as well as allowing for independent constraints on cosmological parameters (Pritchard & Loeb, 2012; Furlanetto, 2016). At lower redshifts, the neutral hydrogen (H I) intensity mapping can be used as a tracer for the large-scale structure of the Universe (Bharadwaj et al., 2001; Battye et al., 2004; Chang et al., 2008). However, the accurate detection and characterization of such cosmological signals face a severe challenge in foreground removal. Any 21 cm signal will be orders of magnitude weaker than the foreground emission, consisting of extragalactic emission along with diffuse and partially polarized Galactic emission (Di Matteo et al., 2002; Oh & Mack, 2003; Wang et al., 2006; Jelić et al., 2008; Bernardi et al., 2009, 2010; Liu & Tegmark, 2011, 2012; Wolz et al., 2014; Alonso et al., 2015; Ghosh et al., 2020; Liu & Shaw, 2020; Cunnington et al., 2021).

Foreground removal methods typically proceed by exploiting the distinct spectral behaviors of the foreground and 21 cm components. The foreground emission here is usually assumed to vary much more smoothly with frequency than the 21 cm signal (Liu & Tegmark, 2011; Alonso et al., 2015; Shaw et al., 2015; Mertens et al., 2018). This can come in the form of parametric modeling where, e.g., synchrotron emission is modeled as following a power law with frequency (Eriksen et al., 2008; Bigot-Sazy et al., 2015; Cunnington et al., 2021; Hibbard et al., 2023). Parametric modeling in this form has advantages insofar as the parameters can be easily interpreted, allowing us to assign physically motivated priors to spectral parameters. However, the spectral energy distributions (SEDs) of the foreground components are more complicated than such simple parametric models. Systematics in such a modeling has the potential to induce significant biases in the final signal extraction when the signals of interest are orders of magnitude weaker than the foreground components (Remazeilles et al., 2016, 2018). It is worth noting that these problems arise even if the underlying emission mechanism follows an exact power law with frequency. Given multiple bodies along the line of sight (LoS) emitting as power laws, averaging effects along the LoS, along with beam and pixel averaging will induce spectral distortions which have the potential to bias simple parametric fits (Chluba et al., 2017).

As an alternative, one can adopt non-parametric methods where one does not assume an explicit form for the emission component SEDs. In this vein, Gaussian processes (GPs) offer an attractive solution and have seen previous applications to 21 cm component separation both on simulations and real data (Mertens et al., 2018, 2020, 2024; Gehlot et al., 2019; Offringa et al., 2019; Ghosh et al., 2020; Hothi et al., 2021; Kern & Liu, 2021; Soares et al., 2022). GPs belong to a class of non-parametric Bayesian models, where one effectively assigns a prior over the space of functions that describe the emission component SEDs (Rasmussen & Williams, 2006). The distinct spectral behaviors of the emission components can be accounted for through the GP covariance kernel, enabling a flexible component separation that can account for departures from simple parametric models. One can also fully marginalize the GP model parameters, allowing for a full propagation of uncertainty through the component separation process (Flaxman et al., 2015; Lalchand & Rasmussen, 2020).

In this work, we consider adapting the GP component separation framework to allow spatial variations in the foreground kernel parameters, so that we can better model spatial variations in the foreground spectral properties (Planck Collaboration et al., 2016a; Remazeilles et al., 2018; Jew & Grumitt, 2020). In previous work, the foreground GP kernel parameters have been assumed to be identical along every LoS, i.e. a “complete pooling” model (e.g., Soares et al. 2022; hereafter S22). We study the impact of allowing spatial variations through a “no pooling” model, where the foreground kernel parameters are allowed to vary independently along every LoS, and a hierarchical model where the kernel parameters are allowed to vary but are assumed to be drawn from some underlying hyper-distribution, whose parameters are jointly fit during the inference process. The hyper-distribution acts to regularize pixel-to-pixel variations in GP kernel parameters and allows for the sharing of information between different LoS (Gelman, 2006; Gelman & Hill, 2006). Hierarchical foreground modeling has previously been used in the context of parametric cosmic microwave background (CMB) component separation in Grumitt et al. (2020).

The structure of the rest of this paper is as follows. In Section 2 we give an overview of GP models, their specific application to 21 cm component separation, and the form of the various GP models we consider in this work. In Section 3 we describe the simulated H I intensity mapping observations, against which we test our component separation method, consisting of a mock MeerKLASS-type survey (Santos et al., 2016), and the computational setup used to perform inference over our models. In Section 4 we describe the results from these simulations, including the uncompressed pixel-level recovery and predictive performance, along with the recovery of H I field summary statistics. In Section 5 we discuss notable modeling and recovery details, covering the effect of bias correction on power spectrum recovery, the modeling choices used to define the hierarchical GP models, and the GP inference approach used in this work. Finally, in Section 6 we summarize our method and results and discuss improvements that can be made in the modeling of foreground spatial variations for component separation.

2 Gaussian Processes

2.1 Overview

Refer to caption
Figure 1: Samples from different GP kernels with different length scales l𝑙litalic_l. It can be seen that decreasing the length scale results in more rapidly varying GP realizations. Functions allowed by the RBF kernel are infinitely differentiable, which manifests in the smooth GP realizations seen in the left panels, irrespective of the length scale. The functions allowed by the exponential kernel are non-differentiable, which manifests itself in the spiky function realizations seen in the right panels.
Refer to caption
Figure 2: An illustration of different GP model variants. In the CP model, which has been used extensively in previous 21 cm component separation analyses, we sample a set of global kernel parameters for all LoS. For the NP model, we sample independent kernel parameters for every LoS. In the HGP model, the full datacube is divided into a set of superpixels, with kernel parameters being shared amongst all LoS within a superpixel. The kernel hyper-parameters are drawn from a learned prior, which is inferred through the use of a global hyper-prior over the prior parameters. This hyper-prior acts to regularize spatial variations in the kernel parameters and allows the sharing of information between different LoS.

GP models are a class of Bayesian non-parametric models that assign a prior distribution over the space of functions (Gelman et al., 2013; Rasmussen & Williams, 2006). Broadly, a GP prior consists of an infinite set of random variables for each of the possible inputs of the function. A GP prior may be written as

f𝒢𝒫(m,K),similar-to𝑓𝒢𝒫𝑚𝐾f\sim\mathcal{GP}(m,K),italic_f ∼ caligraphic_G caligraphic_P ( italic_m , italic_K ) , (1)

where m𝑚mitalic_m is the mean function and K𝐾Kitalic_K is the GP covariance kernel. The mean function defines the function around which realizations from the GP will be distributed, whilst the covariance kernel describes how GP realizations vary around the mean function. The mean function can be decoupled from the GP, writing f=m+g𝑓𝑚𝑔f=m+gitalic_f = italic_m + italic_g where g𝒢𝒫(0,K)similar-to𝑔𝒢𝒫0𝐾g\sim\mathcal{GP}(0,K)italic_g ∼ caligraphic_G caligraphic_P ( 0 , italic_K ), and is commonly chosen to be the zero-function. However, informed choice of the mean function, e.g., using some parametric model around which the GP models deviations, can significantly improve regression performance (Shafieloo et al., 2012; Ruiz-Zapatero et al., 2022). The choice of covariance kernel is key to assigning a prior over the expected behavior of the regression function. For example, when one expects smooth functions, a kernel that allows for infinitely differentiable functions, such as the radial basis function (RBF) kernel, can be selected as follows

K(x,x)=σ2exp(xx22l2).𝐾𝑥superscript𝑥superscript𝜎2superscriptdelimited-∥∥𝑥superscript𝑥22superscript𝑙2K(x,x^{\prime})=\sigma^{2}\exp\left(-\frac{\lVert x-x^{\prime}\rVert^{2}}{2l^{% 2}}\right).italic_K ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_exp ( - divide start_ARG ∥ italic_x - italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_l start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) . (2)

The GP variance σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT determines the expected amplitude over which the GP realizations vary, whilst the length-scale parameter l𝑙litalic_l determines the typical length-scale over which a GP realization varies with x𝑥xitalic_x. In settings where the function is expected to vary in a more spiky fashion, a kernel from the Matern class may be selected,

K(x,x)𝐾𝑥superscript𝑥\displaystyle K(x,x^{\prime})italic_K ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) =σ221ρΓ(ρ)(2ρxxl)ρabsentsuperscript𝜎2superscript21𝜌Γ𝜌superscript2𝜌delimited-∥∥𝑥superscript𝑥𝑙𝜌\displaystyle=\sigma^{2}\frac{2^{1-\rho}}{\Gamma(\rho)}\left(\frac{\sqrt{2\rho% }\lVert x-x^{\prime}\rVert}{l}\right)^{\rho}= italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT divide start_ARG 2 start_POSTSUPERSCRIPT 1 - italic_ρ end_POSTSUPERSCRIPT end_ARG start_ARG roman_Γ ( italic_ρ ) end_ARG ( divide start_ARG square-root start_ARG 2 italic_ρ end_ARG ∥ italic_x - italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ end_ARG start_ARG italic_l end_ARG ) start_POSTSUPERSCRIPT italic_ρ end_POSTSUPERSCRIPT (3)
×Kρ(2ρxxl),absentsubscript𝐾𝜌2𝜌delimited-∥∥𝑥superscript𝑥𝑙\displaystyle\times K_{\rho}\left(\frac{\sqrt{2\rho}\lVert x-x^{\prime}\rVert}% {l}\right),× italic_K start_POSTSUBSCRIPT italic_ρ end_POSTSUBSCRIPT ( divide start_ARG square-root start_ARG 2 italic_ρ end_ARG ∥ italic_x - italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ end_ARG start_ARG italic_l end_ARG ) ,

where σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and l𝑙litalic_l have analogous interpretations as for the squared-exponential kernel, and Kρ()subscript𝐾𝜌K_{\rho}(\cdot)italic_K start_POSTSUBSCRIPT italic_ρ end_POSTSUBSCRIPT ( ⋅ ) is a modified Bessel function of the second kind. The parameter ρ>0𝜌0\rho>0italic_ρ > 0 controls the smoothness of the functions allowed by the Matern kernel, with GPs described by Matern kernels being ρ1𝜌1\lceil\rho\rceil-1⌈ italic_ρ ⌉ - 1 times differentiable, where ρ𝜌\lceil\rho\rceil⌈ italic_ρ ⌉ is the ceiling of ρ𝜌\rhoitalic_ρ. In the limit ρ𝜌\rho\rightarrow\inftyitalic_ρ → ∞, the Matern kernel converges to the RBF kernel.

In the setting where we have some finite set of observed data points yi,i{1,,N}subscript𝑦𝑖𝑖1𝑁y_{i},\,i\in\{1,\ldots,N\}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i ∈ { 1 , … , italic_N }, corresponding to covariates xi,i{1,,N}subscript𝑥𝑖𝑖1𝑁x_{i},\,i\in\{1,\ldots,N\}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i ∈ { 1 , … , italic_N }, GP regression is made tractable by the fact that the joint distribution over the GP realizations at these points is a multivariate Gaussian with mean mi=m(xi),i{1,,N}formulae-sequencesubscript𝑚𝑖𝑚subscript𝑥𝑖𝑖1𝑁m_{i}=m(x_{i}),\,i\in\{1,\ldots,N\}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_m ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_i ∈ { 1 , … , italic_N } and covariance Kij=K(xi,xj),i,j{1,,N}formulae-sequencesubscript𝐾𝑖𝑗𝐾subscript𝑥𝑖subscript𝑥𝑗𝑖𝑗1𝑁K_{ij}=K(x_{i},x_{j}),\,i,j\in\{1,\ldots,N\}italic_K start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_K ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , italic_i , italic_j ∈ { 1 , … , italic_N }. We may write a generic GP regression model (assuming zero mean function) as,

ΦΦ\displaystyle\Phiroman_Φ πΦ(Φ),similar-toabsentsubscript𝜋ΦΦ\displaystyle\sim\pi_{\Phi}(\Phi),∼ italic_π start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT ( roman_Φ ) , (4)
f(x)𝑓𝑥\displaystyle f(x)italic_f ( italic_x ) 𝒢𝒫(0,K(x,x|Φ)),similar-toabsent𝒢𝒫0𝐾𝑥conditionalsuperscript𝑥Φ\displaystyle\sim\mathcal{GP}(0,K(x,x^{\prime}|\Phi)),∼ caligraphic_G caligraphic_P ( 0 , italic_K ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | roman_Φ ) ) ,
σηsubscript𝜎𝜂\displaystyle\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT πση(ση),similar-toabsentsubscript𝜋subscript𝜎𝜂subscript𝜎𝜂\displaystyle\sim\pi_{\sigma_{\eta}}(\sigma_{\eta}),∼ italic_π start_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ) ,
y𝑦\displaystyle yitalic_y πy(y|f(x),ση),similar-toabsentsubscript𝜋𝑦conditional𝑦𝑓𝑥subscript𝜎𝜂\displaystyle\sim\pi_{y}(y|f(x),\sigma_{\eta}),∼ italic_π start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ( italic_y | italic_f ( italic_x ) , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ) ,

where ΦΦ\Phiroman_Φ are the GP covariance parameters with corresponding prior πΦ(Φ)subscript𝜋ΦΦ\pi_{\Phi}(\Phi)italic_π start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT ( roman_Φ ). The regression function f(x)𝑓𝑥f(x)italic_f ( italic_x ) is assigned a GP prior, 𝒢𝒫(0,K(x,x|Φ))𝒢𝒫0𝐾𝑥conditionalsuperscript𝑥Φ\mathcal{GP}(0,K(x,x^{\prime}|\Phi))caligraphic_G caligraphic_P ( 0 , italic_K ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | roman_Φ ) ). The parameter σηsubscript𝜎𝜂\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT is chosen to describe the observation noise process and is assigned some prior πση(ση)subscript𝜋subscript𝜎𝜂subscript𝜎𝜂\pi_{\sigma_{\eta}}(\sigma_{\eta})italic_π start_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ). The data is distributed as yπy(y|f(x),ση)similar-to𝑦subscript𝜋𝑦conditional𝑦𝑓𝑥subscript𝜎𝜂y\sim\pi_{y}(y|f(x),\sigma_{\eta})italic_y ∼ italic_π start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ( italic_y | italic_f ( italic_x ) , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ), depending on the GP function realizations and the observation noise process. In the case where the likelihood is Gaussian, the GP prior is conjugate to the likelihood, and the GP function realizations can be analytically marginalized. Taking σηsubscript𝜎𝜂\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT to be the Gaussian noise standard deviation, the analytic marginalization yields the simpler model,

ΦΦ\displaystyle\Phiroman_Φ πΦ(Φ),similar-toabsentsubscript𝜋ΦΦ\displaystyle\sim\pi_{\Phi}(\Phi),∼ italic_π start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT ( roman_Φ ) , (5)
σηsubscript𝜎𝜂\displaystyle\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT πση(ση),similar-toabsentsubscript𝜋subscript𝜎𝜂subscript𝜎𝜂\displaystyle\sim\pi_{\sigma_{\eta}}(\sigma_{\eta}),∼ italic_π start_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ) ,
y𝑦\displaystyle yitalic_y 𝒩(y|0,K(x,x|Φ)+ση2IN),\displaystyle\sim\mathcal{N}(y|0,K(x,x|\Phi)+\sigma_{\eta}^{2}I_{N}),∼ caligraphic_N ( italic_y | 0 , italic_K ( italic_x , italic_x | roman_Φ ) + italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) ,

where 𝒩𝒩\mathcal{N}caligraphic_N is a multivariate Gaussian distribution and INsubscript𝐼𝑁I_{N}italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT is the N×N𝑁𝑁N\times Nitalic_N × italic_N identity matrix. This amounts to a significant dimensionality reduction, with posterior inference only needing to be performed over the GP covariance and noise parameters.

A typical approach at this point is to optimize the log-marginal posterior with respect to ΦΦ\Phiroman_Φ and σ𝜎\sigmaitalic_σ, then fixing their values at the maximum a-posteriori (MAP) values for the downstream inference of f(x)𝑓𝑥f(x)italic_f ( italic_x ) (Rasmussen & Williams, 2006). This can be viewed under the empirical Bayes framework, where the data are used to set the prior parameters, which are fixed for inference over the remaining model parameters. This approach has the advantage of being fast compared to full integration over the marginal posterior. However, it comes with two significant drawbacks. By fixing ΦΦ\Phiroman_Φ and σ𝜎\sigmaitalic_σ at their MAP values we do not fully propagate the posterior uncertainty on them to our inferences over f(x)𝑓𝑥f(x)italic_f ( italic_x ). In addition, the MAP estimate for parameter values is generally not a good estimator. For many models, using the MAP value as an estimator can lead to poor predictive performance and can result in significantly biased parameter inferences (MacKay, 2003; Betancourt, 2017). This has previously been noted in GP models, where MAP estimation has been observed to tend to result in overfitting (Lalchand & Rasmussen, 2020). Problems with MAP estimation are particularly apparent for hierarchical Bayesian models, which we consider in this work, where correlations between model hyperparameters and latent variables result in posterior geometries where the MAP is very far from the typical set, i.e., where the probability mass of the posterior is concentrated (Betancourt & Girolami, 2015; Betancourt, 2017; Hoffman et al., 2019). Given these problems with MAP estimation, we consider fully Bayesian GP models throughout this work, and demonstrate the degradation in our cosmological signal recovery using MAP estimation in Section 5.3.

For fixed ΦΦ\Phiroman_Φ and σηsubscript𝜎𝜂\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT, the expectation and covariance of the GP function realizations at some points xsubscript𝑥x_{*}italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT may be obtained through the standard prediction formulae,

𝔼[f(x)|Φ,ση]𝔼delimited-[]conditional𝑓subscript𝑥Φsubscript𝜎𝜂\displaystyle\mathbb{E}\left[f(x_{*})|\Phi,\sigma_{\eta}\right]blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) | roman_Φ , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ] =K(x,x|Φ)(K(x,x|Φ)+ση2IN)1y,absent𝐾subscript𝑥conditional𝑥Φsuperscript𝐾𝑥conditional𝑥Φsuperscriptsubscript𝜎𝜂2subscript𝐼𝑁1𝑦\displaystyle=K(x_{*},x|\Phi)\left(K(x,x|\Phi)+\sigma_{\eta}^{2}I_{N}\right)^{% -1}y,= italic_K ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_x | roman_Φ ) ( italic_K ( italic_x , italic_x | roman_Φ ) + italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_y , (6)
cov[f(x)|Φ,ση]covdelimited-[]conditional𝑓subscript𝑥Φsubscript𝜎𝜂\displaystyle\mathrm{cov}\left[f(x_{*})|\Phi,\sigma_{\eta}\right]roman_cov [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) | roman_Φ , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ] =K(x,x|Φ)absent𝐾subscript𝑥conditionalsubscript𝑥Φ\displaystyle=K(x_{*},x_{*}|\Phi)= italic_K ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | roman_Φ )
K(x,x|Φ)𝐾subscript𝑥conditional𝑥Φ\displaystyle-K(x_{*},x|\Phi)- italic_K ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_x | roman_Φ ) (K(x,x|Φ)+ση2IN)1K(x,x|Φ).superscript𝐾𝑥conditional𝑥Φsuperscriptsubscript𝜎𝜂2subscript𝐼𝑁1𝐾𝑥conditionalsubscript𝑥Φ\displaystyle\left(K(x,x|\Phi)+\sigma_{\eta}^{2}I_{N}\right)^{-1}K(x,x_{*}|% \Phi).( italic_K ( italic_x , italic_x | roman_Φ ) + italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_K ( italic_x , italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | roman_Φ ) .

Posterior samples for ΦΦ\Phiroman_Φ and σηsubscript𝜎𝜂\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT obtained from the marginalized model in Equation (5) can be used to construct an ensemble estimate for the posterior predictive distribution over f(x)𝑓subscript𝑥f(x_{*})italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) (Lalchand & Rasmussen, 2020). The expected signal is obtained by calculating the ensemble average,

𝔼[f(x)]=1Mj=1M𝔼[f(x)|Φj,σηj],𝔼delimited-[]𝑓subscript𝑥1𝑀superscriptsubscript𝑗1𝑀𝔼delimited-[]conditional𝑓subscript𝑥superscriptΦ𝑗superscriptsubscript𝜎𝜂𝑗\mathbb{E}\left[f(x_{*})\right]=\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}\left[f(x_{% *})|\Phi^{j},\sigma_{\eta}^{j}\right],blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ] = divide start_ARG 1 end_ARG start_ARG italic_M end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) | roman_Φ start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ] , (7)

where the sum is over the M𝑀Mitalic_M posterior samples, indexed by j𝑗jitalic_j. The covariance of the posterior predictive samples can be evaluated by drawing samples from the posterior predictive corresponding to each of the posterior samples, fppj(x)𝒩(𝔼[f(x)|Φj,σηj],cov[f(x)|Φj,σηj])similar-tosuperscriptsubscript𝑓𝑝𝑝𝑗subscript𝑥𝒩𝔼delimited-[]conditional𝑓subscript𝑥superscriptΦ𝑗superscriptsubscript𝜎𝜂𝑗covdelimited-[]conditional𝑓subscript𝑥superscriptΦ𝑗superscriptsubscript𝜎𝜂𝑗f_{pp}^{j}(x_{*})\sim\mathcal{N}(\mathbb{E}\left[f(x_{*})|\Phi^{j},\sigma_{% \eta}^{j}\right],\mathrm{cov}\left[f(x_{*})|\Phi^{j},\sigma_{\eta}^{j}\right])italic_f start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ∼ caligraphic_N ( blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) | roman_Φ start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ] , roman_cov [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) | roman_Φ start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ] ), and evaluating the ensemble covariance,

cov[f(x)]=1M1j=1M{(fppj(x)𝔼[f(x)])\displaystyle\mathrm{cov}\left[f(x_{*})\right]=\frac{1}{M-1}\sum_{j=1}^{M}% \left\{\left(f_{pp}^{j}(x_{*})-\mathbb{E}[f(x_{*})]\right)\right.roman_cov [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ] = divide start_ARG 1 end_ARG start_ARG italic_M - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT { ( italic_f start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) - blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ] ) (8)
(fppj(x)𝔼[f(x)])}.\displaystyle\left.\otimes\left(f_{pp}^{j}(x_{*})-\mathbb{E}[f(x_{*})]\right)% \right\}.⊗ ( italic_f start_POSTSUBSCRIPT italic_p italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) - blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ] ) } .

2.2 Gaussian Processes for Component Separation

GP regression can be utilized in component separation. In this paper, we focus on 21 cm component separation, and as such consider a sky signal consisting of astrophysical foreground emission, polarization leakage, and the cosmological H I signal. These will be described in detail in Section 3. The core idea behind using GP models for component separation is to model the total sky signal as an additive GP realization, i.e., a GP realization that is the sum of foreground, polarization leakage, and H I terms which are each assigned their own GP prior. Each emission component can then be separated by exploiting the fact that they each display different behavior as a function of frequency, which can be modeled through their GP covariance kernels.

For our sky model, we follow S22, decomposing the sky emission into a cosmological H I component, fHIsubscript𝑓HIf_{\rm HI}italic_f start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT and a foreground component, ffgsubscript𝑓fgf_{\rm fg}italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT. The observed sky signal along a given LoS may then be written as

dobs(ν)=fHI+ffg+ην,subscript𝑑obs𝜈subscript𝑓HIsubscript𝑓fgsubscript𝜂𝜈d_{\rm obs}(\nu)=f_{\rm HI}+f_{\rm fg}+\eta_{\nu},italic_d start_POSTSUBSCRIPT roman_obs end_POSTSUBSCRIPT ( italic_ν ) = italic_f start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT + italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT + italic_η start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT , (9)

where ην𝒩(0,σν2)similar-tosubscript𝜂𝜈𝒩0superscriptsubscript𝜎𝜈2\eta_{\nu}\sim\mathcal{N}(0,\sigma_{\nu}^{2})italic_η start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) is additive Gaussian noise at the observing frequency ν𝜈\nuitalic_ν. The foreground signal is further decomposed as

ffg=fsm+fpol,subscript𝑓fgsubscript𝑓smsubscript𝑓polf_{\rm fg}=f_{\rm sm}+f_{\rm pol},italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT + italic_f start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT , (10)

where fsmsubscript𝑓smf_{\rm sm}italic_f start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT is the signal due to astrophysical emission that varies smoothly with frequency, and fpolsubscript𝑓polf_{\rm pol}italic_f start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT is the signal due to polarization leakage. The total GP covariance kernel is given by

K=KHI+Kfg=KHI+Ksm+Kpol,𝐾subscript𝐾HIsubscript𝐾fgsubscript𝐾HIsubscript𝐾smsubscript𝐾polK=K_{\rm HI}+K_{\rm fg}=K_{\rm HI}+K_{\rm sm}+K_{\rm pol},italic_K = italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT = italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT , (11)

where KHIsubscript𝐾HIK_{\rm HI}italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT, Ksmsubscript𝐾smK_{\rm sm}italic_K start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT, and Kpolsubscript𝐾polK_{\rm pol}italic_K start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT are the covariance kernels for the HI, smooth astrophysical foreground, and polarization leakage terms respectively.

For the Gaussian observation model defined through Equation (9), and assuming a fixed noise variance for all frequencies σν2=ση2superscriptsubscript𝜎𝜈2superscriptsubscript𝜎𝜂2\sigma_{\nu}^{2}=\sigma_{\eta}^{2}italic_σ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, we can write down the prediction formulae for the expectation and covariance of the foreground signal for the fixed GP covariance kernel parameters,

𝔼[ffg|K,ση]𝔼delimited-[]conditionalsubscript𝑓fg𝐾subscript𝜎𝜂\displaystyle\mathbb{E}\left[f_{\rm fg}|K,\sigma_{\eta}\right]blackboard_E [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT | italic_K , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ] =Kfg(Kfg+KHI+ση2IN)1dobs,absentsubscript𝐾fgsuperscriptsubscript𝐾fgsubscript𝐾HIsuperscriptsubscript𝜎𝜂2subscript𝐼𝑁1subscript𝑑obs\displaystyle=K_{\rm fg}(K_{\rm fg}+K_{\rm HI}+\sigma_{\eta}^{2}I_{N})^{-1}d_{% \rm obs},= italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ( italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT roman_obs end_POSTSUBSCRIPT , (12)
cov[ffg|K,ση]covdelimited-[]conditionalsubscript𝑓fg𝐾subscript𝜎𝜂\displaystyle\mathrm{cov}\left[f_{\rm fg}|K,\sigma_{\eta}\right]roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT | italic_K , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ] =KfgKfg(Kfg+KHI+ση2IN)1Kfg,absentsubscript𝐾fgsubscript𝐾fgsuperscriptsubscript𝐾fgsubscript𝐾HIsuperscriptsubscript𝜎𝜂2subscript𝐼𝑁1subscript𝐾fg\displaystyle=K_{\rm fg}-K_{\rm fg}\left(K_{\rm fg}+K_{\rm HI}+\sigma_{\eta}^{% 2}I_{N}\right)^{-1}K_{\rm fg},= italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT - italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ( italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ,

where N𝑁Nitalic_N denotes the number of observed frequencies. Given predictions for the foreground signal, 𝔼[ffg|K,ση]𝔼delimited-[]conditionalsubscript𝑓fg𝐾subscript𝜎𝜂\mathbb{E}[f_{\rm fg}|K,\sigma_{\eta}]blackboard_E [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT | italic_K , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ], one can calculate the residual fressubscript𝑓resf_{\rm res}italic_f start_POSTSUBSCRIPT roman_res end_POSTSUBSCRIPT as our estimate of the 21 cm signal,

f21cm,est=fres=dobs𝔼[ffg|K,ση].subscript𝑓21cmestsubscript𝑓ressubscript𝑑obs𝔼delimited-[]conditionalsubscript𝑓fg𝐾subscript𝜎𝜂f_{\rm 21cm,est}=f_{\rm res}=d_{\rm obs}-\mathbb{E}[f_{\rm fg}|K,\sigma_{\eta}].italic_f start_POSTSUBSCRIPT 21 roman_c roman_m , roman_est end_POSTSUBSCRIPT = italic_f start_POSTSUBSCRIPT roman_res end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT roman_obs end_POSTSUBSCRIPT - blackboard_E [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT | italic_K , italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ] . (13)

The final expected foreground and 21 cm signals, along with their corresponding uncertainties, can be obtained by following the procedure outlined in Section 2.1 for evaluating the uncertainty over the GP posterior predictive samples, using the posterior samples for the GP kernel and noise parameters.

We follow S22 in selecting the kernels for our emission components. Astrophysical foreground emission is expected to vary smoothly as a function of frequency. As such, astrophysical foregrounds are modeled through an RBF kernel,

Ksm(ν,ν)=σsm2exp(νν22lsm2),K_{\rm sm}(\nu,\nu^{\prime})=\sigma^{2}_{\rm sm}\exp{\left(-\frac{\lVert\nu-% \nu^{\prime}\lVert^{2}}{2l_{\rm sm}^{2}}\right)},italic_K start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT ( italic_ν , italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT roman_exp ( - divide start_ARG ∥ italic_ν - italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_l start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) , (14)

where σsmsubscript𝜎sm\sigma_{\rm sm}italic_σ start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT is a hyperparameter that controls the amplitude of the spectrum, whilst lsmsubscript𝑙sml_{\rm sm}italic_l start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT determines the correlation between frequency channels. Larger values of lsmsubscript𝑙sml_{\rm sm}italic_l start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT indicate data in all frequency channels are highly correlated, resulting in a smoother spectrum. The left panel of Figure 1 shows samples from a GP with an RBF kernel, with kernel variance σ2=1superscript𝜎21\sigma^{2}=1italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1 and length scales l=0.5𝑙0.5l=0.5italic_l = 0.5 and l=0.05𝑙0.05l=0.05italic_l = 0.05. It can be seen that the GP samples vary smoothly for both length scales, which is to be expected given that the functions allowed by the RBF kernel are infinitely differentiable.

For the 21 cm signal, we apply the exponential kernel,

KHI(ν,ν)=σHI2exp(νν2lHI).subscript𝐾HI𝜈superscript𝜈subscriptsuperscript𝜎2HIdelimited-∥∥𝜈superscript𝜈2subscript𝑙HIK_{\rm HI}(\nu,\nu^{\prime})=\sigma^{2}_{\rm HI}\exp{\left(-\frac{\lVert\nu-% \nu^{\prime}\rVert}{2l_{\rm HI}}\right)}.italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT ( italic_ν , italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT roman_exp ( - divide start_ARG ∥ italic_ν - italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ end_ARG start_ARG 2 italic_l start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT end_ARG ) . (15)

The exponential kernel corresponds to the Matern kernel with ρ=1/2𝜌12\rho=1/2italic_ρ = 1 / 2. The functions allowed by this kernel are non-differentiable, and therefore vary in a very spiky fashion with frequency, as demonstrated in the right panel of Figure 1. The σHIsubscript𝜎HI\sigma_{\rm HI}italic_σ start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT and lHIsubscript𝑙HIl_{\rm HI}italic_l start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT parameters control the amplitude and correlation length scale of the H I signal.

Alongside astrophysical foreground emission, we also consider the effects of polarization leakage in our simulations. This is also expected to vary smoothly with frequency and is therefore modeled with an RBF kernel,

Kpol(ν,ν)=σpol2exp(νν22lpol2),subscript𝐾pol𝜈superscript𝜈subscriptsuperscript𝜎2polsuperscriptdelimited-∥∥𝜈superscript𝜈22superscriptsubscript𝑙pol2K_{\rm pol}(\nu,\nu^{\prime})=\sigma^{2}_{\rm pol}\exp{\left(-\frac{\lVert\nu-% \nu^{\prime}\rVert^{2}}{2l_{\rm pol}^{2}}\right)},italic_K start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT ( italic_ν , italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT roman_exp ( - divide start_ARG ∥ italic_ν - italic_ν start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_l start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) , (16)

where σpolsubscript𝜎pol\sigma_{\rm pol}italic_σ start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT and lpolsubscript𝑙poll_{\rm pol}italic_l start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT control the amplitude and correlation length scale of the polarization leakage respectively. In this work we will consider a three-kernel model, with the total GP kernel being given by Equation (11), and also a two-kernel model where we model the emission due to astrophysical foregrounds and polarization leakage through a single RBF kernel. For the two-kernel model, the total GP kernel can be written as K=KHI+Ksm𝐾subscript𝐾HIsubscript𝐾smK=K_{\rm HI}+K_{\rm sm}italic_K = italic_K start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT + italic_K start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT.

2.3 Gaussian Process Model Variants

In this section, we summarize the different GP models that we study in this work, which account for spatial variations in foreground emission in different ways. A graphical illustration of the models is shown in Figure 2.

2.3.1 Complete Pooling (CP) model

The standard GP model used for component separation (e.g., S22) uses the same set of kernels to describe the emission spectrum along every LoS, i.e., the kernel parameters for every LoS are pooled to a single value. In this work, we refer to this as the complete pooling (CP) model, which is used as our baseline throughout.

For the GP kernel parameters, we use a set of weakly informative priors, chosen to place the probability mass over the expected scales of the parameters. In general, when information on the expected scale of a parameter is available, it is appropriate to select a prior distribution that concentrates its probability mass over those scales. For example, for a strictly positive parameter θ𝜃\thetaitalic_θ that is expected to have a value 1similar-toabsent1\sim 1∼ 1, one might select the half-normal prior θHalfNormal(σ=1)similar-to𝜃HalfNormal𝜎1\theta\sim\rm{HalfNormal}(\sigma=1)italic_θ ∼ roman_HalfNormal ( italic_σ = 1 ). This has the benefit of helping to regularize our inferences using prior information and avoids the pitfalls associated with common choices for so-called uninformative prior distributions. For example, selecting a very broad uniform distribution as a default prior concentrates most of the prior probability mass at extreme values within the support of the distribution, which can bias the subsequent posterior inference. From an algorithmic perspective, this can also cause problems for Markov Chain Monte Carlo (MCMC) algorithms. Without regularization from a weakly informative prior, the posterior geometry can be such that it severely frustrates convergence, in some cases jeopardizing the geometric ergodicity of the MCMC algorithm. These issues appear frequently in the fully Bayesian GP and hierarchical models we consider in this work. For a detailed discussion of prior choice and the impacts on inference and sampling, we refer the reader to (Gelman et al., 2013; Stan Development Team, 2012; Betancourt & Girolami, 2015; Gelman & Hennig, 2017; Gelman et al., 2017; Gabry et al., 2019).

The weakly informative priors for the GP kernel parameters in the CP model are given by

σsm2/102superscriptsubscript𝜎sm2superscript102\displaystyle\sigma_{\rm sm}^{2}/10^{-2}italic_σ start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT LogNormal(0,4),similar-toabsentLogNormal04\displaystyle\sim\rm{LogNormal}(0,4),∼ roman_LogNormal ( 0 , 4 ) , (17)
lsmsubscript𝑙sm\displaystyle l_{\rm sm}italic_l start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT InverseGamma(2,1),similar-toabsentInverseGamma21\displaystyle\sim\rm{InverseGamma}(2,1),∼ roman_InverseGamma ( 2 , 1 ) ,
σpol2/106superscriptsubscript𝜎pol2superscript106\displaystyle\sigma_{\rm pol}^{2}/10^{-6}italic_σ start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT LogNormal(0,4),similar-toabsentLogNormal04\displaystyle\sim\rm{LogNormal}(0,4),∼ roman_LogNormal ( 0 , 4 ) ,
lpolsubscript𝑙pol\displaystyle l_{\rm pol}italic_l start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT InverseGamma(5,1),similar-toabsentInverseGamma51\displaystyle\sim\rm{InverseGamma}(5,1),∼ roman_InverseGamma ( 5 , 1 ) ,
σHI2/108superscriptsubscript𝜎HI2superscript108\displaystyle\sigma_{\rm HI}^{2}/10^{-8}italic_σ start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT HalfNormal(1),similar-toabsentHalfNormal1\displaystyle\sim\rm{HalfNormal}(1),∼ roman_HalfNormal ( 1 ) ,
lHIsubscript𝑙HI\displaystyle l_{\rm HI}italic_l start_POSTSUBSCRIPT roman_HI end_POSTSUBSCRIPT HalfNormal(0.02).similar-toabsentHalfNormal0.02\displaystyle\sim\rm{HalfNormal}(0.02).∼ roman_HalfNormal ( 0.02 ) .

Here LogNormal(μ,ς)LogNormal𝜇𝜍\rm{LogNormal}(\mu,\varsigma)roman_LogNormal ( italic_μ , italic_ς ) corresponds to a log-normal distribution with mean parameter μ𝜇\muitalic_μ and scale parameter ς𝜍\varsigmaitalic_ς, InverseGamma(α,β)InverseGamma𝛼𝛽\rm{InverseGamma}(\alpha,\beta)roman_InverseGamma ( italic_α , italic_β ) is the inverse-gamma distribution with shape parameter α𝛼\alphaitalic_α and scale parameter β𝛽\betaitalic_β, and HalfNormal(ς)HalfNormal𝜍\rm{HalfNormal}(\varsigma)roman_HalfNormal ( italic_ς ) is the half-normal distribution with scale parameter ς𝜍\varsigmaitalic_ς.

The astrophysical foreground and polarization leakage components are expected to be orders of magnitude brighter than the H I signal. For the GP variance parameters of these components, we select zero-avoiding log-normal priors that concentrate over the expected scales of the two emission components. For the length-scale parameters of the foreground components, we use inverse-gamma priors. The inverse-gamma distribution is a common choice for GP length scale priors. The distribution places negligible probability masses on very small length scales. For length scales below the minimum covariate spacing, the GP likelihood will plateau, putting considerable probability mass at these small scales. This can result in overfitting and also induce convergence problems for the MCMC algorithms (Stan Development Team, 2012). For the smoothly varying foreground emission components we do not expect such high-frequency signals, and therefore use the inverse-gamma prior to remove these modes. The remainder of the prior concentrates around the expected length scales of the foreground components, whilst having a heavy right tail to allow for low-frequency modes.

For the H I signal, we choose a half-normal prior for the GP kernel variance. This allows for potentially zero cosmological signal whilst down-weighting very large amplitude scales. For the length scale parameter, in this case, we also use a half normal prior, allowing the H I signal to be potentially completely uncorrelated in frequency, and marginalizing over any very high-frequency modes. We find that using a half-normal prior for the faint H I component does not frustrate the convergence of our sampling algorithm.

We account for noise in our GP model by fitting for a single noise standard deviation σηsubscript𝜎𝜂\sigma_{\eta}italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT, to which we assign the prior

ση/107HalfNormal(10).similar-tosubscript𝜎𝜂superscript107HalfNormal10\sigma_{\eta}/10^{-7}\sim\rm{HalfNormal}(10).italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT / 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT ∼ roman_HalfNormal ( 10 ) . (18)

In real experiments, the true noise covariance can be challenging to estimate. This can be accounted for by marginalizing over the noise process, with some prior selected to concentrate over the expected noise scales. In more realistic setups, we may expect more complicated noise contributions e.g., heteroscedastic noise, where it would be appropriate to use a more sophisticated noise model than the uniform white noise we consider here. We leave more realistic noise modeling to future work, focusing here on spatial variations in foreground emission. We use the same noise model for all the GP models considered in this work.

2.3.2 No-Pooling (NP) model

The CP model assumes that the GP kernel parameters for the foreground emission are identical for every LoS. However, the spectral properties of diffuse Galactic emission are known to vary significantly across the sky (Planck Collaboration et al., 2016a; Remazeilles et al., 2018; Jew & Grumitt, 2020). Using a single set of GP kernel parameters for every LoS may fail to capture these variations, potentially resulting in mismodeling and a failure to accurately recover the underlying sky emission. At the opposite extreme from the CP model, we may consider a no-pooling (NP) model, where the foreground GP kernel parameters are assumed to vary completely independently along every LoS. We do not expect such variations in the H I GP prior along each LoS, and as such we retain the global kernel for the H I signal used in the CP model. The priors for the GP kernel parameters take the same form as in Equation (17), with the difference that now we have separate {σsm,i,lsm,i,σpol,i,lpol,i}subscript𝜎smisubscript𝑙smisubscript𝜎polisubscript𝑙poli\{\sigma_{\rm{sm},i},l_{\rm{sm},i},\sigma_{\rm{pol},i},l_{\rm{pol},i}\}{ italic_σ start_POSTSUBSCRIPT roman_sm , roman_i end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT roman_sm , roman_i end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT roman_pol , roman_i end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT roman_pol , roman_i end_POSTSUBSCRIPT } for each LoS i𝑖iitalic_i.

2.3.3 Hierarchical Gaussian Process (HGP) model

Sitting between the CP and NP models described above, we may consider a class of hierarchical Gaussian process (HGP) models. In this setting, we divide the whole data set into a set of superpixels containing multiple LoS, where the superpixel size is smaller than the full data set over the spatial dimensions. Within each superpixel, the smooth foreground and polarization leakage kernel parameters are assumed to be constant. Ideally, we would consider separate kernel parameters for each LoS. However, the memory requirements for such a setup render such a model intractable given our computational resources. We discuss the details of the HGP model setup in Section 3.2.

The priors for the foreground kernel parameters in a given superpixel p𝑝pitalic_p are given by,

σsm,p2/102superscriptsubscript𝜎sm𝑝2superscript102\displaystyle\sigma_{\mathrm{sm},p}^{2}/10^{-2}italic_σ start_POSTSUBSCRIPT roman_sm , italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT HalfNormal(ςsm),similar-toabsentHalfNormalsubscript𝜍sm\displaystyle\sim\rm{HalfNormal}(\varsigma_{\rm sm}),∼ roman_HalfNormal ( italic_ς start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT ) , (19)
lsm,psubscript𝑙sm𝑝\displaystyle l_{\mathrm{sm},p}italic_l start_POSTSUBSCRIPT roman_sm , italic_p end_POSTSUBSCRIPT InverseGamma(αsm,βsm),similar-toabsentInverseGammasubscript𝛼smsubscript𝛽sm\displaystyle\sim\rm{InverseGamma}(\alpha_{\rm sm},\beta_{\rm sm}),∼ roman_InverseGamma ( italic_α start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT ) ,
σpol,p2/106superscriptsubscript𝜎pol𝑝2superscript106\displaystyle\sigma_{\mathrm{pol},p}^{2}/10^{-6}italic_σ start_POSTSUBSCRIPT roman_pol , italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT HalfNormal(ςpol),similar-toabsentHalfNormalsubscript𝜍pol\displaystyle\sim\rm{HalfNormal}(\varsigma_{\rm pol}),∼ roman_HalfNormal ( italic_ς start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT ) ,
lpol,psubscript𝑙pol𝑝\displaystyle l_{\mathrm{pol},p}italic_l start_POSTSUBSCRIPT roman_pol , italic_p end_POSTSUBSCRIPT InverseGamma(αpol,βpol),similar-toabsentInverseGammasubscript𝛼polsubscript𝛽pol\displaystyle\sim\rm{InverseGamma}(\alpha_{\rm pol},\beta_{\rm pol}),∼ roman_InverseGamma ( italic_α start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT ) ,

where {ςsm,αsm,βsm,ςpol,αpol,βpol}subscript𝜍smsubscript𝛼smsubscript𝛽smsubscript𝜍polsubscript𝛼polsubscript𝛽pol\{\varsigma_{\rm sm},\alpha_{\rm sm},\beta_{\rm sm},\varsigma_{\rm pol},\alpha% _{\rm pol},\beta_{\rm pol}\}{ italic_ς start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT , italic_α start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT , italic_ς start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT , italic_α start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT } are a set of global hyper-parameters that define the prior distribution over the foreground kernel parameters in each superpixel.

To these global hyper-parameters, we assign a set of weakly informative hyper-priors

ςsmLogNormal(0,4),similar-tosubscript𝜍smLogNormal04\displaystyle\varsigma_{\rm sm}\sim\rm{LogNormal}(0,4),italic_ς start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT ∼ roman_LogNormal ( 0 , 4 ) , (20)
αsmLogNormal(1,4),similar-tosubscript𝛼smLogNormal14\displaystyle\alpha_{\rm sm}\sim\rm{LogNormal}(1,4),italic_α start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT ∼ roman_LogNormal ( 1 , 4 ) ,
βsmLogNormal(0,4),similar-tosubscript𝛽smLogNormal04\displaystyle\beta_{\rm sm}\sim\rm{LogNormal}(0,4),italic_β start_POSTSUBSCRIPT roman_sm end_POSTSUBSCRIPT ∼ roman_LogNormal ( 0 , 4 ) ,
ςpolLogNormal(0,4),similar-tosubscript𝜍polLogNormal04\displaystyle\varsigma_{\rm pol}\sim\rm{LogNormal}(0,4),italic_ς start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT ∼ roman_LogNormal ( 0 , 4 ) ,
αpolLogNormal(1,4),similar-tosubscript𝛼polLogNormal14\displaystyle\alpha_{\rm pol}\sim\rm{LogNormal}(1,4),italic_α start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT ∼ roman_LogNormal ( 1 , 4 ) ,
βpolLogNormal(0,4).similar-tosubscript𝛽polLogNormal04\displaystyle\beta_{\rm pol}\sim\rm{LogNormal}(0,4).italic_β start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT ∼ roman_LogNormal ( 0 , 4 ) .

These hyper-priors are chosen such that the potential priors over the foreground kernel parameters for each superpixel correspond broadly to the expected scales of those foreground parameters. A key feature of the hierarchical approach is that we jointly fit for these hyper-parameters, allowing us to learn the effective prior over the superpixel parameters and share information between superpixels. This allows for spatial variations in the kernel parameters, whilst regularizing those variations through the hyper-prior. This can help to reduce the tendency for NP models to overfit, particularly with very noisy data where the assumption of total independence means that there is no regularizing effect from the global hyperprior. As in the NP model, we assume a global H I kernel.

3 The 21 cm component separation simulations and computation

In this section, we describe the simulated data set used to test our GP models for 21 cm component separation, and the computational methods and setup we use for inference.

3.1 Mock data set

Our simulated sky signal consists of foreground emission, instrumental effects, and a cosmological 21 cm signal. We adopted the data set111https://www.dropbox.com/sh/9zftczeypu7xgt3/AABiiBw_0SBPrLgSHsjiISz8a?dl=0 provided by S22, which was originally used in Cunnington et al. (2021). We cut a box from the S22 data set on a grid with (Nx,Ny,Nz)=(256,256,256)subscript𝑁𝑥subscript𝑁𝑦subscript𝑁𝑧256256256(N_{x},N_{y},N_{z})=(256,256,256)( italic_N start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) = ( 256 , 256 , 256 ). The corresponding box size is (Lx,Ly,Lz)=(1436,1436,1193)Mpcsubscript𝐿𝑥subscript𝐿𝑦subscript𝐿𝑧143614361193Mpc(L_{x},L_{y},L_{z})=(1436,1436,1193)\>\rm Mpc( italic_L start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) = ( 1436 , 1436 , 1193 ) roman_Mpc, with a frequency range along the z𝑧zitalic_z-axis of 899MHz<ν<1155MHz899MHz𝜈1155MHz899\>\rm{MHz}<\nu<1155\>\rm{MHz}899 roman_MHz < italic_ν < 1155 roman_MHz, corresponding to a redshift range of 0.23<z<0.580.23𝑧0.580.23<z<0.580.23 < italic_z < 0.58. The frequency resolution of the simulated data set is δν=1MHz𝛿𝜈1MHz\delta\nu=1\>\rm MHzitalic_δ italic_ν = 1 roman_MHz.

The simulated data serve as a conceptual model for future H I intensity mapping (IM) surveys, akin to the MeerKLASS survey (Santos et al., 2016). A flat-sky approximation is applied here, with curved-sky effects over the small survey area being negligible. In the following we summarize the simulation procedure for each of the sky components, referring the readers to Appendix A in S22 and Cunnington et al. (2021) for further details.

3.1.1 Foreground Model

The total foreground emission, δTFG𝛿subscript𝑇FG\delta T_{\rm FG}italic_δ italic_T start_POSTSUBSCRIPT roman_FG end_POSTSUBSCRIPT consists of four components,

δTFG=δTsync+δTfree+δTpoint+δTpol,𝛿subscript𝑇FG𝛿subscript𝑇sync𝛿subscript𝑇free𝛿subscript𝑇point𝛿subscript𝑇pol\delta T_{\rm FG}=\delta T_{\rm sync}+\delta T_{\rm free}+\delta T_{\rm point}% +\delta T_{\rm pol},italic_δ italic_T start_POSTSUBSCRIPT roman_FG end_POSTSUBSCRIPT = italic_δ italic_T start_POSTSUBSCRIPT roman_sync end_POSTSUBSCRIPT + italic_δ italic_T start_POSTSUBSCRIPT roman_free end_POSTSUBSCRIPT + italic_δ italic_T start_POSTSUBSCRIPT roman_point end_POSTSUBSCRIPT + italic_δ italic_T start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT , (21)

where δTsync𝛿subscript𝑇sync\delta T_{\rm sync}italic_δ italic_T start_POSTSUBSCRIPT roman_sync end_POSTSUBSCRIPT is due to Galactic synchrotron emission, δTfree𝛿subscript𝑇free\delta T_{\rm free}italic_δ italic_T start_POSTSUBSCRIPT roman_free end_POSTSUBSCRIPT is due to Galactic free-free emission, δTpoint𝛿subscript𝑇point\delta T_{\rm point}italic_δ italic_T start_POSTSUBSCRIPT roman_point end_POSTSUBSCRIPT is due to extragalactic point sources, and δTpol𝛿subscript𝑇pol\delta T_{\rm pol}italic_δ italic_T start_POSTSUBSCRIPT roman_pol end_POSTSUBSCRIPT is due to polarization leakage.

Galactic synchrotron emission is caused by cosmic-ray electrons spiraling in the Galactic magnetic field. The Planck Legacy Archive222http://pla.esac.esa.int/pla/ FFP10 simulations were used to generate synchrotron emission maps at 217 and 353 GHz. These simulations used the 408 MHz all-sky map with 56arcmin56arcmin56\ \rm arcmin56 roman_arcmin resolution as a synchrotron emission template, enhanced to a Healpix (Górski et al., 2005) resolution of Nside=2048subscript𝑁side2048N_{\rm side}=2048italic_N start_POSTSUBSCRIPT roman_side end_POSTSUBSCRIPT = 2048 using Gaussian random field realizations (Haslam et al., 1981, 1982; Remazeilles et al., 2015). A spectral index map can be derived from the simulated 217 and 353 GHz maps, assuming a power-law emission model, which can then be used to extrapolate our simulated map over the observational frequency range.

Galactic free-free emission is caused by free electrons scattering off ions in an ionized gas. The simulated data set uses the 217 GHz FFP10 free-free simulation at Nside=2048subscript𝑁side2048N_{\rm side}=2048italic_N start_POSTSUBSCRIPT roman_side end_POSTSUBSCRIPT = 2048. This free-free map was generated using templates from Dickinson et al. (2003), and WMAP Maximum Entropy Method (MEM) derived free-free maps. The simulated map is extrapolated over the observed frequency range using a power-law emission model with a constant spectral index over all pixels.

Extragalactic point sources, such as Active Galactic Nuclei, are included by applying a polynomial fit to 1.4 GHz radio sources as in Battye et al. (2013), and scaling to the target observational frequencies using the method described in Olivari et al. (2018).

Polarization leakage is caused by Faraday rotation of polarized synchrotron emission in the Galactic interstellar medium (Alonso et al., 2014; Cunnington et al., 2021). Some fraction of the synchrotron Stokes Q𝑄Qitalic_Q and U𝑈Uitalic_U will leak into Stokes I𝐼Iitalic_I. The degree of Faraday rotation is frequency dependent, which induces additional structure on the synchrotron emission spectrum, complicating GP-based component separation methods that exploit the spectral smoothness of the foreground emission. Polarization leakage was simulated using the CRIME333http://intensitymapping.physics.ox.ac.uk/CRIME.html software package. This was used to produce Stokes Q𝑄Qitalic_Q emission maps over the observational frequency range, subsequently assuming a polarization leakage level of 0.5% from Stokes Q𝑄Qitalic_Q.

3.1.2 Instrument and Noise Model

In S22 a MeerKLASS-like (Santos et al., 2016) survey is simulated. The corresponding observational noise is assumed to be uncorrelated in frequency, and drawn from a Gaussian distribution with standard deviation at each frequency given by

ση(ν)=Tsys(ν)(δνttotΩpΩaNdish)1/2,subscript𝜎𝜂𝜈subscript𝑇sys𝜈superscriptsubscript𝛿𝜈subscript𝑡totsubscriptΩ𝑝subscriptΩ𝑎subscript𝑁dish12\sigma_{\eta}(\nu)=T_{\rm sys}(\nu)\left(\delta_{\nu}t_{\rm tot}\frac{\Omega_{% p}}{\Omega_{a}N_{\rm dish}}\right)^{-1/2},italic_σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( italic_ν ) = italic_T start_POSTSUBSCRIPT roman_sys end_POSTSUBSCRIPT ( italic_ν ) ( italic_δ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT roman_tot end_POSTSUBSCRIPT divide start_ARG roman_Ω start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG start_ARG roman_Ω start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT roman_dish end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT , (22)

where Tsys(ν)subscript𝑇sys𝜈T_{\rm sys}(\nu)italic_T start_POSTSUBSCRIPT roman_sys end_POSTSUBSCRIPT ( italic_ν ) is the frequency-dependent system temperature, δν=1MHzsubscript𝛿𝜈1MHz\delta_{\nu}=1\,\rm{MHz}italic_δ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT = 1 roman_MHz is the frequency resolution of the observations, ttot=1000hourssubscript𝑡tot1000hourst_{\rm tot}=1000\>\rm{hours}italic_t start_POSTSUBSCRIPT roman_tot end_POSTSUBSCRIPT = 1000 roman_hours is the total observing time, ΩpsubscriptΩ𝑝\Omega_{p}roman_Ω start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is the pixel solid angle, ΩasubscriptΩ𝑎\Omega_{a}roman_Ω start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT is the survey solid angle and Ndish=64subscript𝑁dish64N_{\rm dish}=64italic_N start_POSTSUBSCRIPT roman_dish end_POSTSUBSCRIPT = 64 is the number of dishes (identical to the MeerKAT survey). The noise root-mean-square (RMS) in the simulation ranges from 0.0320.0320.0320.032 to 0.040mK0.040mK0.040\>\rm mK0.040 roman_mK over the observed frequency range.

The total sky signal is smoothed with a constant Gaussian beam corresponding the the angular resolution of the instrument. The beam full width at half maximum (FWHM) is given by

θFWHM=1.22cνDdish,subscript𝜃FWHM1.22𝑐𝜈subscript𝐷dish\theta_{\rm FWHM}=\frac{1.22c}{\nu D_{\rm dish}},italic_θ start_POSTSUBSCRIPT roman_FWHM end_POSTSUBSCRIPT = divide start_ARG 1.22 italic_c end_ARG start_ARG italic_ν italic_D start_POSTSUBSCRIPT roman_dish end_POSTSUBSCRIPT end_ARG , (23)

where c𝑐citalic_c is the speed of light, ν𝜈\nuitalic_ν is the observed frequency, and Ddishsubscript𝐷dishD_{\rm dish}italic_D start_POSTSUBSCRIPT roman_dish end_POSTSUBSCRIPT is the telescope dish diameter. The telescope dish diameter is set to 15 m, consistent with MeerKAT and SKA1-MID dishes (Dewdney et al., 2019).

3.1.3 Cosmological Signal

The cosmological signal used here was obtained from the MultiDarkPlanck (MDPL2) N-body simulation (Klypin et al., 2016), which simulates 38403superscript384033840^{3}3840 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT dark matter particles within a cubic volume with a size of 1Gpc/h1Gpc1\>{\rm Gpc}/h1 roman_Gpc / italic_h on each side, using the cosmological parameters of Planck 2015 (Planck Collaboration et al., 2016b). The MultiDark-SAGE catalog derived from this simulation, available on the Skies & Universe444skiesanduniverses.org website, was selected for our analysis. This simulation provides redshift snapshots, each representing the state of the cosmological density field and galaxies at different redshifts. For the simulated data set used in this work, the snapshot z=0.39𝑧0.39z=0.39italic_z = 0.39 was used, assuming a redshift range of z=0.2𝑧0.2z=0.2italic_z = 0.2 to z=0.58𝑧0.58z=0.58italic_z = 0.58. Galaxies in this snapshot were organized into voxels using Nearest Grid Point (NGP) assignment.

Following Cunnington et al. (2021), the H I mass was calculated from the cold gas mass of each galaxy in the catalog and combined within each voxel to form an H I intensity map. However, this approach excludes halos lighter than 1010h1Msuperscript1010superscript1subscript𝑀direct-product10^{10}h^{-1}M_{\odot}10 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT. To compensate, the mean H I temperature was adjusted based on measurements from the GBTWiggleZ cross-correlation analysis (Masui et al., 2013) and the fitting in Villaescusa-Navarro et al. (2018). Each redshift slice in the H I intensity mapping simulation was then converted into an overtemperature field by subtracting the mean temperature, with the overtemperature field directly tracing the dark-matter overdensity field.

3.2 GP Component Separation Setup

      GP model       Model description
      abbreviation
      CP2       Complete pooling model
      with 2 kernels
      CP3       Complete pooling model
      with 3 kernels
      HGP2       Hierarchical model
      with 2 kernels
      HGP3       Hierarchical model
      with 3 kernels
      NP2       No pooling model
      with 2 kernels
      NP3       No pooling model
      with 3 kernels
Table 1: The model name abbreviations used throughout this work. Models with two kernels include an H I kernel and a single foreground kernel. Models with three kernels include an H I kernel and two foreground kernels, with one foreground kernel modeling smooth astrophysical emission, and the other foreground kernel modeling polarization leakage.

In Table 1 we list the model name abbreviations used in this work. As discussed in Section 2.1, we consider fully Bayesian GP models throughout this work. This requires us to perform sampling over the full GP hyperparameter posterior. In the case of the NP and HGP models, this is a very high-dimensional sampling problem, which is made tractable by using the No-U-Turn Sampler (NUTS) algorithm (Hoffman et al., 2014). The NUTS algorithm is an adaptive variant of the Hamiltonian Monte Carlo (HMC) algorithm. HMC exploits gradient information of the target posterior to generate efficient sampling proposals by following Hamiltonian trajectories through the target phase space and has excellent scaling to high dimensions. Our code can be found in this repo.

For each GP model, we use NUTS to sample the parameters for 3000 iterations and discard the first 1000 iterations as burn-in. Convergence is checked by analyzing the Gelman-Rubin R^^𝑅\hat{R}over^ start_ARG italic_R end_ARG statistic (Gelman & Rubin, 1992), which compares the between-chain and within-chain variance. Values of R^1^𝑅1\hat{R}\approx 1over^ start_ARG italic_R end_ARG ≈ 1 are indicative of convergence, and in this work a strict requirement of R^1.01^𝑅1.01\hat{R}\leq 1.01over^ start_ARG italic_R end_ARG ≤ 1.01 is used. In addition to this, we also monitor for divergences during sampling. Divergences occur when the value of the Hamiltonian diverges whilst generating trajectories through the target phase space. This can be indicative of regions of the posterior where the curvature is such that it cannot be resolved by the sampler, and it can imply failures in geometric ergodicity (Betancourt & Girolami, 2015; Livingstone et al., 2016). Such issues are common in fully Bayesian GP and hierarchical models. We therefore ensure that zero divergences occur during sampling. Our framework is built with numpyro555https://github.com/pyro-ppl/numpyro (Phan et al., 2019; Bingham et al., 2019), a probabilistic programming language built on JAX666https://jax.readthedocs.io. After we obtain the desired hyperparameter samples, we thin the chain to 50 samples. For each sample, we recover the expected H I signal using Equation (13). Our final estimated H I signal cube is calculated from the ensemble mean of the 50 recovered H I cubes.

For the CP model, we follow the setup described in S22. For the NP model, although the foreground GP kernel parameters should be independent for every LoS, the assumption of a global H I kernel requires the full data cube to be fit simultaneously. The memory requirements of the NP model in this case render it intractable, demanding approximately 500 GB in memory. Therefore, to reduce the memory requirements of the NP model, we split the whole data set into 64 subsets of the size (32,32,256)3232256(32,32,256)( 32 , 32 , 256 ). We run 64 NP models on each subset separately, with the H I kernel being shared across every 32pix×32pix32pix32pix32{\rm pix}\times 32{\rm pix}32 roman_p roman_i roman_x × 32 roman_p roman_i roman_x LoS in each subset. For each data subset, we have 4×103similar-toabsent4superscript103\sim 4\times 10^{3}∼ 4 × 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT hyperparameters to sample for the NP3 model, a task that is rendered tractable through the use of the NUTS algorithm. Whilst we would ideally allow the foreground GP kernel parameters to vary along every LoS in the HGP models, the memory requirements of this approach also render such a model intractable using the exact GPs we consider in this work. For the HGP models, we therefore grid the data set into a set of superpixels along the LoS axis. The foreground GP kernel parameters are assumed to be constant for every LoS within a superpixel, with the H I kernel treated as global in the entire datacube. The default size of a single superpixel is set to be 16×16161616\times 1616 × 16 normal pixels, a trade-off between the accuracy and computational cost of the model (in both memory and clock time) in our experiments. The performance of HGP models with different superpixel sizes is compared in Section 5.2.

Refer to caption
Figure 3: A visual comparison of the recovered H I fields for the different models. Here we show 2D H I Tbsubscript𝑇𝑏T_{b}italic_T start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT slices at 1019 MHz. The leftmost figure shows the true H I field, separated from the recovered fields by a dashed line. Proceeding from left to right we show the recovered fields for the CP, NP, and HGP models respectively. The top row shows the recovered fields for the three kernel models, and the bottom row shows the recovered fields for the two kernel models. The CP3 and NP3 models have visible foreground residuals in the top right of the field, but these residuals are removed by the HGP3 model. The CP2 model gives poor recovery over much of the observed field, and the NP2 model failed to converge. The HGP2 model obtains similar residuals to the CP3 model. Whilst the NP3, HGP3, and HGP2 models have visible boundary effects from the use of data subsets and superpixels, the models still achieve improved recovery compared to the CP3 model as measured through metrics such as predictive performance (for the HGP3 and HGP2 models), and the recovery of summary statistics.
Refer to caption
Figure 4: A comparison between the CP3, NP3 and HGP3 models of the recovered H I spectrum along a single LoS, where the CP3 and NP3 models show significant undercleaning of the foreground emission. The top panel shows the recovered H I spectrum for each of the models, plotted alongside the true H I spectrum, shown as a brown dashed line. The bottom panel shows the corresponding residuals with respect to the truth for each model.
Refer to caption
Figure 5: The probability density functions for the residuals of the recovered H I signal cubes for the different GP models. We show the residual distribution for the CP3 (blue dotted line), HGP3 (black solid line), HGP2 (green solid line), and NP3 (red solid line), respectively. The HGP3 and NP3 residual distributions are narrower than the CP3 residual distribution. However, this global distribution does not fully highlight the minority of regions where the CP3 and NP3 models result in significant undercleaning.
Refer to caption
Figure 6: The 1σ1𝜎1\sigma1 italic_σ uncertainty on the recovered H I signal with the HGP3 (left) and NP3 models (right) in each pixel at 1019 MHz. Both plots are shown on the same color scale. The NP3 model uncertainty is about ten times greater than the HGP3 uncertainty. This is expected behavior when comparing hierarchical and no pooling approaches, with the lack of hyper-prior regularization resulting in a noisier signal recovery for the NP3 model.

4 Results

In this section, we show the H I signal recovery results, using the various GP models discussed in Section 2.3. In Section 4.1 we discuss the recovery of the H I signal cubes in the pixel domain and the predictive performance of the models in this domain. In Section 4.2 we compare the recovery of the power spectrum and scattering transform coefficient summary statistics.

4.1 Recovered H I Image Cubes

GP model Normalized result
CP3 1.0000
HGP3 1.0027
HGP2 1.0015
NP3**Only 1% of the Pareto-k^^𝑘\hat{k}over^ start_ARG italic_k end_ARG diagnostics are good (k^0.5^𝑘0.5\hat{k}\leq 0.5over^ start_ARG italic_k end_ARG ≤ 0.5), suggesting this value is unreliable and the NP model is misspecified. 1.0146
Table 2: Leave-One-Out Cross-Validation (LOO-CV) Results: This table presents the ratio of estimated the log pointwise predictive density (elpdloosubscriptelpdloo\mathrm{elpd}_{\mathrm{loo}}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT) between different GP models and CP3 results. These results provide insights into each model’s predictive performance, with higher values indicating better predictive performance. The standard error on the elpdloosubscriptelpdloo\mathrm{elpd}_{\mathrm{loo}}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT estimates was 1%similar-toabsentpercent1\sim 1\%∼ 1 % for all the models.

We first consider our benchmark CP models. In the second column of Figure 3 we show the recovered H I brightness temperature field at a frequency of ν=1019MHz𝜈1019MHz\nu=1019\,\rm{MHz}italic_ν = 1019 roman_MHz. For the CP3 model, we can see good recovery over much of the observed field. However, there is a region of visible undercleaning in the upper right corner of the field. This is further illustrated in Figure 4, where we show the recovered H I signal along a given LoS. In this case, it can be seen that the CP3 model results in a significant excess H I signal, caused by residual foregrounds after cleaning. We also show the recovered H I field for the CP2 model. In this case, we see a poor recovery of much of the sky, which is consistent with previous analyses in S22. Given the poor performance of the CP2 model, we do not consider it further in this work.

The recovered H I brightness temperature fields are shown for the NP and HGP models in the third and fourth columns of Figure 3 respectively. For the NP3 and HGP3 models we again obtain good recovery over much of the sky, albeit with superpixel boundary artifacts due to the division of the datacube. This is particularly apparent for the NP3 model, where the H I GP kernel had to be divided over the data subsets used for inference. Whilst we still obtain improved residuals and recovery of summary statistics compared to the CP3 model, these artifacts could be reduced or removed by considering memory-efficient GP approximations that would allow us to avoid the need to use data subsets or superpixels (Solin & Särkkä, 2020; Riutort-Mayol et al., 2023). We defer the analysis of potential GP approximations to future work. The visible undercleaning seen in the upper right of the H I field for the CP3 model is still apparent with the NP3 model. In contrast, the HGP3 model eliminates this region of undercleaning. This can be seen clearly in Figure 4, where the CP3 and NP3 signals are significantly undercleaned for the relevant LoS, whilst the HGP3 signal much more closely matches the true H I signal from the simulations. Whilst both the NP3 and HGP3 models allow for spatial variation in foreground kernel parameters, the NP3 model lacks any regularization from the global hyper-prior in the HGP3 model, and does not share information between different LoS when fitting the foreground signal. The extra freedom of the NP model and lack of regularization can result in overfitting in reaction to noisy and/or anomalous data.

The HGP2 model is able to recover the H I signal better than the CP2 model, although with a degraded performance compared to the HGP3 model. The HGP2 model has more freedom than the CP2 model, by allowing foreground kernel parameters to vary spatially, whilst regularizing that variation through the global hyper-prior, resulting in a model that is more robust to misspecification in this instance. We were unable to reach convergence with the NP2 model after extensive runs with the NUTS algorithm. This is not unexpected, given that the NP model does not share information across all LoS, which can result in convergence challenges for the misspecified NP2 model. We therefore do not consider the NP2 results further in this work.

In Figure 5 we show the pixel-level residual distributions for the CP3 model, the NP3 model, and the HGP2 and HGP3 models. We find that the NP3 model has the narrowest residual distribution, followed by the HGP3 model, the CP3 model, and the HGP2 model respectively. However, the total residual distribution alone is not wholly indicative of the model performance. Indeed, it was clear that there were regions of the sky where both the NP3 and CP3 models struggled to recover the correct H I signal, whilst the HGP models could. These features are hidden in the residual distribution for all pixels in the data set. We also see that the HGP2 residual distribution is wider than that of the CP3 model, demonstrating the importance of including all relevant kernel contributions in foreground modeling. In Figure 6 we also show the recovered 1σ1𝜎1\sigma1 italic_σ uncertainty on the recovered H I signal for the HGP3 and NP3 models at a given frequency slice, estimated by evaluating the ensemble variance of the posterior predictive H I signal samples. The typical uncertainty on the recovered NP3 signal is about ten times larger than the uncertainty on the signal recovered by the HGP3 model. This is to be expected, given that the NP3 model has a larger number of free parameters that are not regularized through a global hyper-prior, resulting in a noisier signal recovery. Indeed, such behavior is typical of hierarchical models compared to no-pooling models (Gelman & Hill, 2006).

To make a more rigorous comparison of the model performance, we consider the leave-one-out cross-validation (LOO-CV) statistics (Vehtari et al., 2017). LOO-CV is a widely used method in statistical modeling and machine learning for evaluating the predictive performance of models. In addition to model performance, LOO-CV can also provide insights into the stability and robustness of a model. Given some data set y={y1,,yN}𝑦subscript𝑦1subscript𝑦𝑁y=\{y_{1},\ldots,y_{N}\}italic_y = { italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT }, and a model with parameters θ𝜃\thetaitalic_θ, LOO-CV seeks to estimate the leave-one-out log predictive density of the model,

elpdloo=i=1Nlogp(yi|yi),subscriptelpdloosuperscriptsubscript𝑖1𝑁𝑝conditionalsubscript𝑦𝑖subscript𝑦𝑖\mathrm{elpd}_{\mathrm{loo}}=\sum_{i=1}^{N}\log p(y_{i}|y_{-i}),roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_y start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT ) , (24)

where yisubscript𝑦𝑖y_{-i}italic_y start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT is the data set left after removing the datapoint yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The pointwise predictive density is given by

p(yi|yi)=p(yi|θ)p(θ|yi)dθ.𝑝conditionalsubscript𝑦𝑖subscript𝑦𝑖𝑝conditionalsubscript𝑦𝑖𝜃𝑝conditional𝜃subscript𝑦𝑖differential-d𝜃p(y_{i}|y_{-i})=\int p(y_{i}|\theta)p(\theta|y_{-i})\mathrm{d}\theta.italic_p ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_y start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT ) = ∫ italic_p ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_θ ) italic_p ( italic_θ | italic_y start_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT ) roman_d italic_θ . (25)

An estimator for elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT can be constructed using posterior samples and Pareto smoothed importance sampling (Vehtari et al., 2015). In essence, elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT evaluates the performance of the model in predicting held out data.

In Table 2 we show the elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT values for the CP3, NP3, HGP3 and HGP2 models, all normalized by the elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT value for the CP3 model. The NP3, HGP3 and HGP2 models all have higher elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT values than the CP3 model. Whilst the NP3 model has a higher value for elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT than the HGP3 and HGP2 models, this estimate should be treated with caution. Only 1%similar-toabsentpercent1\sim 1\%∼ 1 % of the Pareto-k^^𝑘\hat{k}over^ start_ARG italic_k end_ARG statistics are less than 0.5 for the NP3 model. The Pareto-k^^𝑘\hat{k}over^ start_ARG italic_k end_ARG statistic can be used as a diagnostic for the variance of the importance sampling estimator used to evaluate elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT, with values of k^<0.5^𝑘0.5\hat{k}<0.5over^ start_ARG italic_k end_ARG < 0.5 being necessary to ensure finite variance of the importance ratios and that the central limit theorem holds for estimating elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT (Vehtari et al., 2015). Therefore, the elpdloosubscriptelpdloo\rm{elpd}_{\rm loo}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT values for the NP model should be viewed as unreliable. This behavior commonly manifests as a result of strong model misspecification. This could be a result of assumptions such as the total independence of foreground emission along every LoS in the NP model. In reality, we know foreground emission will have spatial correlations. Furthermore, the division of the full data set into subsets due to the memory limitations means that the H I kernel was treated independently for each data subset, which does not conform to our expectations for the cosmological signal.

4.2 Summary Statistics Recovery

In this section, we consider the recovery of summary statistics from the foreground-cleaned H I maps. In Section 4.2.1 we study the power spectrum (PS) recovery, and in Section 4.2.2 we examine the recovery of scattering transform coefficients.

4.2.1 The 21 cm Power Spectrum

Refer to caption
Figure 7: Left: The upper panel shows the recovered spherically averaged power spectra for the different GP models at 0.23<z<0.580.23𝑧0.580.23<z<0.580.23 < italic_z < 0.58, and the relative error compared to the true PS is shown in the lower panel. We show the true H I power spectrum (brown dashed line), the result for the CP3 model (blue dotted line), the HGP3 model (black solid line), the HGP2 model (green solid line), and the NP3 model (red solid line), respectively. Middle: Recovered radial power spectra and relative error. Right: Recovered transverse power spectra and relative error. All of these power spectra are shown here without bias correction.

The 21 cm PS is perhaps the most well-studied summary statistic, capturing rich cosmological and astrophysical information (e.g., Zaldarriaga et al., 2004; Greig & Mesinger, 2017). Here we investigated the PS recovery from the foreground cleaned H I maps, showing the recovered PS in Figure 7. For the spherically averaged PS, the smallest relative error is obtained with the HGP2 model. The NP3 model achieves better recovery than the HGP3 model, which obtains results comparable to the CP3 model.

For the 1D PS along LoS, all four models attain comparable recovery on small scales, with differences only being apparent on large scales. The NP3 model achieves the lowest relative error in this case, and the CP3 and HGP3 models have comparable relative errors on large scales. The HGP2 model significantly overestimates the PS on large scales, indicating significant uncleaned foreground components in the recovered H I signal. Using one foreground kernel often fails to describe the polarization leakage effects, which have a different frequency behavior compared to the smooth astrophysical foregrounds in total intensity.

For the 2D transverse PS, all methods underestimated the power on scales k0.1h/Mpcless-than-or-similar-tosubscript𝑘perpendicular-to0.1Mpck_{\perp}\lesssim 0.1h/\rm Mpcitalic_k start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT ≲ 0.1 italic_h / roman_Mpc, indicating mismodeling of foreground spatial variations. The HGP2 model gives the best recovery on the scales k0.1h/Mpcless-than-or-similar-tosubscript𝑘perpendicular-to0.1Mpck_{\perp}\lesssim 0.1h/\rm Mpcitalic_k start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT ≲ 0.1 italic_h / roman_Mpc, indicating that it was better able to preserve Gaussian spatial fluctuations on these scales. The HGP3 and CP3 models attain comparable relative error in this case, with the NP3 model attaining a smaller relative error. This would suggest that the regularization through the hyper-prior with the HGP3 model, and the complete pooling of all foreground kernel parameters for the CP3 model, act to reduce spatial fluctuations compared to the NP3 model. On smaller scales, the HGP3 and CP3 models attain smaller relative errors compared to the HGP2 and NP3 models. However, on these small scales, the instrumental beam suppresses the PS close to zero, with the limitations of floating point accuracy making comparisons through the relative error unreliable.

Whilst the PS recovery seen here would indicate good performance for the HGP2 and NP3 models, this should be treated with some caution, given that these PS estimates have not undergone any bias correction. This issue is considered in more detail in Section 5.1. However, it is clear from the recovered 2D transverse PS that all methods fail to properly describe spatial variations in the foreground signal. Whilst the NP3 model allows for greater freedom, and hence spatial variations, this method still mismodels the true foreground emission by assuming total independence between every LoS, as discussed previously in the context of the LOO-CV results. Whilst the HGP3 model allows for spatial variations in the foreground emission kernels, it is assumed that these kernel parameters are all drawn from some global hyper-prior. This is an overly simplistic model. In reality, we would expect foreground kernel parameters to display local spatial correlations. An improved spatial prior should therefore account for these local correlations. Given that modeling local spatial correlations introduces significant additional computational challenges, we leave an exploration of such models to future work.

4.2.2 Wavelet Scattering Transform Coefficients

Refer to caption
Refer to caption
Figure 8: The first-order (S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, left) and second-order (S2subscript𝑆2S_{2}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, right) ST coefficients for the different GP models. The ST coefficients are averaged over 06060\leq\ell\leq 60 ≤ roman_ℓ ≤ 6 and evaluated at q=1𝑞1q=1italic_q = 1. We show the ST coefficients for the true field (brown crosses), the CP3 model (blue circles), the HGP3 model (black circles), the HGP2 model (green circles), and the NP3 model (red circles), respectively. The top panel shows the ST coefficients and the bottom panels show the relative error with respect to the ST coefficients for the true field.

The PS characterizes Gaussian information in the relevant field. However, the full H I field is non-Gaussian, with any non-Gaussian information beyond the PS summary. Recently, the wavelet scattering transformation (WST) has emerged as a new approach to extract information from cosmological fields (e.g. Cheng et al., 2020; Greig et al., 2023; Dai & Seljak, 2023) and is more informative than the poly-spectrum in Fourier space (Sui et al., 2023; Cheng & Ménard, 2021). To examine the recovery of non-Gaussian summary statistics, we calculate the three-dimensional solid harmonic WST coefficients (e.g. Zhao et al., 2023; Eickenberg et al., 2018) of the recovered H I signal cubes.

We briefly summarize the solid harmonic WST here. The solid harmonic WST convolves the original field 𝐝(𝐱)𝐝𝐱\mathbf{d}(\mathbf{x})bold_d ( bold_x ) with a cascade of solid harmonic wavelets defined as

ψm(𝐱)=1(2π)3e|x|2/2|𝐱|Ym(𝐱|𝐱|),superscriptsubscript𝜓𝑚𝐱1superscript2𝜋3superscript𝑒superscript𝑥22superscript𝐱superscriptsubscript𝑌𝑚𝐱𝐱\displaystyle\psi_{\ell}^{m}(\mathbf{x})=\frac{1}{(\sqrt{2\pi})^{3}}e^{-|x|^{2% }/2}|\mathbf{x}|^{\ell}Y_{\ell}^{m}\left(\frac{\mathbf{x}}{|\mathbf{x}|}\right),italic_ψ start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_x ) = divide start_ARG 1 end_ARG start_ARG ( square-root start_ARG 2 italic_π end_ARG ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT end_ARG italic_e start_POSTSUPERSCRIPT - | italic_x | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT | bold_x | start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( divide start_ARG bold_x end_ARG start_ARG | bold_x | end_ARG ) , (26)
ψj,m(𝐱)=23jψm(2j𝐱),superscriptsubscript𝜓𝑗𝑚𝐱superscript23𝑗superscriptsubscript𝜓𝑚superscript2𝑗𝐱\displaystyle\psi_{j,\ell}^{m}(\mathbf{x})=2^{-3j}\psi_{\ell}^{m}\left(2^{-j}% \mathbf{x}\right),italic_ψ start_POSTSUBSCRIPT italic_j , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_x ) = 2 start_POSTSUPERSCRIPT - 3 italic_j end_POSTSUPERSCRIPT italic_ψ start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT - italic_j end_POSTSUPERSCRIPT bold_x ) ,

where 𝐱𝐱\mathbf{x}bold_x is the spatial coordinates, Ymsuperscriptsubscript𝑌𝑚Y_{\ell}^{m}italic_Y start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is the three-dimensional spherical harmonic function and ψj,m(𝐱)superscriptsubscript𝜓𝑗𝑚𝐱\psi_{j,\ell}^{m}(\mathbf{x})italic_ψ start_POSTSUBSCRIPT italic_j , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_x ) is the convolution kernel parameterized by j,𝑗j,\ellitalic_j , roman_ℓ and m𝑚mitalic_m. Here, j𝑗jitalic_j gives the spatial scale of the kernel as 2jsuperscript2𝑗2^{j}2 start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT, whilst \ellroman_ℓ and m𝑚mitalic_m characterize the angular scale. Nonlinear moduli are applied to the convolved field as

U[j,]𝐝(𝐱)=(m=|𝐝ψj,m(𝐱)|2)1/2,𝑈𝑗𝐝𝐱superscriptsuperscriptsubscript𝑚superscript𝐝superscriptsubscript𝜓𝑗𝑚𝐱212U[j,\ell]\mathbf{d}(\mathbf{x})=\left(\sum_{m=-\ell}^{\ell}\left|\mathbf{d}*% \psi_{j,\ell}^{m}(\mathbf{x})\right|^{2}\right)^{1/2}\,,italic_U [ italic_j , roman_ℓ ] bold_d ( bold_x ) = ( ∑ start_POSTSUBSCRIPT italic_m = - roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT | bold_d ∗ italic_ψ start_POSTSUBSCRIPT italic_j , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_x ) | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT , (27)

where the field 𝐝(𝐱)𝐝𝐱\mathbf{d}(\mathbf{x})bold_d ( bold_x ) is convolved (denoted by “*”) with ψj,m(𝐱)superscriptsubscript𝜓𝑗𝑚𝐱\psi_{j,\ell}^{m}(\mathbf{x})italic_ψ start_POSTSUBSCRIPT italic_j , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_x ). The first-order solid harmonic WST coefficients (hereafter first-order ST coefficients) are obtained by integrating the field as

S1[𝐝;j,,q]=3|U[j,]𝐝(𝐱)|qd3𝐱.subscript𝑆1𝐝𝑗𝑞subscriptsuperscript3superscript𝑈𝑗𝐝𝐱𝑞superscriptd3𝐱S_{1}\left[\mathbf{d};j,\ell,q\right]=\int_{\mathbb{R}^{3}}|U[j,\ell]\mathbf{d% }(\mathbf{x})|^{q}\mathrm{d}^{3}\mathbf{x}.italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT [ bold_d ; italic_j , roman_ℓ , italic_q ] = ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | italic_U [ italic_j , roman_ℓ ] bold_d ( bold_x ) | start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT roman_d start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT bold_x . (28)

In order to capture information across multiple scales, the first-order solid harmonic WST field U[j,]𝐝(𝐱)𝑈𝑗𝐝𝐱U[j,\ell]\mathbf{d}(\mathbf{x})italic_U [ italic_j , roman_ℓ ] bold_d ( bold_x ) is convolved again with a kernel with different j>jsuperscript𝑗𝑗j^{\prime}>jitalic_j start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT > italic_j but with the same \ellroman_ℓ. The second-order ST coefficients are then given by

S2[𝐝;j,j,,q]=3|U[j,]U[j,]𝐝(𝐱)|qd3𝐱,j<j.formulae-sequencesubscript𝑆2𝐝𝑗superscript𝑗𝑞subscriptsuperscript3superscript𝑈superscript𝑗𝑈𝑗𝐝𝐱𝑞superscriptd3𝐱𝑗superscript𝑗S_{2}\left[\mathbf{d};j,j^{\prime},\ell,q\right]=\int_{\mathbb{R}^{3}}\left|U[% j^{\prime},\ell]U[j,\ell]\mathbf{d}(\mathbf{x})\right|^{q}\mathrm{d}^{3}% \mathbf{x},\quad j<j^{\prime}.italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT [ bold_d ; italic_j , italic_j start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , roman_ℓ , italic_q ] = ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | italic_U [ italic_j start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , roman_ℓ ] italic_U [ italic_j , roman_ℓ ] bold_d ( bold_x ) | start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT roman_d start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT bold_x , italic_j < italic_j start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . (29)

In our analysis, we choose jmax=6subscript𝑗6j_{\max}=6italic_j start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 6 to explore features on small to large scales, select max=6subscript6\ell_{\max}=6roman_ℓ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 6 to explore different angular directions, and set q=1𝑞1q=1italic_q = 1 in order to characterize the convolved and modulated field without any amplification. The ST coefficients are calculated using Kymatio777https://www.kymat.io/ (Andreux et al., 2018). The ST coefficients are averaged over the angular scales \ellroman_ℓ, as the averaged ST coefficients are informative (Zhao et al., 2023).

The recovered ST coefficients obtained with the different GP models are shown in Figure 8. For the first-order ST coefficients S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, the HGP2 model obtains the smallest relative error on small scales when j4𝑗4j\leq 4italic_j ≤ 4. The relative errors obtained with the NP3 and HGP3 models are worse than the HGP2 model, albeit better than those obtained with the CP3 model. For j=5𝑗5j=5italic_j = 5, the NP3 model achieves the smallest relative error, with the CP3 model having the worst relative error. For the largest scale j=6𝑗6j=6italic_j = 6, the HGP2 and NP3 models have almost identical recovery, attaining the smallest relative error, with the HGP3 model still having a smaller relative error than the CP3 model.

For the second-order ST coefficients S2subscript𝑆2S_{2}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, the results follow a similar pattern. For small-scale first-order maps with j2𝑗2j\leq 2italic_j ≤ 2, the HGP2 first-order maps are almost identical to the true-field first-order map. For j4𝑗4j\geq 4italic_j ≥ 4 the NP3 model has the smallest relative error, with the worst relative error obtained by the CP3 model for all scales.

As with the PS estimates presented in Section 4.2.1, the ST coefficients show good results for the HGP2 model and NP3 model, with the HGP3 model also having better recovery of non-Gaussian summary statistics than the CP3 model. In making complete comparisons of the model performance in recovering the full non-Gaussian field, the LOO-CV summaries in Section 4.1 should be considered with the discussion therein. However, the recovered ST coefficients do provide a useful check on the non-Gaussian information recovery. As discussed with the PS recovery, the performance of the HGP3 model would be improved by using a more realistic spatial prior, with the global hyperprior used in this work being an incomplete model of foreground spatial correlations.

5 Discussions

In this section, we consider issues related to the cosmological signal recovery along with modeling and computational choices. In Section 5.1, we study the effect of bias corrections on power spectrum recovery. In Section 5.2, we discuss the choice of superpixel size used for the HGP2 and HGP3 models. In Section 5.3, we compare the H I signal recovery when using MAP-based inference of the kernel hyperparameters compared to the fully Bayesian approach we use throughout this work.

5.1 Bias Correction

Refer to caption
Figure 9: Spherically averaged PS (left) and radial PS (right) with and without bias correction for different GP models. The uncorrected and true PS are shown with the same color type and line type as in Figure 7, while the bias-corrected PS are shown with the dash-dotted lines and with the same color type as the uncorrected results for each model. Bias correction results in overestimation of the spherically averaged PS on large scales for the CP3, HGP2 and NP3 models. The bias-corrected radial PS is overestimated on all scales for the CP3, HGP2 and NP3 models.

As discussed in Section 2, the 21 cm signal is recovered by subtracting 𝔼[ffg]𝔼delimited-[]subscript𝑓fg\mathbb{E}[f_{\rm fg}]blackboard_E [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ] from the total observed signal, whilst the foreground covariance cov[ffg]covdelimited-[]subscript𝑓fg\mathrm{cov}[f_{\rm{fg}}]roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ] is ignored. However, Mertens et al. (2020) shows that this can result in a bias in the recovery as

f21cm,estf21cm,estTdelimited-⟨⟩subscript𝑓21cmestsuperscriptsubscript𝑓21cmest𝑇\displaystyle\left<f_{\rm 21cm,\,est}f_{\rm 21cm,\,est}^{T}\right>⟨ italic_f start_POSTSUBSCRIPT 21 roman_c roman_m , roman_est end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT 21 roman_c roman_m , roman_est end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⟩ =\displaystyle== (30)
f21cm,est\displaystyle\left<f_{\rm 21cm,\,est}\right.⟨ italic_f start_POSTSUBSCRIPT 21 roman_c roman_m , roman_est end_POSTSUBSCRIPT f21cm,estTunbiasdcov[ffg].\displaystyle\left.f_{\rm 21cm,\,est}^{T}\right>_{\rm unbiasd}-{\rm cov}[f_{% \rm fg}].italic_f start_POSTSUBSCRIPT 21 roman_c roman_m , roman_est end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⟩ start_POSTSUBSCRIPT roman_unbiasd end_POSTSUBSCRIPT - roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ] .

Mertens et al. (2020) shows that to compensate, an additive bias correction for the radial PS can be performed with the Monte Carlo procedure as follows.

  • 1.

    Generate several random samples from a multivariate Gaussian distribution, with zero mean and a covariance given by cov[ffg]covdelimited-[]subscript𝑓fg{\rm cov}[f_{\rm fg}]roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ]. Each sample has the same size as our datacube.

  • 2.

    Calculate and average the radial PS over these realizations.

  • 3.

    Add the averaged PS to the residual PS obtained with GP foreground subtraction.

The procedure for the spherically averaged PS is similar, except that in the second step the calculated radial PS is binned into the k𝑘kitalic_k-bins. This ignores the ksubscript𝑘perpendicular-tok_{\perp}italic_k start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT modes when estimating the bias for the spherically averaged PS. For the modeling considered in this paper, the covariance matrix cov[ffg]covdelimited-[]subscript𝑓fg\mathrm{cov}[f_{\rm{fg}}]roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ] does not contain information regarding the foreground signal covariance in the transverse direction (S22).

In this work, we adopt the method described in Mertens et al. (2020) to examine the effects of bias correction. However, Kern & Liu (2021) highlighted the potential failure of this correction in accurately retrieving the true 21 cm PS from the EoR due to misestimation of the EoR signal covariance, and instead proposed a multiplicative correction that cannot under-predict the EoR power spectrum, but this approach requires perfect knowledge of the foreground covariance. However, in this work our methods are based on the idealized simulations used in S22, which justifies our estimation of the data covariance. We defer a detailed exploration of the corrections proposed by Kern & Liu (2021) to future work.

For the HGP and NP models, we adopted multiple covariance matrices to model the foreground emission. We should therefore calculate the bias correction for superpixels along each LoS (for HGP) or for pixels (for NP) independently. However, in practice small values of cov[ffg]covdelimited-[]subscript𝑓fg{\rm cov}[f_{\rm fg}]roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ] require better computational precision than double precision. This significantly slows the calculation down, making a full bias correction for the NP model intractable with our available computational resources. For simplicity, we calculate a global cov[ffg]covdelimited-[]subscript𝑓fg{\rm cov}[f_{\rm fg}]roman_cov [ italic_f start_POSTSUBSCRIPT roman_fg end_POSTSUBSCRIPT ] with averaged hyperparameters as an approximation for the NP3 model and leave the analysis of full corrections to future work.

The results of the PS bias correction are shown in Figure 9. On scales k0.05h/Mpcgreater-than-or-equivalent-to𝑘0.05Mpck\gtrsim 0.05\,h/\rm Mpcitalic_k ≳ 0.05 italic_h / roman_Mpc, the spherically averaged PS for all models are in agreement, and the effect of bias correction is negligible. However, the bias correction results in significant overestimation on larger scales. This correction is most prominent for the CP3 model and least significant for the HGP3 model. The impact of bias correction can be seen more directly in the radial PS, where bias correction results in overestimation of the radial PS with the HGP2, NP3, and CP3 models on all scales. This is not the case for the HGP3 model, where the bias correction is again small, and we do not observe an overestimation of the radial PS on large scales. The overcorrection through the ksubscript𝑘parallel-tok_{\parallel}italic_k start_POSTSUBSCRIPT ∥ end_POSTSUBSCRIPT modes applied to the HGP2, NP3, and CP3 models manifests itself as an apparently improved recovery of the spherically averaged PS on intermediate scales. However, in this case, the over-correction applied to the ksubscript𝑘parallel-tok_{\parallel}italic_k start_POSTSUBSCRIPT ∥ end_POSTSUBSCRIPT modes acts against the poor recovery of the ksubscript𝑘perpendicular-tok_{\perp}italic_k start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT modes for all methods on these scales. As discussed previously, further improvements in PS recovery, particularly for transverse PS, can be made by considering more physically motivated spatial priors.

Refer to caption
Figure 10: Residual distributions of the recovered H I signal cubes for the HGP3 model with different sizes of a single superpixel, as indicated in the legend. In the test, for a single superpixel that contains from 8×8888\times 88 × 8 normal pixels to 64×64646464\times 6464 × 64 normal pixels, the residual distributions are broadly the same.
Refer to caption
Figure 11: Residual distributions of the recovered H I signal cubes for the HGP3 model using MAP inference (orange solid line) and full posterior sampling (black solid line), and, for comparison, for the CP3 model (blue dotted line), respectively.

5.2 Choice of Superpixel Size

To study the impact of choosing different sizes of a single superpixel for the HGP models, we ran the HGP3 with a single superpixel that contains from 8×8888\times 88 × 8 normal pixels to 64×64646464\times 6464 × 64 normal pixels. The configuration of a superpixel with 8×8888\times 88 × 8 normal pixels corresponds to the minimum superpixel size that fits our memory constraint because a smaller superpixel size means a larger number of superpixels, each of which has a set of free GP kernel parameters. In Figure 10 we show the residual distributions obtained with various superpixel sizes. The residual distributions are broadly the same for all superpixel sizes we test here. In this work, we therefore select a superpixel size of 16×16161616\times 1616 × 16 normal pixels, a trade-off between the computational demands of the model and allowing the model to account for more detailed spatial variations in the foreground emission. In future work, it will be interesting to consider GP approximations that would allow us to remove the need for superpixels, and hence remove any superpixel boundary effects (Solin & Särkkä, 2020; Riutort-Mayol et al., 2023).

5.3 Maximum-A-Posteriori inference vs. sampling

Throughout this work we used a fully Bayesian approach, sampling the GP posterior and calculating posterior expectations. Whilst this is computationally challenging, it has been found to result in improved inference in GP models, and in Bayesian models more generally (MacKay, 2003; Betancourt, 2017; Lalchand & Rasmussen, 2020). To compare the impact of Maximum-A-Posteriori (MAP) inference with full posterior sampling in our case, we consider performing MAP optimization of the GP kernel parameters for the HGP3 model, and compare the recovered H I signal cubes with those obtained through sampling. The residual distributions for the two approaches are shown in Figure 11, alongside the residual distribution for the CP3 model obtained with full sampling. We find that using MAP optimization to obtain a point estimate of the GP kernel parameters results in a residual distribution that is even broader than the CP3 residual distribution. This is consistent with the known properties of the MAP estimators. For the complex, high-dimensional geometry of the HGP3 model, we expect that the MAP estimate of the kernel parameters is far from the typical set, where the posterior probability mass is concentrated (Betancourt, 2017).

6 Conclusions

In this work, we consider the impact of allowing for spatial variations in GP kernel parameters when modeling foreground emission in 21 cm component separation analyses. For this purpose, we consider a range of natural modeling variations that allow for different levels of spatial variation. As our baseline, we consider the CP approach adopted in previous analyses, where the foreground GP kernel parameters are assumed to be identical for every LoS. At the next level of spatial variation, we consider the HGP models, where foreground kernel parameters were allowed to vary over a set of superpixels, each with a size of 16×16161616\times 1616 × 16 normal pixels, with the kernel parameters in turn being regularized through a global hyper-prior. Finally, we consider the NP models, where foreground kernel parameters are assumed to be completely independent along every LoS, with no regularization from any hyper-prior. In all cases, we consider models with two GP kernels, abbreviated as CP2, HGP2 and NP2, and models with three GP kernels, abbreviated as CP3, HGP3 and NP3, respectively. The two-kernel models assume one GP kernel for the H I emission and one GP kernel for the foreground emission. The three-kernel models have an additional foreground kernel to model polarization leakage.

We test the performance of these models against the simulated MeerKLASS-like observations, with a focus on the recovery of the H I signal cubes. At the level of pixels, we find that both HGP3 and NP3 models have smaller residual distributions than the CP3 model, e.g. the standard deviation of the NP3 residual distribution is approximately 30%percent3030\%30 % smaller than for the CP3 residuals. The CP2 model fails to accurately recover the H I signal over the sky, and the NP2 model fails to converge. In these cases, the misspecification of the two-kernel models leads to failures in foreground modeling. The HGP2 model can converge, with a comparable residual distribution to the CP3 model. In this case, the hierarchical model is robust to model misspecification.

Whilst the NP3 model has the sharpest residual distribution, there are still areas of the sky where the foreground spectrum is incorrectly modeled, similar to the CP3 model. These areas of poor signal recovery are eliminated by the HGP3 model. A more rigorous model comparison is made by performing a LOO-CV analysis and calculating the expected log predictive density of held-out data, elpdloosubscriptelpdloo\mathrm{elpd}_{\mathrm{loo}}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT. The NP3 model has the largest value of elpdloosubscriptelpdloo\mathrm{elpd}_{\mathrm{loo}}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT, followed by the HGP3, HGP2, and CP3 models. Whilst this seems to indicate that the NP3 model has the best predictive performance, the Pareto-k^^𝑘\hat{k}over^ start_ARG italic_k end_ARG statistics indicate that the importance weighting procedure used to estimate elpdloosubscriptelpdloo\mathrm{elpd}_{\mathrm{loo}}roman_elpd start_POSTSUBSCRIPT roman_loo end_POSTSUBSCRIPT for the NP3 model is unreliable. This is often caused by serious model misspecification. In the case of the NP3 model, the assumption of complete independence of foreground kernel parameters along every LoS is unrealistic. Moreover, memory constraint means that the full datacube has to be divided into data subsets, with the H I kernel treated independently in each subset. The lack of regularization from a hyper-prior also means that the NP3 model is prone to overfitting. The Pareto-k^^𝑘\hat{k}over^ start_ARG italic_k end_ARG statistics for the remaining models are good, with the best predictive performance obtained with the HGP3 model.

In addition to the LOO-CV analysis, we also examine the recovery of summary statistics for the H I field. For the PS recovery, the results for all methods agree on small scales (k0.1h/Mpcgreater-than-or-equivalent-to𝑘0.1Mpck\gtrsim 0.1\,h/\rm{Mpc}italic_k ≳ 0.1 italic_h / roman_Mpc) for spherically averaged PS and radial PS, with percentage level accuracy. At large scales, the relative error increases to about 𝒪(10%)𝒪percent10\mathcal{O}(10\%)caligraphic_O ( 10 % ). We also investigate the effect of bias correction applied through the ksubscript𝑘parallel-tok_{\parallel}italic_k start_POSTSUBSCRIPT ∥ end_POSTSUBSCRIPT modes. Its impact is apparent in the bias-corrected radial PS, where all models except the HGP3 model overestimate the radial PS on all scales. The transverse PS is underestimated by all methods on scales k0.1h/Mpcless-than-or-similar-tosubscript𝑘perpendicular-to0.1Mpck_{\perp}\lesssim 0.1\,h/\rm{Mpc}italic_k start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT ≲ 0.1 italic_h / roman_Mpc, which is caused by simplified modeling of foreground kernel spatial variations. The NP3, HGP3, and HGP2 models all attain better recovery of ST coefficients compared to the CP3 model, which indicates better recovery of non-Gaussian features in the H I field.

All these results demonstrate the importance of accounting for spatial variations in foreground emission. Indeed, the HGP2 model outperforms the CP3 model in predictive performance and recovery of H I field summary statistics, despite the kernel model being misspecified. However, both NP and HGP models use highly simplified models for these spatial variations, assuming either complete independence or variation through a global hyper-prior. Improving the signal recovery requires more physically motivated spatial priors that can account for local correlations in the foreground emission. This will enable improved recovery of the transverse PS and ST coefficients. However, improved spatial modeling will likely result in significant computational demands in both memory and clock time. To address this, it will be interesting to consider memory-efficient alternatives to exact GPs (Solin & Särkkä, 2020; Riutort-Mayol et al., 2023), and also more efficient inference algorithms to integrate over the high-dimensional target distribution (Robnik et al., 2022). In addition to using better motivated spatial priors, performance improvements can also be obtained through improved foreground frequency covariance modeling (Mertens et al., 2024). Alongside these modeling considerations, in future work we will test our method against more realistic mock observations e.g., considering improved noise modeling and complications from radio frequency interference.

Acknowledgments

This work is supported by the National SKA Program of China (grant No. 2020SKA0110401) and NSFC (grant No. 12250410240 and 11821303). We thank Zhaoting Chen, Meng Zhou, Ce Sui, Siyi Zhao and Yidong Xu for their useful discussions and help. We acknowledge the Tsinghua Astrophysics High-Performance Computing platform at Tsinghua University for providing computational and data storage resources that have contributed to the research results reported within this paper.

References

  • Alonso et al. (2015) Alonso, D., Bull, P., Ferreira, P. G., & Santos, M. G. 2015, MNRAS, 447, 400, doi: 10.1093/mnras/stu2474
  • Alonso et al. (2014) Alonso, D., Ferreira, P. G., & Santos, M. G. 2014, MNRAS, 444, 3183, doi: 10.1093/mnras/stu1666
  • Andreux et al. (2018) Andreux, M., Angles, T., Exarchakis, G., et al. 2018, arXiv e-prints, arXiv:1812.11214, doi: 10.48550/arXiv.1812.11214
  • Battye et al. (2013) Battye, R. A., Browne, I. W. A., Dickinson, C., et al. 2013, MNRAS, 434, 1239, doi: 10.1093/mnras/stt1082
  • Battye et al. (2004) Battye, R. A., Davies, R. D., & Weller, J. 2004, MNRAS, 355, 1339, doi: 10.1111/j.1365-2966.2004.08416.x
  • Bernardi et al. (2009) Bernardi, G., de Bruyn, A. G., Brentjens, M. A., et al. 2009, A&A, 500, 965, doi: 10.1051/0004-6361/200911627
  • Bernardi et al. (2010) Bernardi, G., de Bruyn, A. G., Harker, G., et al. 2010, A&A, 522, A67, doi: 10.1051/0004-6361/200913420
  • Betancourt (2017) Betancourt, M. 2017, arXiv preprint arXiv:1701.02434
  • Betancourt & Girolami (2015) Betancourt, M., & Girolami, M. 2015, Current trends in Bayesian methodology with applications, 79, 2
  • Bharadwaj et al. (2001) Bharadwaj, S., Nath, B. B., & Sethi, S. K. 2001, Journal of Astrophysics and Astronomy, 22, 21, doi: 10.1007/BF02933588
  • Bigot-Sazy et al. (2015) Bigot-Sazy, M. A., Dickinson, C., Battye, R. A., et al. 2015, MNRAS, 454, 3240, doi: 10.1093/mnras/stv2153
  • Bingham et al. (2019) Bingham, E., Chen, J. P., Jankowiak, M., et al. 2019, J. Mach. Learn. Res., 20, 28:1. http://jmlr.org/papers/v20/18-403.html
  • Bradbury et al. (2018) Bradbury, J., Frostig, R., Hawkins, P., et al. 2018, JAX: composable transformations of Python+NumPy programs, 0.3.13. http://github.com/google/jax
  • Chang et al. (2008) Chang, T.-C., Pen, U.-L., Peterson, J. B., & McDonald, P. 2008, Phys. Rev. Lett., 100, 091303, doi: 10.1103/PhysRevLett.100.091303
  • Cheng & Ménard (2021) Cheng, S., & Ménard, B. 2021, arXiv e-prints, arXiv:2112.01288, doi: 10.48550/arXiv.2112.01288
  • Cheng et al. (2020) Cheng, S., Ting, Y.-S., Ménard, B., & Bruna, J. 2020, MNRAS, 499, 5902, doi: 10.1093/mnras/staa3165
  • Chluba et al. (2017) Chluba, J., Hill, J. C., & Abitbol, M. H. 2017, MNRAS, 472, 1195, doi: 10.1093/mnras/stx1982
  • Cunnington et al. (2021) Cunnington, S., Irfan, M. O., Carucci, I. P., Pourtsidou, A., & Bobin, J. 2021, MNRAS, 504, 208, doi: 10.1093/mnras/stab856
  • Dai & Seljak (2023) Dai, B., & Seljak, U. 2023, arXiv e-prints, arXiv:2306.04689, doi: 10.48550/arXiv.2306.04689
  • Dewdney et al. (2019) Dewdney, P., et al. 2019, SKA1 design baseline description, Tech. rep., SKA-TEL-SKO-0001075, internal SKA document
  • Di Matteo et al. (2002) Di Matteo, T., Perna, R., Abel, T., & Rees, M. J. 2002, ApJ, 564, 576, doi: 10.1086/324293
  • Dickinson et al. (2003) Dickinson, C., Davies, R. D., & Davis, R. J. 2003, MNRAS, 341, 369, doi: 10.1046/j.1365-8711.2003.06439.x
  • Eickenberg et al. (2018) Eickenberg, M., Exarchakis, G., Hirn, M., Mallat, S., & Thiry, L. 2018, J. Chem. Phys., 148, 241732, doi: 10.1063/1.5023798
  • Eriksen et al. (2008) Eriksen, H. K., Jewell, J. B., Dickinson, C., et al. 2008, ApJ, 676, 10, doi: 10.1086/525277
  • Flaxman et al. (2015) Flaxman, S., Gelman, A., Neill, D., et al. 2015, Manuscript in preparation
  • Furlanetto (2016) Furlanetto, S. R. 2016, The 21-cm Line as a Probe of Reionization, ed. A. Mesinger (Cham: Springer International Publishing), 247–280, doi: 10.1007/978-3-319-21957-8_9
  • Gabry et al. (2019) Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., & Gelman, A. 2019, Journal of the Royal Statistical Society Series A: Statistics in Society, 182, 389
  • Gehlot et al. (2019) Gehlot, B. K., Mertens, F. G., Koopmans, L. V. E., et al. 2019, MNRAS, 488, 4271, doi: 10.1093/mnras/stz1937
  • Gelman (2006) Gelman, A. 2006, Technometrics, 48, 432, doi: 10.1198/004017005000000661
  • Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., et al. 2013, Bayesian data analysis, 3rd edn., Chapman & Hall/CRC Texts in Statistical Science Series (Boca Raton, Florida: CRC). https://www.worldcat.org/title/bayesian-data-analysis/oclc/966614951?referer=br&ht=edition
  • Gelman & Hennig (2017) Gelman, A., & Hennig, C. 2017, Journal of the Royal Statistical Society Series A: Statistics in Society, 180, 967
  • Gelman & Hill (2006) Gelman, A., & Hill, J. 2006, Data analysis using regression and multilevel/hierarchical models (Cambridge university press)
  • Gelman & Rubin (1992) Gelman, A., & Rubin, D. B. 1992, Statistical Science, 7, 457, doi: 10.1214/ss/1177011136
  • Gelman et al. (2017) Gelman, A., Simpson, D., & Betancourt, M. 2017, Entropy, 19, 555, doi: 10.3390/e19100555
  • Ghosh et al. (2020) Ghosh, A., Mertens, F., Bernardi, G., et al. 2020, MNRAS, 495, 2813, doi: 10.1093/mnras/staa1331
  • Górski et al. (2005) Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759, doi: 10.1086/427976
  • Greig & Mesinger (2017) Greig, B., & Mesinger, A. 2017, MNRAS, 472, 2651, doi: 10.1093/mnras/stx2118
  • Greig et al. (2023) Greig, B., Ting, Y.-S., & Kaurov, A. A. 2023, MNRAS, 519, 5288, doi: 10.1093/mnras/stac3822
  • Grumitt et al. (2020) Grumitt, R. D. P., Jew, L. R. P., & Dickinson, C. 2020, MNRAS, 496, 4383, doi: 10.1093/mnras/staa1857
  • Haslam et al. (1981) Haslam, C. G. T., Klein, U., Salter, C. J., et al. 1981, A&A, 100, 209
  • Haslam et al. (1982) Haslam, C. G. T., Salter, C. J., Stoffel, H., & Wilson, W. E. 1982, A&AS, 47, 1
  • Hibbard et al. (2023) Hibbard, J. J., Rapetti, D., Burns, J. O., Mahesh, N., & Bassett, N. 2023, arXiv e-prints, arXiv:2304.09959, doi: 10.48550/arXiv.2304.09959
  • Hoffman et al. (2019) Hoffman, M., Sountsov, P., Dillon, J. V., et al. 2019, arXiv preprint arXiv:1903.03704
  • Hoffman et al. (2014) Hoffman, M. D., Gelman, A., et al. 2014, J. Mach. Learn. Res., 15, 1593
  • Hothi et al. (2021) Hothi, I., Chapman, E., Pritchard, J. R., et al. 2021, MNRAS, 500, 2264, doi: 10.1093/mnras/staa3446
  • Jelić et al. (2008) Jelić, V., Zaroubi, S., Labropoulos, P., et al. 2008, MNRAS, 389, 1319, doi: 10.1111/j.1365-2966.2008.13634.x
  • Jew & Grumitt (2020) Jew, L., & Grumitt, R. D. P. 2020, MNRAS, 495, 578, doi: 10.1093/mnras/staa1233
  • Kern & Liu (2021) Kern, N. S., & Liu, A. 2021, MNRAS, 501, 1463, doi: 10.1093/mnras/staa3736
  • Klypin et al. (2016) Klypin, A., Yepes, G., Gottlöber, S., Prada, F., & Heß, S. 2016, MNRAS, 457, 4340
  • Kumar et al. (2019) Kumar, R., Carroll, C., Hartikainen, A., & Martin, O. 2019, Journal of Open Source Software, 4, 1143, doi: 10.21105/joss.01143
  • Lalchand & Rasmussen (2020) Lalchand, V., & Rasmussen, C. E. 2020, in Symposium on Advances in Approximate Bayesian Inference, PMLR, 1–12
  • Liu & Shaw (2020) Liu, A., & Shaw, J. R. 2020, PASP, 132, 062001, doi: 10.1088/1538-3873/ab5bfd
  • Liu & Tegmark (2011) Liu, A., & Tegmark, M. 2011, Phys. Rev. D, 83, 103006, doi: 10.1103/PhysRevD.83.103006
  • Liu & Tegmark (2012) —. 2012, MNRAS, 419, 3491, doi: 10.1111/j.1365-2966.2011.19989.x
  • Livingstone et al. (2016) Livingstone, S., Betancourt, M., Byrne, S., & Girolami, M. 2016, Bernoulli, 25, doi: 10.3150/18-BEJ1083
  • MacKay (2003) MacKay, D. 2003, Information Theory, Inference and Learning Algorithms (Cambridge University Press). https://books.google.co.uk/books?id=AKuMj4PN_EMC
  • Masui et al. (2013) Masui, K. W., Switzer, E. R., Banavar, N., et al. 2013, ApJ, 763, L20, doi: 10.1088/2041-8205/763/1/L20
  • Mertens et al. (2024) Mertens, F. G., Bobin, J., & Carucci, I. P. 2024, MNRAS, 527, 3517, doi: 10.1093/mnras/stad3430
  • Mertens et al. (2018) Mertens, F. G., Ghosh, A., & Koopmans, L. V. E. 2018, MNRAS, 478, 3640, doi: 10.1093/mnras/sty1207
  • Mertens et al. (2020) Mertens, F. G., Mevius, M., Koopmans, L. V. E., et al. 2020, MNRAS, 493, 1662, doi: 10.1093/mnras/staa327
  • Offringa et al. (2019) Offringa, A. R., Mertens, F., & Koopmans, L. V. E. 2019, MNRAS, 484, 2866, doi: 10.1093/mnras/stz175
  • Oh & Mack (2003) Oh, S. P., & Mack, K. J. 2003, MNRAS, 346, 871, doi: 10.1111/j.1365-2966.2003.07133.x
  • Olivari et al. (2018) Olivari, L. C., Dickinson, C., Battye, R. A., et al. 2018, MNRAS, 473, 4242, doi: 10.1093/mnras/stx2621
  • Phan et al. (2019) Phan, D., Pradhan, N., & Jankowiak, M. 2019, arXiv preprint arXiv:1912.11554
  • Planck Collaboration et al. (2016a) Planck Collaboration, Adam, R., Ade, P. A. R., et al. 2016a, A&A, 594, A10, doi: 10.1051/0004-6361/201525967
  • Planck Collaboration et al. (2016b) Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 2016b, A&A, 594, A13, doi: 10.1051/0004-6361/201525830
  • Pritchard & Loeb (2012) Pritchard, J. R., & Loeb, A. 2012, Reports on Progress in Physics, 75, 086901, doi: 10.1088/0034-4885/75/8/086901
  • Rasmussen & Williams (2006) Rasmussen, C. E., & Williams, C. K. I. 2006, Gaussian processes for machine learning., Adaptive computation and machine learning (MIT Press), I–XVIII, 1–248
  • Remazeilles et al. (2015) Remazeilles, M., Dickinson, C., Banday, A. J., Bigot-Sazy, M. A., & Ghosh, T. 2015, MNRAS, 451, 4311, doi: 10.1093/mnras/stv1274
  • Remazeilles et al. (2016) Remazeilles, M., Dickinson, C., Eriksen, H. K. K., & Wehus, I. K. 2016, MNRAS, 458, 2032, doi: 10.1093/mnras/stw441
  • Remazeilles et al. (2018) Remazeilles, M., Banday, A. J., Baccigalupi, C., et al. 2018, J. Cosmology Astropart. Phys, 2018, 023, doi: 10.1088/1475-7516/2018/04/023
  • Riutort-Mayol et al. (2023) Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A., & Vehtari, A. 2023, Statistics and Computing, 33, 17
  • Robnik et al. (2022) Robnik, J., De Luca, G. B., Silverstein, E., & Seljak, U. 2022, arXiv e-prints, arXiv:2212.08549, doi: 10.48550/arXiv.2212.08549
  • Ruiz-Zapatero et al. (2022) Ruiz-Zapatero, J., García-García, C., Alonso, D., Ferreira, P. G., & Grumitt, R. D. P. 2022, MNRAS, 512, 1967, doi: 10.1093/mnras/stac431
  • Santos et al. (2016) Santos, M., Bull, P., Camera, S., et al. 2016, in MeerKAT Science: On the Pathway to the SKA, 32, doi: 10.22323/1.277.0032
  • Shafieloo et al. (2012) Shafieloo, A., Kim, A. G., & Linder, E. V. 2012, Phys. Rev. D, 85, 123530, doi: 10.1103/PhysRevD.85.123530
  • Shaw et al. (2015) Shaw, J. R., Sigurdson, K., Sitwell, M., Stebbins, A., & Pen, U.-L. 2015, Phys. Rev. D, 91, 083514, doi: 10.1103/PhysRevD.91.083514
  • Soares et al. (2022) Soares, P. S., Watkinson, C. A., Cunnington, S., & Pourtsidou, A. 2022, MNRAS, 510, 5872, doi: 10.1093/mnras/stab2594
  • Solin & Särkkä (2020) Solin, A., & Särkkä, S. 2020, Statistics and Computing, 30, 419
  • Stan Development Team (2012) Stan Development Team. 2012, Stan Modeling Language User’s Guide and Reference Manual, Version 1.0. http://mc-stan.org/
  • Sui et al. (2023) Sui, C., Zhao, X., Jing, T., & Mao, Y. 2023, arXiv e-prints, arXiv:2307.04994, doi: 10.48550/arXiv.2307.04994
  • Vehtari et al. (2017) Vehtari, A., Gelman, A., & Gabry, J. 2017, Statistics and computing, 27, 1413
  • Vehtari et al. (2015) Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. 2015, arXiv e-prints, arXiv:1507.02646, doi: 10.48550/arXiv.1507.02646
  • Villaescusa-Navarro et al. (2018) Villaescusa-Navarro, F., Genel, S., Castorina, E., et al. 2018, ApJ, 866, 135, doi: 10.3847/1538-4357/aadba0
  • Wang et al. (2006) Wang, X., Tegmark, M., Santos, M. G., & Knox, L. 2006, ApJ, 650, 529, doi: 10.1086/506597
  • Wolz et al. (2014) Wolz, L., Abdalla, F. B., Blake, C., et al. 2014, MNRAS, 441, 3271, doi: 10.1093/mnras/stu792
  • Zaldarriaga et al. (2004) Zaldarriaga, M., Furlanetto, S. R., & Hernquist, L. 2004, ApJ, 608, 622, doi: 10.1086/386327
  • Zhao et al. (2023) Zhao, X., Mao, Y., Zuo, S., & Wandelt, B. D. 2023, arXiv e-prints, arXiv:2310.17602, doi: 10.48550/arXiv.2310.17602