Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
11institutetext: 1School of Information Science and Engineering, Yunnan University, Kunming, China
2College of Nursing Health Sciences, Yunnan Open University, Kunming, China
3Shenzhen Research Institute of Big Data, Shenzhen, China
Correspondence author: minwenwen@ynu.edu.cn

stEnTrans: Transformer-based deep learning for spatial transcriptomics enhancement

Shuailin Xue1    Fangfang Zhu2    Changmiao Wang3    Wenwen Min1∗\orcidlink0000-0002-2558-2911
Abstract

The spatial location of cells within tissues and organs is crucial for the manifestation of their specific functions.Spatial transcriptomics technology enables comprehensive measurement of the gene expression patterns in tissues while retaining spatial information. However, current popular spatial transcriptomics techniques either have shallow sequencing depth or low resolution. We present stEnTrans, a deep learning method based on Transformer architecture that provides comprehensive predictions for gene expression in unmeasured areas or unexpectedly lost areas and enhances gene expression in original and inputed spots. Utilizing a self-supervised learning approach, stEnTrans establishes proxy tasks on gene expression profile without requiring additional data, mining intrinsic features of the tissues as supervisory information. We evaluate stEnTrans on six datasets and the results indicate superior performance in enhancing spots resolution and predicting gene expression in unmeasured areas compared to other deep learning and traditional interpolation methods. Additionally, Our method also can help the discovery of spatial patterns in Spatial Transcriptomics and enrich to more biologically significant pathways. Our source code is available at https://github.com/shuailinxue/stEnTrans.

Keywords:
Spatial transcriptomics Deep learning Self-supervised learning Transformer Imputation Prediction Enhancement

1 Introduction

Traditional gene expression researches typically focus only on the overall gene transcript counts in entire tissues, lacking the preservation of spatial positioning information within the tissue. The spatial position of cells within tissues can have a significant impact on their functions. Therefore, neglecting the spatial information of cells may lead to an insufficient understanding of gene functions and regulatory mechanisms. Spatial transcriptomics (ST) technology aims to detect the quantity of gene transcripts within tissues while preserving spatial location information [21]. ST technology provides the capability to examine in detail the spatial distribution of gene expression at the tissue or cellular level. This allows researchers to gain a more accurate understanding of the gene expression patterns in different regions [25, 20], delving deeper into the exploration of cellular heterogeneity and interactions between neighboring cells. In light of the substantial advantages offered by ST technoloies, it have been used in various field of biology, such as tumor heterogeneity [14], embryonic development [19] and Neuroanatomy [11].

Existing ST technologies mainly fall into two categories: (1)imaging-based approaches,directly observing and quantitatively analyzing mRNA or protein expression in tissues without the need for prior RNA sequencing, such as STARmap [28]. (2)next-generation sequencing (NGS)-based approaches [3],capturing RNA in tissues and sequencing it to obtain a global map of gene expression. However, both approaches have their respective drawbacks.Imaging-based approaches offer higher spatial resolution, suitable for observing cellular and subcellular details. But they may have limitations in detecting the number of genes. NGS-based approaches provide a comprehensive understanding of global gene expression in tissues and can simultaneously detect a large number of genes, suitable for comprehensive bioinformatics analysis [15]. But they face two major challenges [17]:1.lower spatial resolution. Such as 10X Visium [22], a spot with a diameter of 55 μ𝜇\muitalic_μm may contain 1 to 30 cells. In ST [26], spot is 100 μ𝜇\muitalic_μm in diameter and may contain hundreds of cells. This makes it challenging to study individual differences between cells, such as cell subtypes, mutations, and expression heterogeneity. 2.Gaps Between Spots. The center-to-center distance between spots in ST is 200 μ𝜇\muitalic_μm, and in Visium, it is 100 μ𝜇\muitalic_μm. Clearly, there are significant uncovered spatial between spots, leaving large areas in the tissue unmeasured. This can affect researchers’ comprehensive understanding of the entire cell population, hindering a full representation of the true state of the tissue, especially in scenarios where studying cell heterogeneity and local differences is crucial.

Here, we propose stEnTrans, a deep learning method using self-supervised learning [13] that enhances the resolution of gene expression through prediction in unmeasured areas between spots to obtain a high-resolution and high quality gene expression profiles. stEnTrans requires no additional data, such as histology images, relying solely on spatial gene expression data as input.It comprehensively integrates the relevant information across different positions within genes and the correlations among different genes within the tissue. We applied stEnTrans to six datasets, including four real datasets sourced from the ST and 10X Visium platforms and two standard array datasets were simulated from STARmap and Stereo-seq data using coordinate mapping. The experimental results demonstrate the superiority of stEnTrans compared to other methods in terms of prediction accuracy and clarity of gene expression profiles, while also manifesting more biologically meaningful pathways.

2 Method

Refer to caption
Figure 1: The network architecture of stEnTrans. (A) The schematic diagrams of down-sampling and imputation on honeycomb-based (left) and matrix-based (right) data. Down-sample: From left to right and top to bottom, alternately extract every spots and Extracted spots are non-adjacent to each other in the original gene expression profiles. Impute: Interpolate between adjacent spots. There are three types of adjacency: left-right, top-bottom and diagonal adjacency. (B) The self-supervised learning process of a network architecture. Pretraining the stEnTrans using downsampled gene expression as input and original gene pression as labels. After training the stEnTrans, inputting the original gene expression to obtain the high-resolution gene expression. (C) Details of network architecture. The trainable components mainly consist of Transformer Encoder modules.

Our model undergoes two distinct phases: Pretrain and Enhance (Fig. 1). During the Pretrain phase, We utilize down-sampling on the original gene expression profiles to generate low-resolution (LR) gene expression profiles, employing it as inputs, with the original gene expression profiles serving as the labels, and the outputs possesses the same resolution as the original gene expression profiles. After Pretrain phase, proceed to the Enhance phase, Using the original spatial gene expression profiles as inputs, high-resolution (HR) data is obtained. Our method is primarily designed for standard array data, focusing on ST and 10X Visium platforms. Additionally, for other platforms like STARmap and Stereo-seq, we can simulate them as array data through spatial coordinate mapping to achieve the purpose of data enhancement.

2.1 Data Pre-processing

stEnTrans only requires a gene expression matrix with spatial coordinates. We define the ST data as two matrices: (1) gene expression matrix,Xm×nsubscript𝑋𝑚𝑛\mathit{X}_{m\times n}italic_X start_POSTSUBSCRIPT italic_m × italic_n end_POSTSUBSCRIPT, which contains m𝑚\mathit{m}italic_m spots and n𝑛\mathit{n}italic_n genes.The value of Xijsubscript𝑋𝑖𝑗\mathit{X}_{ij}italic_X start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT represents the expression level of j𝑗\mathit{j}italic_j-th gene in the i𝑖\mathit{i}italic_i-th spot. (2) spatial coordinate matrix, Cm×2subscript𝐶𝑚2\mathit{C}_{m\times 2}italic_C start_POSTSUBSCRIPT italic_m × 2 end_POSTSUBSCRIPT. Its i𝑖\mathit{i}italic_i-th row corresponds to the same spot as the i𝑖\mathit{i}italic_i-th row of matrix Xm×nsubscript𝑋𝑚𝑛\mathit{X}_{m\times n}italic_X start_POSTSUBSCRIPT italic_m × italic_n end_POSTSUBSCRIPT. The values of Ci0subscript𝐶𝑖0\mathit{C}_{i0}italic_C start_POSTSUBSCRIPT italic_i 0 end_POSTSUBSCRIPT and Ci1subscript𝐶𝑖1\mathit{C}_{i1}italic_C start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT constitute the coordinate information of the i𝑖\mathit{i}italic_i-th spot in two-dimensional space. We treat the expression of each gene in space as a sample.

2.1.1 Data source

All datasets analyzed in this paper are existing and publicly available. The human melanoma ST data (HM) can be found at https://www.spatialresearch.org/resources-published-datasets/doi-10-1158-0008-5472-can-18-0747 [26]. The STARmap mouse placenta (MP) can be found at https://codeocean.com/capsule/9820099/tree/v1 [9]. The Stereo-seq data from the adult mouse hemi-brain (AMHB) can be found https://db.cngb.org/stomics/mosta/ [5]. The human breast cancer spatial transcriptomics data (HBC) is available from the https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast [12]. The Human Invasive ductal carcinoma spatial transcriptomics data (IDC) is available from the https://support.10xgenomics.com/spatialgene-expression/datasets/1.2.0/V1_Human_Invasive_Ductal_Carcinoma. The mouse brain sagittal posterior data (MBSP) is available from the https://www.10xgenomics.com/datasets/mouse-brain-serial-section-1-sagittal-posterior-1-standard-1-1-0.

2.1.2 Pre-processing for gene expression matrix

The j𝑗\mathit{j}italic_j-th column of X𝑋\mathit{X}italic_X represents the expression level of the j𝑗\mathit{j}italic_j-th gene in the tissue. We denote this m-dimensional vector as Xjsuperscript𝑋𝑗\mathit{X}^{j}italic_X start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT. Based on spatial coordinate information of all spots, Xjsuperscript𝑋𝑗\mathit{X}^{j}italic_X start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT can be mapped onto a plane and transformed into a matrix of shape (u𝑢\mathit{u}italic_u, v𝑣\mathit{v}italic_v). We denote this matrix as Gjsuperscript𝐺𝑗\mathit{G}^{j}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT, which can be considered as a gene expression profile. Specifically, the value at the risubscript𝑟𝑖\mathit{r}_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT-th row and cjsubscript𝑐𝑗\mathit{c}_{j}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT-th column of Gjsuperscript𝐺𝑗\mathit{G}^{j}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT represents the expression level of the j𝑗\mathit{j}italic_j-th gene in the spot with coordinates (risubscript𝑟𝑖\mathit{r}_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, cjsubscript𝑐𝑗\mathit{c}_{j}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT).In addition, we need whether-in-tissue matrix M𝑀\mathit{M}italic_M, which has the same shape as Gjsuperscript𝐺𝑗\mathit{G}^{j}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT, to indicate the presence or absence of spots. All its elements are either 0 or 1. For example, if Mricjsubscript𝑀subscript𝑟𝑖subscript𝑐𝑗\mathit{M}_{\mathit{r}_{i}\mathit{c}_{j}}italic_M start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 0, it indicates that there is not a spot with spatial coordinates (risubscript𝑟𝑖\mathit{r}_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, cjsubscript𝑐𝑗\mathit{c}_{j}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT); if Mricjsubscript𝑀subscript𝑟𝑖subscript𝑐𝑗\mathit{M}_{\mathit{r}_{i}\mathit{c}_{j}}italic_M start_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1, there is a spot.We use function R(·) to transform gene expression matrix with spatial coordinates into profile:

Gj=R(Xj,C),j=0,,n1,formulae-sequencesuperscript𝐺𝑗𝑅superscript𝑋𝑗𝐶𝑗0𝑛1G^{j}=R(X^{j},C),j=0,\cdots,n-1,italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = italic_R ( italic_X start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_C ) , italic_j = 0 , ⋯ , italic_n - 1 , (1)

where all profiles are denoted as the profile set G={G0,,Gn1}𝐺superscript𝐺0superscript𝐺𝑛1\mathit{G}=\left\{G^{0},\cdots,G^{n-1}\right\}italic_G = { italic_G start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , ⋯ , italic_G start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT }.

We use function T(·) to obtain in whether-in-tissue matrix from spatial coordinate matrix:

M=T(C).𝑀𝑇𝐶M=T(C).italic_M = italic_T ( italic_C ) . (2)

where M𝑀Mitalic_M is the whether-in-tissue matrix corresponding to Gjsuperscript𝐺𝑗G^{j}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT.

Due to irregular tissue shapes, there may be spots outside the tissue in the gene expression maps. Additionally, due to limitations in current technology, some spots within the tissue may be unintentionally lost. We uniformly fill these occurrences with zeros in Gjsuperscript𝐺𝑗G^{j}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT.

2.1.3 Data downsampling

In the Pretrain phase, we need to create LR gene expression profiles through down-sampling (Fig. 1) after obtaining the original gene expression profiles. We assume that Gjsuperscript𝐺𝑗G^{j}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT has u𝑢\mathit{u}italic_u rows and v𝑣\mathit{v}italic_v columns. This process can be described as:

Glrj(ri,cj)=Gj(2ri,2cj),ri=0,1,,u21,cj=0,1,,v21,j=0,1,,n1,formulae-sequencesubscriptsuperscript𝐺𝑗𝑙𝑟subscript𝑟𝑖subscript𝑐𝑗superscript𝐺𝑗2subscript𝑟𝑖2subscript𝑐𝑗formulae-sequencesubscript𝑟𝑖01𝑢21formulae-sequencesubscript𝑐𝑗01𝑣21𝑗01𝑛1\begin{gathered}G^{j}_{lr}(r_{i},c_{j})=G^{j}(2r_{i},2c_{j}),\\ r_{i}=0,1,\cdots,\lfloor\frac{u}{2}\rfloor-1,c_{j}=0,1,\cdots,\lfloor\frac{v}{% 2}\rfloor-1,j=0,1,\cdots,n-1,\end{gathered}start_ROW start_CELL italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT ( italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ( 2 italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , end_CELL end_ROW start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 , 1 , ⋯ , ⌊ divide start_ARG italic_u end_ARG start_ARG 2 end_ARG ⌋ - 1 , italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = 0 , 1 , ⋯ , ⌊ divide start_ARG italic_v end_ARG start_ARG 2 end_ARG ⌋ - 1 , italic_j = 0 , 1 , ⋯ , italic_n - 1 , end_CELL end_ROW (3)

where the shape of Glrjsubscriptsuperscript𝐺𝑗𝑙𝑟G^{j}_{lr}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT is (u2𝑢2\lfloor\frac{u}{2}\rfloor⌊ divide start_ARG italic_u end_ARG start_ARG 2 end_ARG ⌋, v2𝑣2\lfloor\frac{v}{2}\rfloor⌊ divide start_ARG italic_v end_ARG start_ARG 2 end_ARG ⌋). We use function D()𝐷D(\cdot)italic_D ( ⋅ ) to summarize the above process:

Glrj=D(Gj),j=0,1,,n1,formulae-sequencesubscriptsuperscript𝐺𝑗𝑙𝑟𝐷superscript𝐺𝑗𝑗01𝑛1G^{j}_{lr}=D(G^{j}),j=0,1,\cdots,n-1,italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT = italic_D ( italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) , italic_j = 0 , 1 , ⋯ , italic_n - 1 , (4)

where all profiles Glrjsubscriptsuperscript𝐺𝑗𝑙𝑟G^{j}_{lr}italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT are denoted as the LR profile set Glr={Glr0,Glr1,,Glrn1}subscript𝐺𝑙𝑟subscriptsuperscript𝐺0𝑙𝑟subscriptsuperscript𝐺1𝑙𝑟subscriptsuperscript𝐺𝑛1𝑙𝑟\mathit{G}_{lr}=\left\{G^{0}_{lr},G^{1}_{lr},\cdots,G^{n-1}_{lr}\right\}italic_G start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT = { italic_G start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT , italic_G start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT , ⋯ , italic_G start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT }.

2.2 Self-supervised learning

In the realm of data enhancement for spatial transcriptomics, where no labels are available, so we employ a self-supervised learning approach [13]. We design auxiliary tasks that utilize the data itself as supervisory information. The entire process is divided into two phases, denoted as Pretrain and Enhance (Fig. 1). In Pretrain phase, we train the model using both LR and original gene expression profiles. Then in Enhance phase, we input the original data and obtain HR gene expression profiles.

2.2.1 Pretrain phase

stEnTrans achieves gene expression interpolation by employing Transformer Encoder [27] to extract global features of the spatial distribution of all genes within the tissue, thereby extending the size of the gene expression profiles from (u𝑢\mathit{u}italic_u, v𝑣\mathit{v}italic_v) to (2u𝑢\mathit{u}italic_u, 2v𝑣\mathit{v}italic_v) (Fig. 1). In this phase, we train model on the LR gene set Glr={Glr0,Glr1,,Glrn1}subscript𝐺𝑙𝑟subscriptsuperscript𝐺0𝑙𝑟subscriptsuperscript𝐺1𝑙𝑟subscriptsuperscript𝐺𝑛1𝑙𝑟\mathit{G}_{lr}=\left\{G^{0}_{lr},G^{1}_{lr},\cdots,G^{n-1}_{lr}\right\}italic_G start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT = { italic_G start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT , italic_G start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT , ⋯ , italic_G start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT } and original gene set G={G0,G1,,Gn1}𝐺superscript𝐺0superscript𝐺1superscript𝐺𝑛1\mathit{G}=\left\{G^{0},G^{1},\cdots,G^{n-1}\right\}italic_G = { italic_G start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_G start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , ⋯ , italic_G start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT }. original gene set G𝐺\mathit{G}italic_G is constructed from gene expression matrix Xm×nsubscript𝑋𝑚𝑛\mathit{X}_{m\times n}italic_X start_POSTSUBSCRIPT italic_m × italic_n end_POSTSUBSCRIPT and coordinate matrix Cm×2subscript𝐶𝑚2\mathit{C}_{m\times 2}italic_C start_POSTSUBSCRIPT italic_m × 2 end_POSTSUBSCRIPT by function R()𝑅R(\cdot)italic_R ( ⋅ ). LR gene set Glrsubscript𝐺𝑙𝑟\mathit{G}_{lr}italic_G start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT is constructed from original gene set G𝐺\mathit{G}italic_G by function D()𝐷D(\cdot)italic_D ( ⋅ ). We use Φ()Φ\Phi(\cdot)roman_Φ ( ⋅ ) represent the network model, which takes input Glrsubscript𝐺𝑙𝑟G_{lr}italic_G start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT and aims to recover G𝐺Gitalic_G. After filtering through whether-in-tissue matrix M𝑀\mathit{M}italic_M, we denote the output as Gsuperscript𝐺\mathit{G}^{\prime}italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, which has the same shape as G𝐺Gitalic_G. We use θ𝜃\thetaitalic_θ represent the parameters of the network model. Finally, we use Mean Squared Error to train our network model. The above process can be described as:

G=Φ(Glr;θ)M,superscript𝐺Φsubscript𝐺𝑙𝑟𝜃𝑀G^{\prime}=\Phi(G_{lr};\theta)\cdot M,italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = roman_Φ ( italic_G start_POSTSUBSCRIPT italic_l italic_r end_POSTSUBSCRIPT ; italic_θ ) ⋅ italic_M , (5)
Loss(θ)=1nj=0n1GjGj2.𝐿𝑜𝑠𝑠𝜃1𝑛superscriptsubscript𝑗0𝑛1superscriptnormsuperscriptsuperscript𝐺𝑗superscript𝐺𝑗2Loss(\theta)=\frac{1}{n}\sum_{j=0}^{n-1}\left\|{G^{j}}^{\prime}-G^{j}\right\|^% {2}.italic_L italic_o italic_s italic_s ( italic_θ ) = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT ∥ italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - italic_G start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (6)

2.2.2 Enhance phase

stEnTrans can adapt well to inputs of different sizes. Therefore, after the Pretrain phase, we can enhance original data to HR gene expression profiles using stEnTrans:

Ghr=Φ(G;θ)H(M),subscript𝐺𝑟Φ𝐺𝜃𝐻𝑀{G_{hr}}=\Phi(G;\theta)\cdot H(M),italic_G start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT = roman_Φ ( italic_G ; italic_θ ) ⋅ italic_H ( italic_M ) , (7)

where Ghrsubscript𝐺𝑟G_{hr}italic_G start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT, H(M)𝐻𝑀H(M)italic_H ( italic_M ) have shapes (n,2u,2v)𝑛2𝑢2𝑣(n,2u,2v)( italic_n , 2 italic_u , 2 italic_v ) and (2u,2v)2𝑢2𝑣(2u,2v)( 2 italic_u , 2 italic_v ) respectively, H()𝐻H(\cdot)italic_H ( ⋅ ) can learn the boundary on the imputed gene expression by M𝑀\mathit{M}italic_M to generate a new whether-in-tissue matrix corresponding to Ghrsubscript𝐺𝑟G_{hr}italic_G start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT. if location (ci,rj)subscript𝑐𝑖subscript𝑟𝑗(c_{i},r_{j})( italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) is the initial spot or lies between the initial two spots, then H(M)(ci,rj)=1𝐻𝑀subscript𝑐𝑖subscript𝑟𝑗1H(M)(c_{i},r_{j})=1italic_H ( italic_M ) ( italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = 1, otherwise, it equals 0 (Fig. 1).

Finally, we can get their corresponding gene expression matrix Xhrsubscript𝑋𝑟X_{hr}italic_X start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT and spatial coordinate matrix Chrsubscript𝐶𝑟C_{hr}italic_C start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT from Ghrsubscript𝐺𝑟G_{hr}italic_G start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT and H(M)𝐻𝑀H(M)italic_H ( italic_M ) by applying the inverse functions of R()𝑅R(\cdot)italic_R ( ⋅ ) .

2.3 Details of the proposed stEnTrans

Transformer uses self-attention mechanisms to capture dependencies among different positions in the input sequence, modeling interactions among all tokens [27] (Fig. 1). The standard Transformer architecture is designed for processing 1-D sequential data. To adapt it for handling 2-D gene expression maps, we reshape the maps into a series of patches [7]. P𝑃Pitalic_P represents the height and width of the patches. Suppose that size of the input x𝑥\mathit{x}italic_x is (H,W)𝐻𝑊(H,W)( italic_H , italic_W ), so the desired size of the output is (2H,2W)2𝐻2𝑊(2H,2W)( 2 italic_H , 2 italic_W ). We achieve this by applying four trainable convolutional structures, where the convolutional kernel size and stride are both set to patch size, followed by concat, and finally flattening to map the tokens to 4×P24superscript𝑃24\times P^{2}4 × italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT dimensions. In addition, if the size of the input profile are not multiples of the patch size, we need to perform zero-padding on the profile. This process corresponds to the patch embedding in stEnTrans:

xk=xWk+bk,k=0,1,2,3,formulae-sequencesubscript𝑥𝑘𝑥subscript𝑊𝑘subscript𝑏𝑘𝑘0123x_{k}=x\ast W_{k}+b_{k},k=0,1,2,3,italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_x ∗ italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k = 0 , 1 , 2 , 3 , (8)
x=concat(x0,x1,x2,x3),superscript𝑥concatsubscript𝑥0subscript𝑥1subscript𝑥2subscript𝑥3x^{\prime}=\text{concat}(x_{0},x_{1},x_{2},x_{3}),italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = concat ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) , (9)
x^=Flatten(x),^𝑥Flattensuperscript𝑥\hat{x}=\text{Flatten}(x^{\prime}),over^ start_ARG italic_x end_ARG = Flatten ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) , (10)

where x𝑥\mathit{x}italic_x, xksubscript𝑥𝑘x_{k}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, xsuperscript𝑥x^{\prime}italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, x^^𝑥\hat{x}over^ start_ARG italic_x end_ARG have shapes (H,W)𝐻𝑊(H,W)( italic_H , italic_W ), (P2,H/P,W/P)superscript𝑃2𝐻𝑃𝑊𝑃(P^{2},H/P,W/P)( italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_H / italic_P , italic_W / italic_P ), (4,P2,H/P,W/P)4superscript𝑃2𝐻𝑃𝑊𝑃(4,P^{2},H/P,W/P)( 4 , italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_H / italic_P , italic_W / italic_P ) and (N,4P2)𝑁4superscript𝑃2(N,4P^{2})( italic_N , 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) respectively, Wksubscript𝑊𝑘W_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and bksubscript𝑏𝑘b_{k}italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are the filter parameters and biases of convolutional structures. Here, N=HW/P2𝑁𝐻𝑊superscript𝑃2N=HW/P^{2}italic_N = italic_H italic_W / italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, which represents the number of generated patches and is also the input sequence length for the Transformer Encoder. We refer to above process as Patch Embedding.

Due to the fact that the self-attention mechanism itself does not inherently contain information about the position of tokens in the sequence, it is necessary to introduce positional encoding to help the model understand the order of the input sequence. The relative positional embeddings as described in [24] is not applicable to our method as the sequence lengths differ between phases Pretrain and Enhance. so stEnTrans adopts sine and cosine functions with different frequencies to encode positional information:

Pos(x^)(p,q)={sin(p/100002q/d),pmod2=0cos(p/100002(q1)/d),pmod2=1p=0,1,,N1,q=0,1,,4P2,formulae-sequence𝑃𝑜𝑠^𝑥𝑝𝑞cases𝑝superscript100002𝑞𝑑modulo𝑝20𝑝superscript100002𝑞1𝑑modulo𝑝21𝑝01𝑁1𝑞014superscript𝑃2\begin{gathered}Pos(\hat{x})(p,q)=\begin{cases}\sin(p/10000^{2q/d}),&p\bmod 2=% 0\\ \cos(p/10000^{2(q-1)/d}),&p\bmod 2=1\end{cases}\\ p=0,1,\cdots,N-1,q=0,1,\cdots,4P^{2},\end{gathered}start_ROW start_CELL italic_P italic_o italic_s ( over^ start_ARG italic_x end_ARG ) ( italic_p , italic_q ) = { start_ROW start_CELL roman_sin ( italic_p / 10000 start_POSTSUPERSCRIPT 2 italic_q / italic_d end_POSTSUPERSCRIPT ) , end_CELL start_CELL italic_p roman_mod 2 = 0 end_CELL end_ROW start_ROW start_CELL roman_cos ( italic_p / 10000 start_POSTSUPERSCRIPT 2 ( italic_q - 1 ) / italic_d end_POSTSUPERSCRIPT ) , end_CELL start_CELL italic_p roman_mod 2 = 1 end_CELL end_ROW end_CELL end_ROW start_ROW start_CELL italic_p = 0 , 1 , ⋯ , italic_N - 1 , italic_q = 0 , 1 , ⋯ , 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , end_CELL end_ROW (11)
x^0=x^+Pos(x^),subscript^𝑥0^𝑥𝑃𝑜𝑠^𝑥\hat{x}_{0}=\hat{x}+Pos(\hat{x}),over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = over^ start_ARG italic_x end_ARG + italic_P italic_o italic_s ( over^ start_ARG italic_x end_ARG ) , (12)

where Pos(x^)𝑃𝑜𝑠^𝑥Pos(\hat{x})italic_P italic_o italic_s ( over^ start_ARG italic_x end_ARG ) and x^^𝑥\hat{x}over^ start_ARG italic_x end_ARG have the same shape (N,4P2)𝑁4superscript𝑃2(N,4P^{2})( italic_N , 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), p𝑝pitalic_p is the position, q𝑞qitalic_q is the dimension and d𝑑ditalic_d is the patch embedding dimension.

Then, stEnTrans extracts global features of the LR profiles through the Transformer Encoder, which consists of multiple Transformer blocks. The Transformer block is mainly composed of multiheaded self-attention (MSA) and MLP blocks. LayerNorm is applied before each block and residual connections are applied after each block [4, 18].

Assuming the input of the MSA block is U𝑈Uitalic_U and the output is Z𝑍Zitalic_Z, here is a description of the multi-head self-attention block:

[Qi,Ki,Vi]=UWi+bi,i=1h,formulae-sequencesubscript𝑄𝑖subscript𝐾𝑖subscript𝑉𝑖𝑈subscript𝑊𝑖subscript𝑏𝑖𝑖1[Q_{i},K_{i},V_{i}]=UW_{i}+b_{i},i=1\dots h,[ italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] = italic_U italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 … italic_h , (13)
headi=softmax(QiKiTdi)Vi,i=1h,formulae-sequence𝑒𝑎subscript𝑑𝑖softmaxsubscript𝑄𝑖subscriptsuperscript𝐾𝑇𝑖subscript𝑑𝑖subscript𝑉𝑖𝑖1head_{i}=\text{softmax}(\frac{Q_{i}K^{T}_{i}}{\sqrt{d_{i}}})V_{i},i=1\dots h,italic_h italic_e italic_a italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = softmax ( divide start_ARG italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG end_ARG ) italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 … italic_h , (14)
Z=concat(head1,,headh),𝑍concat𝑒𝑎subscript𝑑1𝑒𝑎subscript𝑑Z=\text{concat}(head_{1},\dots,head_{h}),italic_Z = concat ( italic_h italic_e italic_a italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_h italic_e italic_a italic_d start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) , (15)

where hhitalic_h denotes the numbers of different self-attention operations in multi-head self-attention, The queries, keys, and values are linearly projected hhitalic_h times with different learnable parameters, d𝑑ditalic_d is set to a fixed shape of 4P2/h4superscript𝑃24P^{2}/h4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_h and Wisubscript𝑊𝑖W_{i}italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT have the shape of (4P2,4P2/h)4superscript𝑃24superscript𝑃2(4P^{2},4P^{2}/h)( 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_h ). This parallel processing allows the model to capture diverse patterns and relationships. We can represent the above process using the function MSA()(\cdot)( ⋅ ):

Z=MSA(U).𝑍MSA𝑈Z=\text{MSA}(U).italic_Z = MSA ( italic_U ) . (16)

In the MLP sub-layer, there are two linear transformations with a GELU activation function in between. Suppose that the input is Z𝑍Zitalic_Z, the process can be described as follows:

MLP(Z)=GeLu(ZW1+b1)W2+b2,MLP𝑍GeLu𝑍subscript𝑊1subscript𝑏1subscript𝑊2subscript𝑏2\text{MLP}(Z)=\text{GeLu}(ZW_{1}+b_{1})W_{2}+b_{2},MLP ( italic_Z ) = GeLu ( italic_Z italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) italic_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (17)

where Z𝑍Zitalic_Z and MLP(Z)MLP𝑍\text{MLP}(Z)MLP ( italic_Z ) hava the same shape (N,4P2)𝑁4superscript𝑃2(N,4P^{2})( italic_N , 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). Assuming D=4P2𝐷4superscript𝑃2D=4P^{2}italic_D = 4 italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, then W1D×2Dsubscript𝑊1superscript𝐷2𝐷W_{1}\in\mathbb{R}^{D\times 2D}italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × 2 italic_D end_POSTSUPERSCRIPT and W22D×Dsubscript𝑊2superscript2𝐷𝐷W_{2}\in\mathbb{R}^{2D\times D}italic_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_D × italic_D end_POSTSUPERSCRIPT.

Assuming the input of the \ellroman_ℓ-th Transformer block is x^1subscript^𝑥1\hat{x}_{\ell-1}over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ - 1 end_POSTSUBSCRIPT, we denote its output as x^subscript^𝑥\hat{x}_{\ell}over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT, which is also the input of the +limit-from\mathit{\ell\!+}roman_ℓ +​1-th block. Specifically, x^0subscript^𝑥0\hat{x}_{0}over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT serves as the input to the first block. The entire process of gene expression maps enhancement in stEnTrans within the Transformer Encoder can be described as follows:

x^=MSA(LN(x^1))+x^1,subscriptsuperscript^𝑥MSALNsubscript^𝑥1subscript^𝑥1\displaystyle\hat{x}^{\prime}_{\ell}=\text{MSA}(\text{LN}({\hat{x}_{\ell-1}}))% +\hat{x}_{\ell-1},over^ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT = MSA ( LN ( over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ - 1 end_POSTSUBSCRIPT ) ) + over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ - 1 end_POSTSUBSCRIPT , =1,,N,1𝑁\displaystyle\ell=1,\dots,N,roman_ℓ = 1 , … , italic_N , (18)
x^=MLP(LN(x^))+x^,subscript^𝑥MLPLNsubscriptsuperscript^𝑥subscriptsuperscript^𝑥\displaystyle\hat{x}_{\ell}=\text{MLP}(\text{LN}(\hat{x}^{\prime}_{\ell}))+% \hat{x}^{\prime}_{\ell},over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT = MLP ( LN ( over^ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ) ) + over^ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT , =1,,N,1𝑁\displaystyle\ell=1,\dots,N,roman_ℓ = 1 , … , italic_N , (19)

After passing through the Transformer Encoder, we perform a reverse embedding operation on the 1D sequence of token embeddings x^Nsubscript^𝑥𝑁\hat{x}_{N}over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, transforming it into 4-channel HR gene expression profile:

y=Flatten1(x^N),superscript𝑦superscriptFlatten1subscript^𝑥𝑁y^{\prime}=\text{Flatten}^{-1}(\hat{x}_{N}),italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = Flatten start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) , (20)
Δy(c,i,j)=y(c,r,iP,jP),c=0,1,2,3,r=P(imodP)+(jmodP),i=0,,H1,j=0,,W1,formulae-sequenceΔ𝑦𝑐𝑖𝑗superscript𝑦𝑐𝑟𝑖𝑃𝑗𝑃formulae-sequence𝑐0123formulae-sequence𝑟𝑃modulo𝑖𝑃modulo𝑗𝑃formulae-sequence𝑖0𝐻1𝑗0𝑊1\begin{gathered}\Delta y(c,i,j)=y^{\prime}(c,r,\left\lfloor\frac{i}{P}\right% \rfloor,\left\lfloor\frac{j}{P}\right\rfloor),c=0,1,2,3,\\ r=P(i\bmod P)+(j\bmod P),i=0,\dots,H\!-\!1,j=0,\dots,W\!-\!1,\end{gathered}start_ROW start_CELL roman_Δ italic_y ( italic_c , italic_i , italic_j ) = italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_c , italic_r , ⌊ divide start_ARG italic_i end_ARG start_ARG italic_P end_ARG ⌋ , ⌊ divide start_ARG italic_j end_ARG start_ARG italic_P end_ARG ⌋ ) , italic_c = 0 , 1 , 2 , 3 , end_CELL end_ROW start_ROW start_CELL italic_r = italic_P ( italic_i roman_mod italic_P ) + ( italic_j roman_mod italic_P ) , italic_i = 0 , … , italic_H - 1 , italic_j = 0 , … , italic_W - 1 , end_CELL end_ROW (21)

where ysuperscript𝑦y^{\prime}italic_y start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, ΔyΔ𝑦\Delta yroman_Δ italic_y have shapes (4,P2,HP,WP)4superscript𝑃2𝐻𝑃𝑊𝑃(4,P^{2},\frac{H}{P},\frac{W}{P})( 4 , italic_P start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , divide start_ARG italic_H end_ARG start_ARG italic_P end_ARG , divide start_ARG italic_W end_ARG start_ARG italic_P end_ARG ) and (4,H,W)4𝐻𝑊(4,H,W)( 4 , italic_H , italic_W ) respectively, r𝑟ritalic_r represents the position within a patch and its value increments from left to right and top to bottom within each patch, starting from 0.

Next, inspired by residual networks [8], we directly combines the original data and each channel of ΔyΔ𝑦\Delta yroman_Δ italic_y. In this case, we set Δy=[Δy0,Δy1,Δy2,Δy3]Δ𝑦Δsubscript𝑦0Δsubscript𝑦1Δsubscript𝑦2Δsubscript𝑦3\Delta y=[\Delta y_{0},\Delta y_{1},\Delta y_{2},\Delta y_{3}]roman_Δ italic_y = [ roman_Δ italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , roman_Δ italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , roman_Δ italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , roman_Δ italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ], where ΔycΔsubscript𝑦𝑐\Delta y_{c}roman_Δ italic_y start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT hava the same shape (H,W)𝐻𝑊(H,W)( italic_H , italic_W ), we get:

y=[y0,y1,y2,y3]=[Δy0+x,Δy1+x,Δy2+x,Δy3+x].𝑦subscript𝑦0subscript𝑦1subscript𝑦2subscript𝑦3Δsubscript𝑦0𝑥Δsubscript𝑦1𝑥Δsubscript𝑦2𝑥Δsubscript𝑦3𝑥y=[y_{0},y_{1},y_{2},y_{3}]=[\Delta y_{0}+x,\Delta y_{1}+x,\Delta y_{2}+x,% \Delta y_{3}+x].italic_y = [ italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ] = [ roman_Δ italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_x , roman_Δ italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_x , roman_Δ italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_x , roman_Δ italic_y start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_x ] . (22)

Finally, we can obtain HR gene expression profile by merging the 4-channel expression profile Y𝑌Yitalic_Y. We denote the HR expression profile as yhrsubscript𝑦𝑟y_{hr}italic_y start_POSTSUBSCRIPT italic_h italic_r end_POSTSUBSCRIPT, and the calculation process is as follows:

yhr(p,q)={y(0,p2,q2),pmod2=0,qmod2=0y(1,p2,q2),pmod2=0,qmod2=1y(2,p2,q2),pmod2=1,qmod2=0y(3,p2,q2),pmod2=1,qmod2=1,subscript𝑦hr𝑝𝑞cases𝑦0𝑝2𝑞2formulae-sequencemodulo𝑝20modulo𝑞20𝑦1𝑝2𝑞2formulae-sequencemodulo𝑝20modulo𝑞21𝑦2𝑝2𝑞2formulae-sequencemodulo𝑝21modulo𝑞20𝑦3𝑝2𝑞2formulae-sequencemodulo𝑝21modulo𝑞21y_{\text{\tiny{hr}}}(p,q)=\begin{cases}y(0,\left\lfloor\frac{p}{2}\right% \rfloor,\left\lfloor\frac{q}{2}\right\rfloor),&p\bmod 2=0,q\bmod 2=0\\ y(1,\left\lfloor\frac{p}{2}\right\rfloor,\left\lfloor\frac{q}{2}\right\rfloor)% ,&p\bmod 2=0,q\bmod 2=1\\ y(2,\left\lfloor\frac{p}{2}\right\rfloor,\left\lfloor\frac{q}{2}\right\rfloor)% ,&p\bmod 2=1,q\bmod 2=0\\ y(3,\left\lfloor\frac{p}{2}\right\rfloor,\left\lfloor\frac{q}{2}\right\rfloor)% ,&p\bmod 2=1,q\bmod 2=1\end{cases},italic_y start_POSTSUBSCRIPT hr end_POSTSUBSCRIPT ( italic_p , italic_q ) = { start_ROW start_CELL italic_y ( 0 , ⌊ divide start_ARG italic_p end_ARG start_ARG 2 end_ARG ⌋ , ⌊ divide start_ARG italic_q end_ARG start_ARG 2 end_ARG ⌋ ) , end_CELL start_CELL italic_p roman_mod 2 = 0 , italic_q roman_mod 2 = 0 end_CELL end_ROW start_ROW start_CELL italic_y ( 1 , ⌊ divide start_ARG italic_p end_ARG start_ARG 2 end_ARG ⌋ , ⌊ divide start_ARG italic_q end_ARG start_ARG 2 end_ARG ⌋ ) , end_CELL start_CELL italic_p roman_mod 2 = 0 , italic_q roman_mod 2 = 1 end_CELL end_ROW start_ROW start_CELL italic_y ( 2 , ⌊ divide start_ARG italic_p end_ARG start_ARG 2 end_ARG ⌋ , ⌊ divide start_ARG italic_q end_ARG start_ARG 2 end_ARG ⌋ ) , end_CELL start_CELL italic_p roman_mod 2 = 1 , italic_q roman_mod 2 = 0 end_CELL end_ROW start_ROW start_CELL italic_y ( 3 , ⌊ divide start_ARG italic_p end_ARG start_ARG 2 end_ARG ⌋ , ⌊ divide start_ARG italic_q end_ARG start_ARG 2 end_ARG ⌋ ) , end_CELL start_CELL italic_p roman_mod 2 = 1 , italic_q roman_mod 2 = 1 end_CELL end_ROW , (23)

where yhrsubscript𝑦hry_{\text{\tiny{hr}}}italic_y start_POSTSUBSCRIPT hr end_POSTSUBSCRIPT is divided into a series of non-overlapping 2x2 sub-maps, each sub-map’s top-left, top-right, bottom-left, and bottom-right respectively come from channels 0, 1, 2, and 3 of y𝑦yitalic_y.

3 Experimental Results

Refer to caption
Figure 2: The experiments indicate that stEnTrans exhibits superior interpolation performance. (A) and (B) Applying stEnTrans, DIST, NEDI, Linear, Cubic and NN to downsampled simulated data for predicting the ground truth of gene expression. This gene ZNF703 comes from 10X Visium IDC data. (B) A series of gene expression profiles are obtained through a consistent process with (A). This gene MUC1 also originates from 10X Visium IDC data. (C) and (D) apply stEnTrans and DIST to the ground truth of ZNF703 gene and MUC1 gene, resulting in HR gene expression maps. Due to inferior interpolation performance of other methods, we only compare with DIST. The gene-wise minimum and maximum of the ground truth map the color ranges consistently.

3.1 More finer interpolation capability

To validate the interpolation capability of stEnTrans, we evaluated two genes with different spatial patterns on IDC data, ZNF703 and MUC1. We used down-sampled data as simulation, the original data as ground truth, and applied stEnTrans, DIST [32], Linear, Cubic, NN and NEDI [16] for interpolation on the down-sampled data for predicting ground truth (Fig. 2). In real-world applications, gene expression exhibits vacant locations due to imperfect spatial coverage and quality control, so down-sampled data includes losses to mimic these real-world scenarios. stEnTrans and the mentioned interpolation methods can all fill vacancy expression to ensure the integrity and continuity of tissue slices. The result shows that stEnTrans demonstrated a more delicate effect in interpolation, closely resembling real data than other methods. NN and NEDI may result in discontinuous expression with noticeable aliasing. Linear and Cubic methods overcome aliasing but result in blurring. These interpolation methods only rely on nearby points to predict the gene expression of unmeasured points, neglecting the utilization of global effective information. Furthermore, stEnTrans is trained on all genes within the tissue, effectively establishing correlated information among genes, as opposed to methods that only consider the expression of individual gene. DIST enhances linear continuity but lacks clarity in details, deviating from ground truth. For modeling long-range dependencies, convolutional methods lack the robustness of the Transformer framework, making stEnTrans superior to DIST in this aspect.

Next, we applied stEnTrans and DIST to genes ZNF703 and MUC1 for interpolation on the ground data to obtain their HR gene expression profiles (Fig. 2). Through comparison, while both methods can achieve HR data with the same resolution, stEnTrans presents a smoother expression mapping, sharper gene expression, and more detailed depiction.

3.2 More accurate interpolation capability

To further illustrate the superiority of stEnTrans, we calculated the Pearson correlation coefficients (PCC) between ground truth and each imputed expression for all spots of the genes after quality filtering across six datasets.

To illustrate that stEnTrans can also be applied to other platforms, we created a simulated ST data from STARmap data. By proportionally mapping the coordinates of these pseudo-spots in spatial dimensions and fine-tuning, we can simulate array-based ST data. RNA molecules of 903 genes have been determined in the mouse placenta for this dataset. ClusterMap clusters RNA into subcellular structures, generating 7224 pseudo-spots representing subcellular sizes [9], which closely align with our simulated data. Continuing, we simulated another ST data from Stereo-seq data, which provides subcellular resolution spatial expression in the adult mouse hemi-brain [5], using the same spatial coordinate mapping method as described above. Additionally, there is one real ST dataset and three 10X Visium datasets. We calculated PCCs between ground truth and each imputed expression for all spots of the genes after quality filtering across the aforementioned six datasets and showed the results in (Fig. 3).

Refer to caption
Figure 3: PCCs evaluation between ground truth and each imputed expression in six ST datasets. The results indicate that stEnTrans exhibits outstanding precision in interpolation, as demonstrated through experiments on six datasets. (A) The human melanoma ST data (mel1 rep1). (B) Simulation created from STARmap mouse placenta. (C) Simulation created from Stereo-seq adult mouse hemi-brain. (D) Human breast cancer spatial transcriptomics data. (E) Visium mouse sagittal posterior brain. (F) Visium human IDC data. Gene-wise Pearson correlation correlations between ground truth and imputed expression using stEnTrans, DIST, Linear, Cubic, NN, NEDI on that six datasets. Boxes represent the middle 50% range of the data,which show 25th, 50th and 75th percentiles. Whiskers represent the overall distribution range of the data, providing information about the data’s dispersion and potential outliers.

stEnTrans achieved higher median in gene prediction PCCs between imputed and true expression compared to all other methods across six different datasets (median: in HM data, stEnTrans=0.636; in MP data, stEnTrans=0.554; in AMHB data, stEnTrans=0.513; in HBC data, stEnTrans=0.584; in MBSP data, stEnTrans=0.606; in IDC data, stEnTrans=0.537). Even in HM data, HBC data and MP data, the 25th percentile of PCCs for stEnTrans surpasses or is nearly equivalent to the 75th percentile for DIST. In the other three datasets, stEnTrans’s 25th percentile of PCCs are higher than the maximum of 50th percentiles of DIST. We did not compare stEnTrans with methods like BayesSpace [31], TESLA [10] and iStar [30] because BayesSpace divides a spot into multiple sub-spots, rendering it unable to predict unmeasured regions between spots; TESLA and iStar requires histological images, which some simulated data cannot provide.

Refer to caption
Figure 4: stEnTrans can help find spatial patterns for ST data and enrich more biologically significant pathways. (A) The original histopathological annotations of hematoxylin- and eosin-stained tissue image, where black represents melanoma, red represents stroma, and yellow represents lymphatic tissue. (B) stEnTrans has endowed disease-related genes with distinct spatial patterns. These genes exhibit higher rankings in the imputed data, whereas they have lower rankings in the original data. We provided the rankings of genes at the top of each gene expression profile. (C) The number of genes and significant GO:BP terms in three mainly families of original data and imputed data, where orange represents original data and blue represents imputed data. (D) Top 10 significant GO:BP terms were only identified in the imputed data. above corresponds to the melanoma family, and below corresponds to the lymphoid family.

3.3 Better help to discover spatial patterns

Next, we explore whether stEnTrans can help discover spatial patterns in melanoma ST data. We utilize a tool, Sepal, which employs diffusion-based modeling to identify transcripts with spatial patterns in the transcript profiles [2]. This method is applied to original and imputed mel ST data using stEnTrans. Following Sepal’s standards, we rank each transcript profile according to their degree of randomness, and then extract the top-150 transcript profiles as experimental samples. Next, we categorize the top-150 transcript profiles into pattern families based on the similarity of their spatial organization. Among them, the spatial organization structures and biological processes of the three main pattern families can match well with histological annotations, including melanoma, lymphoid, and stroma. Finally, we perform functional enrichment analysis for each pattern family using the Gene Ontology: Biological Processes (GO: BP) database [23].

We selected a subset of transcript profiles that ranked lower in the original melanoma ST data but higher in the imputed melanoma ST data (Fig. 4). These genes play important roles in various physiological and pathological processes. DLL3 is highly expressed in melanoma, making it a potential therapeutic target [29]. WARS plays important physiopathological roles in cancer diseases, with the potential to serve as a pharmacological target and therapeutic agent [1]. The expression of IGFBP7 can induce cellular senescence and apoptosis, and it has the potential to inhibit the growth of melanoma [6]. Due to low resolution, incomplete gene expression results in these genes being ranked lower in the original melanoma data. After enhancing the resolution through stEnTrans, the spatial structure of these genes becomes more apparent, resulting in a significant improvement in their rankings.

We first separately counted the respective quantities of the main three pattern families within the top-150 genes in original and imputed human melanoma data and then performed functional enrichment analysis (Fig.  4). The bar chart reveals that, despite a decrease in the number of genes in the melanoma family, it has also enriched more GO:BP terms. the number of genes and GO:BP terms in the lymphoid and stroma families have both increased. It’s worth noting that the imputed human melanoma data for the three families collectively enriched in 79 new pathways compared to the original human melanoma data. In the melanoma family, the top-10 significantly enriched new pathways are mostly related to biological regulation and metabolic processes. In contrast, within the lymphoid family, the top-10 significantly enriched new pathways are mostly associated with immune response, cell activation, and proliferation processes. In addition to the top-10 significantly enriched new pathways, the imputed human melanoma data also newly enriched in multiple pathways related to growth factors and lymphocyte activation.

3.4 Ablation Study

To assess the contributions of absolute positional encoding (denote as Pos) and the Res module to performance, we conducted a ablation study. Our ablation study is based on simulated ST data from STARmap mouse placenta (MP), simulated ST data from Stereo-seq adult mouse hemi-brain (AMHB), human melanoma ST data (HM), Human Invasive ductal carcinoma spatial transcriptomics data (IDC), human breast cancer spatial transcriptomics data (HBC), mouse brain sagittal posterior data (MBSP). All experiments are evaluated using PCC.

Specifically, we initially assessed the performance of the original model by calculating PCCs on six datasets as the baseline for comparison. Subsequently, we systematically removed Module Pos, Module Res, and both simultaneously, observing the model’s performance under these ablation conditions. Overall, the results indicate that, among all ablation conditions, the removal of Pos has the most significant impact on the model’s performance, resulting in the poorest performance. Next in significance is the simultaneous removal of Pos and Res, while the impact of removing Res is relatively smaller but still leads to a decline in performance.

Table 1: Effects of Pos and Res. “Pos” refers to the absolute positional encoding, while “Res” stands for the skip connection that combines the original data to the output of the Transformer Encoder.
PCCs HM MP AMHB HBC MBSP IDC
stEnTrans 0.636±plus-or-minus\pm±0.018 0.554±plus-or-minus\pm±0.021 0.513±plus-or-minus\pm±0.056 0.584±plus-or-minus\pm±0.006 0.606±plus-or-minus\pm±0.010 0.537±plus-or-minus\pm±0.003
(w/o)Pos 0.581±plus-or-minus\pm±0.017 0.439±plus-or-minus\pm±0.019 0.506±plus-or-minus\pm±0.005 0.559±plus-or-minus\pm±0.009 0.569±plus-or-minus\pm±0.016 0.525±plus-or-minus\pm±0.004
(w/o)Res 0.619±plus-or-minus\pm±0.014 0.476±plus-or-minus\pm±0.020 0.510±plus-or-minus\pm±0.005 0.570±plus-or-minus\pm±0.008 0.598±plus-or-minus\pm±0.012 0.529±plus-or-minus\pm±0.004
(w/o)Pos&Res 0.587±plus-or-minus\pm±0.017 0.470±plus-or-minus\pm±0.015 0.505±plus-or-minus\pm±0.004 0.561±plus-or-minus\pm±0.009 0.583±plus-or-minus\pm±0.014 0.529±plus-or-minus\pm±0.004

4 Discussion And Conclusion

In our research, we propose stEnTrans, a deep learning method based on Transformer, which can impute the gene expression on unmeasured areas that are not covered between spots and accidentally lost locations during sequencing through self-supervised learning, enhance the expression of all spots, and thus comprehensively improve the quality of spatial transcriptomics data.Our method does not rely on any other data, such as histological images, and only requires a gene expression matrix with spatial coordinates to improve gene expression profiles to high resolution. Compared with other methods of similar function, it has higher precision and the obtaining high-resolution gene expression profiles have more delicate and smooth characteristics.

In spatial transcriptomics studies, identifying genes with spatial expression patterns is very important for researching various physiological and pathological processes. However, due to limitations in existing technologies, the statistical power is often low, making it challenging to identify genes with spatial expression patterns within small local regions using raw spatial transcriptomic data. stEnTrans has the capability to enhance the resolution of gene expression profiles, which can help to discover the spatial patterns of genes, making disease-related genes have more significant spatial patterns. In addition, it can also help to discover more biologically meaningful pathways.

The results of ablation experiments show that both Res module and absolute position encoding are indispensable to the network and can improve the performance of the network, but the use of absolute position encoding has a higher impact on improving the network’s performance.

Despite the significant achievements of stEnTrans in improving the resolution of gene expression profiles by imputing, there are still several challenges in practical applications. Our next research direction is not only to impute the unmeasured areas, but also to increase the resolution to the sub-spot level by dividing the spot into multiple sub-spots without relying on any other data.

4.0.1 Acknowledgements

The work was supported in part by the National Natural Science Foundation of China (62262069), in part by the Yunnan Fundamental Research Projects under Grant (202201AT070469, 202301BF070001-019).

References

  • [1] Ahn, Y.H., Oh, S.C., et al.: Tryptophanyl-trna synthetase as a potential therapeutic target. International Journal of Molecular Sciences 22(9),  4523 (2021)
  • [2] Andersson, A., Lundeberg, J.: sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling. Bioinformatics 37(17), 2644–2650 (2021)
  • [3] Asp, M., Bergenstråhle, J., et al.: Spatially resolved transcriptomes—next generation tools for tissue exploration. BioEssays 42(10), 1900221 (2020)
  • [4] Brody, S., Alon, U., et al.: On the expressivity role of layernorm in transformers’ attention. arXiv preprint arXiv:2305.02582 (2023)
  • [5] Chen, A., Liao, S., et al.: Spatiotemporal transcriptomic atlas of mouse organogenesis using dna nanoball-patterned arrays. Cell 185(10), 1777–1792 (2022)
  • [6] Chen, R.Y., Chen, H.X., et al.: In-vivo transfection of pcdna3. 1-igfbp7 inhibits melanoma growth in mice through apoptosis induction and vegf downexpression. Journal of Experimental & Clinical Cancer Research 29,  1–8 (2010)
  • [7] Dosovitskiy, A., Beyer, L., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations. pp. 1–22 (2021)
  • [8] He, K., Zhang, X., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [9] He, Y., Tang, X., et al.: Clustermap for multi-scale clustering analysis of spatial gene expression. Nature Nommunications 12(1),  5909 (2021)
  • [10] Hu, J., Coleman, K., et al.: Deciphering tumor ecosystems at super resolution from spatial transcriptomics with tesla. Cell Systems 14(5), 404–417 (2023)
  • [11] Huuki-Myers, L., Spangler, A., et al.: Integrated single cell and unsupervised spatial transcriptomic analysis defines molecular anatomy of the human dorsolateral prefrontal cortex. BioRxiv (2023)
  • [12] Janesick, A., Shelansky, R., et al.: High resolution mapping of the breast cancer tumor microenvironment using integrated single cell, spatial and in situ analysis of ffpe tissue. Nature Nommunications 14(1),  5353 (2023)
  • [13] Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(11), 4037–4058 (2020)
  • [14] Li, Q., Zhang, X., et al.: Spatial transcriptomics for tumor heterogeneity analysis. Frontiers in Genetics 13(2), 906158 (2022)
  • [15] Li, X., Min, W., Wang, S., Wang, C., Xu, T.: stmcdi: Masked conditional diffusion model with graph neural network for spatial transcriptomics data imputation. arXiv preprint arXiv:2403.10863 (2024)
  • [16] Li, X., Orchard, M.T.: New edge-directed interpolation. IEEE Transactions on Image Processing 10(10), 1521–1527 (2001)
  • [17] Liu, B., Li, Y., et al.: Analysis and visualization of spatial transcriptomic data. Frontiers in Genetics 12(3), 785290 (2022)
  • [18] Liu, F., Ren, X., et al.: Rethinking skip connection with layer normalization in transformers and resnets. arXiv preprint arXiv:2105.07205 (2021)
  • [19] Lohoff, T., Ghazanfar, S., et al.: Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis. Nature Biotechnology 40(1), 74–85 (2022)
  • [20] Min, W., Fang, D., Chen, J., Zhang, S.: Dimensionality reduction and denoising of spatial transcriptomics data using dual-channel masked graph autoencoder. bioRxiv pp. 1–20 (2024)
  • [21] Moses, L., Pachter, L.: Museum of spatial transcriptomics. Nature Methods 19(5), 534–546 (2022)
  • [22] Rao, N., Clark, S., et al.: Bridging genomics and tissue pathology: 10x genomics explores new frontiers with the visium spatial gene expression solution. Genetic Engineering & Biotechnology News 40(2), 50–51 (2020)
  • [23] Raudvere, U., Kolberg, L., et al.: g: Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research 47(W1), W191–W198 (2019)
  • [24] Shaw, P., Uszkoreit, J., et al.: Self-attention with relative position representations. arXiv preprint arXiv:1803.02155 (2018)
  • [25] Sun, S., Zhu, J., et al.: Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature Methods 17(2), 193–200 (2020)
  • [26] Thrane, K., Eriksson, H., et al.: Spatially resolved transcriptomics enables dissection of genetic heterogeneity in stage iii cutaneous malignant melanoma. Cancer Research 78(20), 5970–5979 (2018)
  • [27] Vaswani, A., Shazeer, N., et al.: Attention is all you need. Advances in Neural Information Processing Systems 14, 1–11 (2017)
  • [28] Wang, X., Allen, W.E., et al.: Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361(6400), eaat5691 (2018)
  • [29] Yao, J., Bergsland, E., et al.: Dll3 as an emerging target for the treatment of neuroendocrine neoplasms. The Oncologist 27(11), 940–951 (2022)
  • [30] Zhang, D., Schroeder, A., et al.: Inferring super-resolution tissue architecture by integrating spatial transcriptomics with histology. Nature Biotechnology pp. 1–6 (2024)
  • [31] Zhao, E., Stone, M.R., et al.: Spatial transcriptomics at subspot resolution with bayesspace. Nature Biotechnology 39(11), 1375–1384 (2021)
  • [32] Zhao, Y., Wang, K., et al.: Dist: spatial transcriptomics enhancement using deep learning. Briefings in Bioinformatics 24(2), bbad013 (2023)