Graph Neural Networks in Histopathology: Emerging Trends and Future Directions

Siemen Brussee (Leiden University Medical Center, The Netherlands), Giorgio Buzzanca (Leiden University Medical Center, The Netherlands), Anne M.R. Schrader, M.D. (Leiden University Medical Center, The Netherlands), Jesper Kers, M.D. (Leiden University Medical Center, The Netherlands; Amsterdam University Medical Center, The Netherlands)

(27-03-2024)

Abstract

Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.

Keywords— Graph Neural Networks, Computational Pathology, Graph Representation Learning, Hierarchical Graph Representation Learning, Adaptive Graph Structure Learning, Multimodal Graph Representation Learning, Higher-order Graph Representation Learning

1 Introduction

Histopathological analysis is an important diagnostic and examination tool used for disease diagnosis, estimating disease prognosis, and selecting or monitoring therapeutic strategies. Since the digitization of whole slide images (WSIs) in the early 2000s, the computational analysis of histopathology images has become an increasingly important part of histopathology. Starting with image analysis algorithms, the field transitioned to a deep learning approach following the rise of convolutional neural networks in the 2010s, largely due to the availability of large datasets (e.g., ImageNet [1]) and deeper convolutional architectures (e.g., AlexNet [2]). In the last five years, paradigms in the field have become more heterogeneous, with the advent of attention-based multiple instance learning [3] [4], vision transformers [5] [6], self-supervised learning [7] [8] and graph neural network [9] [10] approaches.

The emergence of Graph Neural Networks (GNNs) [9] has allowed effective modeling of naturally graph-structured data, such as social networks, (bio)chemical molecules [11] [12], geospatial data [13] [14], and tabular data which can be effectively modeled as a graph, such as in recommendation systems [15] and drug interactions [16]. GNNs can be effectively used for problems involving pairwise interactions between entities in data. In addition, the topological inductive bias that can be encoded in the graph structure allows GNN models to learn based on the topology of the problem. We can define the graph neural network as an optimizable transformation on all graph attributes that preserves graph symmetries by being permutation-invariant [17]. Fundamental to the graph neural network is the notion of message-passing, in which a learned transformation exchanges feature information between entities in the graph, leading to topology-aware feature vectors. How the message-passing function is defined depends on the type of GNN used, of which many varieties exist (e.g., GCN [18], GAT [19], GIN [20]). In 2018, GNNs were also introduced to histopathology [10] and have gained tremendous popularity in the field since then.

While review papers on the application of GNNs in histopathology exist, they give a general overview [21] or focus on the clinical applications of GNNs in histopathology [22]. Instead, we focus on identifying and quantifying emerging trends in the application of GNNs in histopathology and use these to provide future directions in the field.

Our review is organized into three main sections: First, we introduce GNNs, and their applications in histopathology. Secondly, we identify emerging trends in the application of GNNs in histopathology, from which we select some emerging paradigms which we discuss in more depth (Figure 1). Thirdly, based on our findings, we provide future directions for the field.

Figure 1: Overview of the four emerging subtopics of GNNs in histopathology covered in this review: Hierarchical GNNs, Multimodal GNNs, Higher-order Graphs, and Adaptive Graph Structure Learning. [23] [24] [25] [26] [27]

2 Graph Neural Networks in Histopathology

2.1 Graph Neural Networks

A graph $G$ is defined as a set of nodes $V$ connected by a set of edges $E$: $G=(V,E)$. Each edge is a pair of nodes: $E\subseteq\{(x,y)\mid x,y\in V\}$. The connectivity of the nodes in a graph is captured in the adjacency matrix $A\in\mathbb{R}^{n\times n}$, where $n$ is the number of nodes in $G$. Each entry $a_{ij}$ of $A$ denotes the existence of an edge $e_{ij}\in E$ as follows:

a_{ij}=\begin{cases}1,&\text{if }e_{ij}\in E\\ 0,&\text{if }e_{ij}\notin E\\ \end{cases}

(1)

Alternatively, the values of $a_{ij}$ can denote edge weights ranging from 0 to 1, which represent the connectivity strength between nodes $i$ and $j$. Given an undirected graph $G=(V,E)$, we can define the $k$-neighborhood of any node $v\in V$, denoted $N_{k}(v)$, recursively as follows:

$\displaystyle N_{0}(v)$	$\displaystyle=\{v\},$	(2)
$\displaystyle N_{1}(v)$	$\displaystyle=\{u\mid(v,u)\in E\text{ or }(u,v)\in E\},$	(3)
$\displaystyle N_{k}(v)$	$\displaystyle=\{u\mid\exists w\in N_{k-1}(v)\text{ such that }(w,u)\in E\text{ or }(u,w)\in E\}.$	(4)
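The recursion in Eqs. (2)-(4) can be sketched in a few lines of plain Python. This is only an illustration: the function name `k_neighborhood` and the small path graph are assumptions made for the example. Note that, under this definition, $N_{k}(v)$ contains the nodes reachable by a walk of exactly $k$ edges, so $v$ itself can reappear for $k\geq 2$.

```python
def k_neighborhood(edges, v, k):
    """Return N_k(v) for an undirected graph given as a list of (u, w) pairs."""
    # Build a symmetric adjacency list so (u, w) and (w, u) are both covered.
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)

    current = {v}  # N_0(v) = {v}
    for _ in range(k):
        # N_k(v): every node adjacent to at least one node in N_{k-1}(v).
        current = {u for w in current for u in adj.get(w, set())}
    return current


# Hypothetical path graph 0 - 1 - 2 - 3 (illustration only).
edges = [(0, 1), (1, 2), (2, 3)]
print(k_neighborhood(edges, 0, 1))
print(k_neighborhood(edges, 0, 2))
```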

GNNs aggregate feature information from the $k$ -neighborhood of each node, where $k$ directly corresponds to the number of GNN layers used. This aggregated information is used to update the node feature representation, $h$ , in each GNN layer. Mathematically, the node representation update is defined as follows:

\begin{split}h_{u}^{(k+1)}&=\text{UPDATE}^{(k)}(h_{u}^{(k)},\text{AGGREGATE}^{(k)}(\{h_{v}^{(k)},\forall v\in N(u)\}))\\ &=\text{UPDATE}^{(k)}(h_{u}^{(k)},m^{(k)}_{N(u)})\end{split}

(5)

where UPDATE and AGGREGATE denote the functions that update the node representation $h_{u}$ and aggregate the hidden representations from $u$'s direct neighborhood $N(u)=N_{1}(u)$, respectively; stacking $k$ such layers propagates information from the $k$-neighborhood. The exact definition of the UPDATE and AGGREGATE functions depends on the message-passing scheme used; they are usually parameterized by two learnable weight matrices. However, all message-passing schemes employ a permutation-invariant AGGREGATE function (e.g., sum, mean). We can generally distinguish two types of message-passing schemes: spectral message-passing, which is based on the spectral properties of the graph (e.g., eigenvalues) calculated using the graph Fourier transform, and spatial message-passing, which is applied directly to the connectivity structure present in the input graph. In this review, we mainly focus on spatial message-passing methods, as these are applied in the vast majority of histopathology applications using GNNs. We first denote a tuple $(G,A,X)$, where $G$ denotes the input graph, $A$ the associated adjacency matrix and $X$ the input node feature matrix. To make the graph representation less sensitive to node degrees, we can normalize the adjacency matrix into a normalized adjacency matrix $\tilde{A}$, as follows:

\tilde{A}=D^{-1/2}AD^{-1/2}

(6)

where $D$ denotes the degree matrix of the graph (a diagonal matrix in which $D_{ii}$ is the degree of node $i$). To utilize spectral information in the graph structure, we can use the Laplacian matrix of the graph, defined as $L=D-A$. During message passing, we transform the feature matrix $X$ into the hidden feature representation matrix $H$, typically using a learned weight matrix $W$ and a nonlinear activation function $\sigma$.

One of the most widely adopted and earliest spatial GNN schemes is the Graph Convolutional Network (GCN). The message passing function uses a normalized adjacency matrix to update the hidden representations of nodes based on the node neighborhood. To acquire the hidden representation matrix $H$ , the message passing function in GCN layer $l$ is defined as follows:

H^{l+1}=\sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{l}W^{l})

(7)

in which $\tilde{A}=A+I$ represents the adjacency matrix with added self-loops for each node and $\tilde{D}$ denotes the degree matrix of $\tilde{A}$ [18].
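As a minimal NumPy sketch of the GCN propagation rule in Eq. (7), assuming ReLU as the nonlinearity $\sigma$ and a dense adjacency matrix (the function name `gcn_layer` and the toy inputs are illustrative, not taken from any cited implementation):

```python
import numpy as np


def gcn_layer(A, H, W):
    """One GCN step: ReLU(D~^{-1/2} A~ D~^{-1/2} H W), with A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])            # add a self-loop to every node
    deg = A_tilde.sum(axis=1)                   # degrees of A~ (always >= 1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU assumed as sigma


# Toy 3-node path graph with one-hot input features (illustration only).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.eye(3)
W0 = np.ones((3, 2))
H1 = gcn_layer(A, H0, W0)
print(H1)
```

Because the two end nodes of the path have identical neighborhoods, they receive identical representations, which illustrates how the update depends only on graph structure and features.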

The spatial Graph Attention Network (GAT) [19] extends the GCN scheme by adding attention weights to each edge of the graph. This essentially allows models to learn the importance of nodes during message passing. For each edge $e_{vu}$ connecting nodes $v$ and $u$ , we first calculate an attention score:

e_{vu}=\sigma\left(\vec{a}^{T}\left[Wh_{v}^{(l)}\|Wh_{u}^{(l)}\right]\right)

(8)

Here, $\|$ denotes concatenation and $\vec{a}$ is a trainable shared parameter vector. Using this score, we can calculate the corresponding edge attention weight as follows:

\alpha_{vu}=\frac{\exp(e_{vu})}{\sum_{u^{\prime}\in N(v)}\exp(e_{vu^{\prime}})}

(9)

We then update the hidden node representation of $h_{v}^{(l)}\in H^{l}$ as follows:

h_{v}^{(l+1)}=\sigma\left(\sum_{u\in N(v)}\alpha_{vu}\cdot W^{l}h_{u}^{(l)}\right)

(10)
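Eqs. (8)-(10) can be sketched for a single attention head as follows. The sketch makes several assumptions that are not fixed by the text: LeakyReLU is used as the nonlinearity in the attention score, ReLU as the output nonlinearity, and every node is assumed to have at least one neighbor (for example through self-loops). All names are illustrative.

```python
import numpy as np


def gat_layer(A, H, W, a, slope=0.2):
    """Single-head graph attention update following Eqs. (8)-(10)."""
    Z = H @ W.T                                  # row v holds W h_v
    H_new = np.zeros_like(Z)
    for v in range(A.shape[0]):
        nbrs = np.flatnonzero(A[v])              # N(v); assumed non-empty
        # Attention scores e_vu = LeakyReLU(a^T [W h_v || W h_u]).
        pairs = np.hstack([np.tile(Z[v], (len(nbrs), 1)), Z[nbrs]])
        e = pairs @ a
        e = np.where(e > 0, e, slope * e)
        # Softmax over the neighborhood of v (shifted for stability).
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()
        H_new[v] = alpha @ Z[nbrs]               # attention-weighted sum
    return np.maximum(H_new, 0.0)                # ReLU assumed as sigma


# Toy graph with self-loops and random features (illustration only).
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4))
a = rng.normal(size=4)
H_next = gat_layer(A, H, W, a)
```

With a zero attention vector, all neighbors receive equal weight and the layer reduces to mean aggregation, which is a convenient sanity check.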

The spatial GraphSAGE method [28] provides a scalable and flexible framework to decide how neighboring nodes should be aggregated. It differs from other message-passing schemes in that it samples a fixed-size set $\mathcal{S}_{v}$ of neighbors from the neighborhood of each node $v$, instead of using all neighbors. Given the hidden node representation $h_{v}^{(l)}\in H^{l}$, we can define its message-passing scheme as follows:

h_{v}^{(l+1)}=\sigma\left(\mathbf{W}^{(l)}\cdot\text{AGG}^{(l)}\left(\{h_{u}^{(l)}:u\in\mathcal{S}_{v}\}\right)\right)

(11)

where AGG denotes an aggregation function at layer $l$, which can be any permutation-invariant function (e.g., sum, mean).
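The sampled aggregation in Eq. (11) might look as follows in NumPy. This is a sketch under stated assumptions: the mean is chosen as AGG, ReLU as $\sigma$, neighbors are sampled uniformly without replacement, and isolated nodes simply keep a zero vector. The names `sage_layer` and `num_samples` are illustrative.

```python
import numpy as np


def sage_layer(A, H, W, num_samples, rng):
    """Mean-aggregate a random neighbor sample S_v per node, as in Eq. (11)."""
    out = np.zeros((A.shape[0], W.shape[0]))
    for v in range(A.shape[0]):
        nbrs = np.flatnonzero(A[v])
        if len(nbrs) == 0:
            continue                              # isolated node: keep zeros
        size = min(num_samples, len(nbrs))
        sample = rng.choice(nbrs, size=size, replace=False)
        out[v] = W @ H[sample].mean(axis=0)       # W * AGG({h_u : u in S_v})
    return np.maximum(out, 0.0)                   # ReLU assumed as sigma


# Toy 3-node path graph (illustration only).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.arange(12, dtype=float).reshape(3, 4)
W = np.ones((2, 4)) * 0.1
out_full = sage_layer(A, H, W, num_samples=10, rng=np.random.default_rng(0))
out_one = sage_layer(A, H, W, num_samples=1, rng=np.random.default_rng(0))
```

When `num_samples` is at least the maximum node degree, the sample covers the whole neighborhood and the result no longer depends on the random generator; smaller values trade accuracy for a bounded cost per node.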

Xu et al. introduced the Graph Isomorphism Network (GIN) [20], which has an expressive spatial message-passing scheme aimed at distinguishing non-isomorphic graph structures. (Two graphs $G_{1}=(V_{1},E_{1})$ and $G_{2}=(V_{2},E_{2})$ are isomorphic if and only if there exists a bijection $f:V_{1}\to V_{2}$ such that $\forall u,v\in V_{1}$, $\{u,v\}\in E_{1}\iff\{f(u),f(v)\}\in E_{2}$.) For any hidden node representation $h_{v}^{(l)}\in H^{l}$, the message-passing is defined as follows:

h_{v}^{(l+1)}=\text{MLP}^{(l)}\left((1+\epsilon^{(l)})\cdot h_{v}^{(l)}+\sum_{u\in\mathcal{N}(v)}h_{u}^{(l)}\right)

(12)

Here, MLP denotes a multilayer perceptron which processes each node's aggregated feature vector, and $\epsilon^{(l)}$ is a learnable parameter that scales the node's own feature vector.
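A compact sketch of the GIN update in Eq. (12), assuming a two-layer MLP with a ReLU in between and no bias terms (the weight names `W1` and `W2` are illustrative, and $\epsilon$ is passed as a fixed number rather than learned):

```python
import numpy as np


def gin_layer(A, H, W1, W2, eps=0.0):
    """GIN update: MLP((1 + eps) * h_v + sum of neighbor features)."""
    agg = (1.0 + eps) * H + A @ H           # sum aggregation keeps multiplicities
    hidden = np.maximum(agg @ W1, 0.0)      # first MLP layer with ReLU
    return hidden @ W2                      # second (linear) MLP layer


# Toy 3-node path graph; identity weights make the aggregation visible.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
out = gin_layer(A, H, np.eye(2), np.eye(2))
print(out)
```

The sum aggregator, unlike the mean, preserves how many neighbors contributed a given feature, which is the property GIN relies on for its discriminative power.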

The spectral ChebNet [29] method uses Chebyshev polynomials to approximate spectral graph convolution. First, we rescale the graph's Laplacian matrix $L$ using its largest eigenvalue, $\lambda_{max}$: $\hat{L}=(2L/\lambda_{max})-I$. Given the approximation order $K$, we compute the Chebyshev terms $Z^{(k)}$ for $k=1,\dots,K$ as follows:

\displaystyle\begin{aligned} \mathbf{Z}^{(1)}&=\mathbf{X}\\ \mathbf{Z}^{(2)}&=\mathbf{\hat{L}}\cdot\mathbf{X}\\ \mathbf{Z}^{(k)}&=2\cdot\mathbf{\hat{L}}\cdot\mathbf{Z}^{(k-1)}-\mathbf{Z}^{(k-2)}\end{aligned}

(13)

Finally, the message-passing function to update hidden representation matrix in layer $l$ , $H^{l}$ , is defined as follows:

\mathbf{H}^{l+1}=\sum_{k=1}^{K}\mathbf{Z}^{(k)}\cdot\mathbf{W}^{(l)}

(14)
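The recurrence in Eq. (13) and the sum in Eq. (14) can be sketched as below. Two assumptions are made: the graph is undirected with at least one edge (so that $\lambda_{max}>0$), and the sketch accepts one weight matrix per polynomial order, which reduces to Eq. (14) as written when all of them are equal. Function names are illustrative.

```python
import numpy as np


def chebyshev_terms(A, X, K):
    """Return [Z^(1), ..., Z^(K)] from the recurrence in Eq. (13)."""
    L = np.diag(A.sum(axis=1)) - A                 # Laplacian L = D - A
    lam_max = np.linalg.eigvalsh(L).max()          # largest eigenvalue of L
    L_hat = 2.0 * L / lam_max - np.eye(A.shape[0])
    Z = [X]                                        # Z^(1) = X
    if K > 1:
        Z.append(L_hat @ X)                        # Z^(2) = L_hat X
    for _ in range(2, K):
        Z.append(2.0 * L_hat @ Z[-1] - Z[-2])      # Z^(k) = 2 L_hat Z^(k-1) - Z^(k-2)
    return Z


def cheb_layer(A, X, weights):
    """Sum the Chebyshev terms, each multiplied by its weight matrix."""
    Z = chebyshev_terms(A, X, len(weights))
    return sum(z @ w for z, w in zip(Z, weights))


# Toy 3-node path graph with one-hot features (illustration only).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
Z = chebyshev_terms(A, X, 3)
out = cheb_layer(A, X, [np.ones((3, 2))] * 3)
```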

Prediction tasks using GNNs can be categorized into node-level, edge-level, and graph-level prediction tasks. Node-level tasks, such as node classification, predict labels of target nodes based on the transformed representations after message-passing. Edge-level tasks include edge classification, where labels are predicted for edges in the graph, and link prediction, where the aim is to predict whether links between nodes should exist based on the node features after message passing. Lastly, graph-level tasks need a global pooling step, which aggregates information from the node and/or edge level into a global representation that can be used to predict graph-level labels. Let us define a graph $G=(V,E)$ with an associated node feature matrix $X$. We can then use any permutation-invariant function to pool the node features into a global representation:

pool(G)=\bigoplus_{v\in V}X(v)

(15)

where $\bigoplus$ is any permutation invariant function (e.g., sum).
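The global pooling of Eq. (15) and its permutation invariance can be illustrated with a short NumPy sketch (the function name `global_pool` and the three supported modes are choices made for this example):

```python
import numpy as np


def global_pool(X, mode="sum"):
    """Pool an (n_nodes, n_features) matrix into one graph-level vector."""
    if mode == "sum":
        return X.sum(axis=0)
    if mode == "mean":
        return X.mean(axis=0)
    if mode == "max":
        return X.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")


# Reordering the nodes must not change the pooled representation.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
X_permuted = X[[2, 0, 1]]
print(global_pool(X, "sum"), global_pool(X_permuted, "sum"))
```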

2.2 GNNs in Histopathology

Graphs have been used in digital pathology since the 1990s [30] and were later combined with classical machine learning algorithms for diagnostic tasks [31]. Throughout the 2010s, GNNs gained popularity and became the primary method for graph-based machine learning tasks. Since the first application of GNNs in histopathology in 2018 [10], their use in the field has grown rapidly, with more than 150 publications by early 2024. GNNs offer several important advantages for modeling histopathological images:

1. GNNs acquire relationship-aware representations. By exchanging information between nodes in the input graph, GNNs learn context-aware representations. This is important in pathology, where meaningful biological structures often depend on the cellular or regional context [32]. Vision transformer models also learn relationship-aware representations, but they compute these relationships between arbitrary patches, whereas GNNs compute them between predefined, biologically relevant entities (e.g., cells).

2. GNNs can learn from the topological information in the WSI. Graphs are a natural way to capture topology. In histopathology, factors like cellular density can be important in diagnosis, and these can be captured in the topological information of the graph structure [33] [34].

3. GNNs model the entire WSI at once. Due to the sheer size of whole slide images, traditional deep learning methods usually split the WSI into image patches and use these as model input. This approach introduces patching bias, as the optimal resolution, size, and stride depend on the problem at hand [35]. GNNs can model the WSI as a graph, which is much smaller than the original image. This allows it to be loaded into memory, effectively capturing the global structure of the WSI [36].

4. GNNs allow for hierarchical modeling. In histopathological image analysis, diagnosis often relies on information acquired from multiple spatial scales of the WSI (e.g., global patterns combined with specific cellular features). GNNs allow for modeling both of these scales in a single model, either by connecting graphs on different scales or by learning the global structure through pooling operations [24] [37].

5. GNNs allow for entity-wise interpretability. Whereas CNN-based methods usually rely on pixel-level explainability, GNNs allow for entity-wise explainability. This allows pathologists to investigate the dependence of the model prediction on certain biological entities, such as cells or substructures in the WSI [38].

6. GNNs allow for injecting task-specific inductive biases. The input graph structure can be modified based on prior information about the task at hand. This in turn allows for more specific explainability and more efficient modeling of the problem [39].

7. GNNs allow for straightforward multimodal integration. Multimodal integration often requires modeling separate modules whose information is fused together to arrive at a final prediction. In GNNs, information can simply be added to the feature vectors associated with the node, edge, or graph, which can then be jointly updated using message passing. This approach is efficient, as no additional model modules are required, and it allows quick injection or removal of information from different modalities [40] [41].

Applying GNNs to histopathology requires a number of design decisions and algorithmic steps (Figure 4). First, we preprocess the WSI (e.g., quality control, stain normalization). Next, either a cell segmentation algorithm is applied, from which a cell graph can be constructed, or patches are extracted, from which a patch graph can be constructed. Using the extracted image entities, a graph is defined with a chosen graph construction algorithm. Once a graph has been defined, it can be used as input for a GNN model. The predictions given by the GNN model can be explained using various GNN explainability methods. We will further explore this typical workflow of GNNs in histopathology in the following sections.

2.2.1 Defining the input graph

For GNNs to be applied to histopathology images, one first needs to define which entities the nodes in the input graph will represent. The majority of GNNs applied to histopathology use one of three types of input graphs, as shown in Figure 2: Cell Graphs, where nodes represent cells or nuclei, segmented using a segmentation algorithm or model (e.g., HoVerNet [42]); Patch Graphs, where nodes represent patches of the image; and Tissue Graphs, where nodes represent larger-scale semantic entities in the image. These tissue graph entities can be acquired from a semantic segmentation map, superpixels (usually generated using the SLIC algorithm [43]), or clustered superpixels, which represent similar regions in the input image. Some alternative approaches also exist; notably, approaches that treat image pixels as nodes and approaches that construct a patch-based hypergraph (a graph in which edges can connect any number of nodes, instead of the pairwise edges seen in regular graphs).

Once the entities for the nodes have been established, one needs to decide how the nodes should be connected. For this, histopathology GNNs usually use one of four graph construction strategies, or combinations of these strategies. First, we can use a simple distance threshold, where we connect all nodes having a pairwise distance (e.g., Euclidean) less than a set threshold $t$. Second, we can use the k-Nearest Neighbor (k-NN) algorithm. Here, we set a parameter $k$, which denotes how many neighbors each node will have. Then, we connect the $k$ closest neighbors of each node to the target node. Note that for both of these approaches, we can base our notion of distance on spatial distance or on the distance between the node-associated feature vectors. Third, we can construct a Region Adjacency Graph (RAG), where we connect all entities that share a border (in patch graphs, this is equivalent to using k-NN with $k=4$ when diagonal neighbors are excluded and $k=8$ when they are included). Typically, this approach is used for patch or tissue graphs, which have a clear border between entities. Lastly, we can use Delaunay triangulation. Here, we form triangles between the nodes such that the circumcircle of each triangle contains no nodes other than the three nodes that form the triangle.
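The first two construction strategies can be sketched with NumPy on node centroids. The sketch assumes Euclidean distance on spatial coordinates and symmetrizes the k-NN graph, so a node may end up with more than $k$ neighbors; region adjacency and Delaunay triangulation are omitted. The function names and the toy coordinates are illustrative.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of node coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def threshold_graph(coords, t):
    """Connect all node pairs whose distance is below the threshold t."""
    A = (pairwise_distances(coords) < t).astype(int)
    np.fill_diagonal(A, 0)                       # no self-loops
    return A


def knn_graph(coords, k):
    """Connect every node to its k nearest neighbors, then symmetrize."""
    D = pairwise_distances(coords)
    np.fill_diagonal(D, np.inf)                  # a node is not its own neighbor
    idx = np.argsort(D, axis=1)[:, :k]           # k closest nodes per row
    A = np.zeros(D.shape, dtype=int)
    A[np.repeat(np.arange(len(coords)), k), idx.ravel()] = 1
    return np.maximum(A, A.T)                    # make the graph undirected


# Hypothetical cell centroids (illustration only).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
A_thr = threshold_graph(coords, t=2.1)
A_knn = knn_graph(coords, k=1)
```

In this toy example the distant fourth centroid stays isolated under the distance threshold but is still connected by k-NN, which shows how the two strategies behave differently in sparse tissue regions.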

2.2.2 Extracting features

To allow a GNN to use the image-based features present in whole slide images, one usually extracts features associated with the entity of the node and attaches these to the node as node features. Similarly, one can also add features to graph edges, which the GNN can use in the message-passing function. Backbones of pretrained (usually on the ImageNet dataset [1]) CNN (e.g., ResNet [45]) or Vision Transformer [5] models are primarily used for node feature extraction: an image patch corresponding to the node entity is processed by the feature extraction model, and the feature vector from an intermediate layer of the model is used as the node features. Sometimes, the feature extraction model is pretrained in a supervised manner on the histopathology images for the problem itself, or fine-tuned for the prediction task at hand, which allows for more problem-specific features. More recently, self-supervised training has been applied for feature extraction, allowing for learning features that generalize better across prediction tasks [46]. Handcrafted features, based on morphology, texture or intensity measurements, can also be used as node features. Furthermore, (spatial) graph features (e.g., node degree) can be calculated on a node, edge or graph level to more directly incorporate topological information in the model prediction.

2.2.3 Graph Neural Network architectures

Most message-passing schemes used in histopathology GNNs are not specific to histopathology. Popular schemes include Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), GraphSAGE, and Graph Isomorphism Networks (GINs). Some approaches designed schemes specific to their problem [47] [48] [49] [39] [50] [51], and lately Graph Transformer models have gained traction as a popular alternative or addition to regular message-passing. In the overall model architectures, many approaches combine message-passing layers with other neural network modules, like transformers, LSTMs, MLPs, and MIL aggregation layers. For graph-level prediction problems, global pooling layers are applied, sometimes combined with sequentially applied local pooling layers which hierarchically coarsen the graph.

2.2.4 Applications

GNNs in histopathology have been applied to a wide variety of tasks, mainly supervised prediction tasks such as survival prediction, region-of-interest (ROI) classification, cancer grading, cancer subtyping, cell classification, and the prediction of treatment response. Some applications aim to predict data in other modalities, such as genetic mutations or (spatial) gene expression. Although most use cases are classification problems, some research has used GNNs for semantic segmentation [52] [53] [54] or nuclei detection [55] [56]. Another interesting application is Content-Based Histopathological Image Retrieval (CBHIR). Here, we first use GNNs to extract and save a graph representation for an ROI in a WSI. When pathologists grade new cases, we can use these embeddings to retrieve similar ROIs, helping in the diagnostic process. Most GNN applications focus on cancer as a disease, with a few exceptions [57] [58] [39] [59] [60] [61] [62].

2.3 Explainability

One major advantage GNNs have over other model types in histopathology is interpretability. The model output can be explained on an entity level and visualized using a graph overlay. For example, one can pool nodes in a cell graph using an attention mechanism, calculate the attention scores for each node, assign a color based on the attention score per node, and then visualize the attention scores on a cellular level when overlaying the graph over the WSI. Many methods for explainability in GNNs have emerged since the inception of the GNN (e.g., GNNExplainer [63], GCExplainer [64]). There have also been efforts to develop explainability methods specific for histopathology GNNs [65] [66] [67] or to use combinations of existing GNN explainability techniques to extract a clinically interpretable model output [68].

3 Methodology

Using Google Scholar, we identified 156 papers applying GNNs to histopathology, published between September 2018, when the first paper applying GNNs to histopathology appeared, and March 2024. We included all papers working with H&E-stained whole slide images or tissue microarrays (TMAs) in which GNNs (i.e., message-passing) were part of the methodology. The papers were categorized based on the following properties:

• Message-passing scheme
• Type(s) of input graph
• Graph construction method
• Feature extraction method
• Application(s)
• Tissue type(s)
• Hierarchy
• Multimodality

We quantified the frequencies in each of these properties to identify emerging trends in the literature (Figure 5).

From our quantification, we identified 4 upcoming trends to explore further:

1. Hierarchical GNNs
2. Adaptive Graph Structure Learning
3. Multimodal GNNs
4. Higher-order graphs

4 Emerging Trends

4.1 Hierarchical GNNs

Diagnostic and prognostic information present on WSIs often exists on multiple levels of coarseness. For example, the cellular microenvironment can be an important diagnostic factor but can depend on where this microenvironment is globally located in the tissue. Cellular graphs are suitable for capturing the microenvironment, but can miss the global tissue information present in the WSI. Similarly, patch- or tissue-based graphs can capture global information in the WSI, but miss the topological information of the cellular structures [69]. To connect the information on different levels of coarseness, we can either apply local pooling layers which learn a hierarchical representation of the input graph in an end-to-end manner, denoted as Learned hierarchy, or we can define the hierarchy between graphs prior to model training, denoted as Pre-established hierarchy. Both are illustrated in Figure 6.

In a learned hierarchy, we apply local pooling layers that iteratively coarsen the graph structure in a hierarchical manner. Let us define our input graph with associated node features as $G_{0}=(V_{0},E_{0},X_{0})$. Assuming that we have $k$ local pooling layers in our GNN architecture, we sequentially coarsen our input graph to $G_{1},G_{2},\dots,G_{k}$, where $G_{k}$ is the final pooled graph representation. Mathematically, we define a local pooling operation that coarsens the graph $G_{i}$ to $G_{i+1}$ as follows:

G_{i+1}=\text{pool}_{i}(G_{i}),\quad\forall i\in\{0,1,2,...,k-1\}

(16)

where $pool_{i}$ is defined by any permutation-invariant pooling function. Prominently used examples include DiffPool [70], SAGPool [71], and MinCutPool [72]. Apart from pure local pooling, we also classify methods that learn the hierarchy using a cross-hierarchical transformer [73] [74] [75] layer as learned hierarchy methods.

Learned hierarchy methods learn a node assignment matrix $S^{(l)}$ which denotes the changes in the graph structure after applying the pooling operation. Often, multiple local pooling layers are applied in sequence to coarsen the graph. One sets a pooling ratio hyperparameter, denoted as $k$, which determines how many nodes should be present after the pooling operation. For any one of these layers, $l$, the pooling operation updates the adjacency matrix of the input graph, $A$, and its corresponding node attributes $X$. The hidden representations are denoted $H$, where $X=H^{0}$. We denote the pooling operation as:

(A^{l+1},H^{l+1})=\text{POOL}(A^{l},H^{l})

(17)

The pooling operation is dependent on the pooling function used. DiffPool [70] applies a graph neural network to learn a differentiable cluster assignment matrix which maps nodes to clusters, which are used as individual nodes after the pooling operation. DiffPool uses two GNNs: one for obtaining node embeddings, $GNN_{l,embed}$ , and one for assigning the nodes to cluster nodes, $GNN_{l,pool}$ . In each DiffPool layer $l$ , we use the embedding GNN for extracting a feature matrix $Z$ :

Z^{l}=GNN_{l,embed}(A^{l},H^{l})

(18)

Then, we calculate the assignment matrix using the pooling GNN:

S^{l}=softmax(GNN_{l,pool}(A^{l},H^{l}))

(19)

Now, we compute both the new hidden node representations and the new adjacency matrix:

	$\displaystyle H^{l+1}=(S^{l})^{T}Z^{l}$		(20)
	$\displaystyle A^{l+1}=(S^{l})^{T}A^{l}S^{l}$		(21)
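The coarsening step of Eqs. (19)-(21) can be sketched in NumPy. To keep the example short, the outputs of the two GNNs are taken as inputs: `Z` stands in for the embedding GNN output of Eq. (18) and `logits` for the pooling GNN output before the softmax. These names and the toy values are assumptions of the sketch.

```python
import numpy as np


def softmax_rows(M):
    """Row-wise softmax, shifted for numerical stability."""
    E = np.exp(M - M.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def diffpool(A, Z, logits):
    """Coarsen a graph given node embeddings Z and assignment logits."""
    S = softmax_rows(logits)      # Eq. (19): soft cluster assignment per node
    H_next = S.T @ Z              # Eq. (20): cluster embeddings
    A_next = S.T @ A @ S          # Eq. (21): cluster connectivity
    return A_next, H_next, S


# Toy 4-node path graph pooled into 2 clusters (illustration only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = np.arange(8, dtype=float).reshape(4, 2)
logits = np.array([[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]])
A_next, H_next, S = diffpool(A, Z, logits)
```

Since every row of $S$ sums to one, the total edge weight of the graph is preserved by the coarsening; edges inside a cluster end up on the diagonal of the pooled adjacency matrix.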

Self-Attention Graph Pooling (SAGPool) [71] uses the self-attention mechanism to learn which nodes are important and to discard unimportant ones. First, we calculate the self-attention score using a graph convolution operation:

H^{l+1}=\sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{l}W^{l})

(22)

Here, $W^{l}$ is a learned weight matrix which we use to calculate the attention score. For each node $v\in V$ , we calculate:

\alpha^{l}_{i}=softmax(W^{l}\cdot h^{l}_{i})

(23)

where $h^{l}_{i}$ is the feature embedding of $v_{i}$. SAGPool then ranks the nodes by their attention scores and selects the top-$k$ nodes to retain. Based on the retained nodes, a mask $H_{mask}$ is constructed and multiplied element-wise with the original adjacency matrix to coarsen the graph: $A^{l+1}=A\odot H_{mask}$.
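The top-$k$ selection and masking step can be sketched as follows. The attention scores are passed in directly as a stand-in for the graph-convolution scores, the pooling ratio is interpreted as the fraction of nodes to keep, and the masked rows and columns are dropped afterwards to obtain the smaller graph; all of these are assumptions of this sketch.

```python
import numpy as np


def sag_pool(A, H, scores, ratio):
    """Keep the highest-scoring nodes and mask the adjacency accordingly."""
    n = A.shape[0]
    k = max(1, int(np.ceil(ratio * n)))          # number of nodes to retain
    keep = np.sort(np.argsort(scores)[-k:])      # indices of the top-k scores
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    A_masked = A * np.outer(mask, mask)          # A element-wise times the mask
    # Drop the masked-out rows and columns to obtain the coarsened graph.
    return A_masked[np.ix_(keep, keep)], H[keep], keep


# Toy 4-node path graph with hand-picked scores (illustration only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.arange(8, dtype=float).reshape(4, 2)
scores = np.array([0.1, 0.9, 0.8, 0.2])
A_next, H_next, keep = sag_pool(A, H, scores, ratio=0.5)
```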

Lastly, MinCutPool [72] uses the mincut partition objective function to decide the assignment matrix $S$ . Similarly to the DiffPool method, we first generate a GNN-based node feature matrix $H^{l+1}$ :

H^{l+1}=GNN(H^{l},A^{l},W^{l}_{GNN})

(24)

where $W^{l}_{GNN}$ is the learned weight matrix of the GNN. Using the updated representation, we can use a multilayer perceptron (MLP) to calculate the node assignment matrix $S$ :

S=MLP(H^{l+1},W^{l}_{MLP})

(25)

where $W^{l}_{MLP}$ are the learned weights of the MLP. Both $W^{l}_{GNN}$ and $W^{l}_{MLP}$ are trained by minimizing two loss terms: $L_{c}$, the cut loss term, and $L_{o}$, the orthogonality loss term. The cut loss term approximates the mincut objective by aiming to minimize the number of edges between clusters while maximizing the number of edges within clusters. The orthogonality loss term encourages orthogonal cluster assignments and similarly sized clusters. Together, these terms form the objective loss $L_{u}$:

L_{u}=L_{c}+L_{o}=-\frac{\text{Tr}(S^{\top}\tilde{A}S)}{\text{Tr}(S^{\top}\tilde{D}S)}+\left\|\frac{S^{\top}S}{\|S^{\top}S\|_{F}}-\frac{I_{K}}{\sqrt{K}}\right\|_{F}

(26)

where $\tilde{D}$ is the degree matrix of the normalized adjacency matrix $\tilde{A}$, $I_{K}$ is the $K\times K$ identity matrix, $\|\cdot\|_{F}$ denotes the Frobenius norm, and $K$ is the number of desired clusters.

The pooling operation is performed as follows:

	$\displaystyle A^{l+1}$	$\displaystyle=S^{T}\tilde{A}S$		(27)
	$\displaystyle H^{l+1}$	$\displaystyle=S^{T}H$		(27)
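A NumPy sketch of the MinCutPool step is given below. It uses the loss as formulated in the MinCutPool paper [72], with a cut term $-\text{Tr}(S^{\top}AS)/\text{Tr}(S^{\top}DS)$ and an orthogonality term $\|S^{\top}S/\|S^{\top}S\|_{F}-I_{K}/\sqrt{K}\|_{F}$. The assignment matrix is passed in directly instead of being produced by an MLP, and whichever adjacency matrix is supplied (normalized or not) is used as is; both are simplifications made for this example.

```python
import numpy as np


def mincut_pool(A, H, S):
    """Return the pooled adjacency, pooled features and the MinCutPool loss."""
    D = np.diag(A.sum(axis=1))                   # degree matrix of the given A
    K = S.shape[1]                               # number of clusters
    # Cut term: reward edges inside clusters relative to cluster volume.
    cut_loss = -np.trace(S.T @ A @ S) / np.trace(S.T @ D @ S)
    # Orthogonality term: push S^T S toward a scaled identity (Frobenius norm).
    StS = S.T @ S
    ortho_loss = np.linalg.norm(StS / np.linalg.norm(StS) - np.eye(K) / np.sqrt(K))
    A_next = S.T @ A @ S                         # pooled adjacency
    H_next = S.T @ H                             # pooled node features
    return A_next, H_next, cut_loss + ortho_loss


# Two disconnected node pairs with a perfect hard assignment (illustration only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.arange(8, dtype=float).reshape(4, 2)
S = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
A_next, H_next, loss = mincut_pool(A, H, S)
```

For this idealized input no edge is cut and the clusters are orthogonal and equally sized, so the cut term reaches its minimum of $-1$ and the orthogonality term vanishes.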

4.1.1 Learned hierarchy

As Table 1 shows, the vast majority of GNN applications in histopathology use existing local pooling functions such as in the examples above. In this section, we give some examples of newly designed learned hierarchy methods, specifically for problems in histopathology.

Local Pooling: Hou et al. [49] proposed Iterative Hierarchical Pooling (IHPool), which they combined with a pre-established hierarchy. As input, the authors used a pyramidal heterogeneous patch graph, with one graph at 10x resolution, one at 5x resolution, and one at thumbnail resolution. Features were generated using KimiaNet. IHPool was designed to filter redundant information for the downstream prediction task while retaining this pyramidal structure when applying the pooling operation. The method achieves this by conditioning the set of nodes to be pooled on each resolution level on the pooling outcome of the lower-resolution nodes. Let $X$ be a matrix of node features, $A$ be the adjacency matrix of the input graph, $k$ be the ratio of nodes to retain after pooling and $P$ be a learnable projection matrix. Now, let us denote the input graph $G=(V,E,R)$, where $R$ represents the set of different resolutions in the graph. For each resolution $r\in R$, patches at resolution $r$ are represented as nodes. The nodes are pooled hierarchically, such that nodes in higher magnification levels are subordinate to nodes in lower levels. For all nodes, a fitness score is calculated, and nodes are assigned to clusters based on spatial distance and fitness difference between nodes. Specifically, for each node $n\in V$ at resolution $r$, we use the projection matrix $P$ to calculate the fitness score as follows:

\phi_{n}^{r}=\tanh\left(\frac{{V_{n}^{r}\cdot P}}{{||P||}}\right)

(28)

where $V_{n}^{r}$ is the set of nodes to be pooled, based on the hierarchical edges between resolutions. Based on the calculated node assignments, we create a new node feature matrix $X^{\prime}$ . The adjacency matrix $A^{\prime}$ is updated to maintain graph connectivity based on the node assignments.
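The fitness score in Eq. (28) can be sketched as follows. The feature values and the fixed projection vector `P` are illustrative assumptions (in IHPool, $P$ is learned), and the `retain_top_ratio` helper is a simplified stand-in for the cluster assignment step, which in the original method also depends on spatial distance and on the hierarchy between resolutions.

```python
import math

def fitness_scores(node_feats, proj):
    """Eq. (28): tanh of each node's projection onto P, scaled by ||P||."""
    norm = math.sqrt(sum(p * p for p in proj))
    return [math.tanh(sum(x * p for x, p in zip(feat, proj)) / norm)
            for feat in node_feats]

def retain_top_ratio(scores, ratio):
    """Simplified stand-in for the clustering step: keep the best-scoring nodes."""
    n_keep = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 2.0]]
P = [3.0, 4.0]  # would be learned in practice; fixed here for illustration
phi = fitness_scores(feats, P)
kept = retain_top_ratio(phi, ratio=0.5)  # indices of the retained nodes
```

Because of the $\tanh$, every score lies in $(-1,1)$, so scores remain comparable across resolution levels regardless of the scale of the node features.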

Wang et al. [76] proposed a new module for pooling information from cell graphs to use as embeddings for clusters of cells, called cell community forests. The authors first applied DBSCAN clustering to cell embeddings, clustering the cells based on their density. The hierarchical relationships between the cellular structures are captured by organizing the clusters into nested relationships based on their density (i.e., each dense cellular cluster is nested within a sparser, larger cluster). Cellular features are pooled hierarchically up to the sparsest cluster level and then processed by an LSTM module to construct the graph embedding for downstream predictions.

Zhao et al. [77] proposed an extension of the popular MinCutPool by adding a message-passing layer to the pooling equation. To obtain the cluster assignment matrix $S$ , where each column of $S$ corresponds to a single node in the coarsened graph, the authors used the following equation:

S=H(\sigma(\hat{A}HW_{pool}))

(29)

where $\hat{A}$ is the Laplacian-normalized adjacency matrix, $H$ denotes the hidden representation matrix of the nodes, $W_{pool}$ denotes a learnable pooling weight matrix and $\sigma$ denotes a nonlinear activation function (e.g., ReLU).

Attention-based Interaction Modeling: Azadi et al. [75] proposed two attention-based methods for exchanging information between different levels of graph coarsity. The authors used a local graph, where nodes represent patches in the WSI, and a global graph, where nodes represent MinCutPool-based clusters of nodes in the local graph. Attention scores are then calculated for each node in the local and global graphs. The first method the authors proposed for exchanging information between the local and global graphs was Mixed Co-Attention (MCA), in which the information is not mixed directly, but weight sharing is applied between parallel processing of the local and global nodes. The second method was Mixed Guided Attention, which expanded on the idea of MCA by directly infusing the calculated local node feature representation into the attention score calculation of the global nodes. The authors found that the mixed co-attention strategy worked optimally for their use case.

Alternative Approaches: Ding et al. [78] did not learn a hierarchical representation using pooling layers, instead using a FractalNet architecture. Here, the input graph is given to separate processing paths which consist of different numbers of GNN layers, thereby representing different semantic levels in the tissue. The hierarchy between the paths is encoded using a combination of a gated bimodal unit and an MLP mixer architecture. The former calculates a weighted combination of representations, while the latter enhances communication between the path representations and strengthens connections among different path features.

Li et al. [79] propose a hierarchical Graph V-Net to encode hierarchy in a patch graph input. First, attention-based message-passing is used to exchange information between adjacent patches. Then, the authors used a graph coarsening operation where the node features are arranged as a 2D grid based on the spatial location of the patches. This grid is then evenly divided into submatrices and each submatrix is projected to a single feature vector using a learnable layer, which will act as a node after the coarsening operation. Notably, the Graph V-Net also uses graph upsampling layers, which add nodes until the size of the input graph has been restored, similar to what is done in UNet-architectures.
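The grid-based coarsening described above can be sketched as follows. The block size, the toy grid, and the shared weight matrix are assumptions for illustration; here the weights simply average each tile, whereas the Graph V-Net learns this projection.

```python
def coarsen_grid(grid, block, weights):
    """Split an H x W grid of feature vectors into block x block tiles and
    project each flattened tile to one vector with a shared weight matrix."""
    rows, cols = len(grid), len(grid[0])
    coarse = []
    for r in range(0, rows, block):
        coarse_row = []
        for c in range(0, cols, block):
            # Flatten all feature vectors of the tile into a single vector.
            tile = [x for i in range(r, r + block) for j in range(c, c + block)
                    for x in grid[i][j]]
            coarse_row.append([sum(w * x for w, x in zip(w_row, tile))
                               for w_row in weights])
        coarse.append(coarse_row)
    return coarse

# 2 x 4 grid of patches, each with a one-dimensional feature (toy values).
grid = [[[1.0], [2.0], [3.0], [4.0]],
        [[5.0], [6.0], [7.0], [8.0]]]
W = [[0.25, 0.25, 0.25, 0.25]]  # hypothetical weights: one output unit averaging a 2 x 2 tile
coarse = coarsen_grid(grid, block=2, weights=W)
```

Each entry of `coarse` acts as one node of the coarsened graph; an upsampling layer would reverse this step by mapping each coarse node back to its tile of patches.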

Publication	Date	Application	Learned hierarchy method
Zheng et al. [37]	2019/10	CBHIR	DiffPool
Zhou et al. [80]	2019/10	Cancer grading	DiffPool
Sureka et al. [38]	2020/10	Binary classification	DiffPool
Zheng et al. [81]	2020/12	CBHIR	DiffPool
Chen et al. [25]	2020/09	Survival prediction, Cancer grading	SAGPool
Jiang et al. [82]	2021/01	Cancer grading	DiffPool
Zheng et al. [83]	2021/04	CBHIR	DiffPool
Wang et al. [84]	2021/09	Survival prediction	SAGPool
Xiang et al. [85]	2021/10	Binary classification	DiffPool
Xie et al. [86]	2022/01	Treatment response prediction	TopKPooling
Dwivedi et al. [87]	2022/04	Cancer grading	SAGPool
Hou et al. [49]	2022/06	Binary classification	IHPool
Bai et al. [88]	2022/08	Cancer subtyping	MinCutPool
Zuo et al. [89]	2022/09	Survival prediction	SAGPool
Hou et al. [73]	2022/09	Cancer subtyping	Hierarchical attention mechanism
Lim et al. [90]	2022/10	Survival prediction	SAGPool
Wang et al. [76]	2023/02	Cancer subtyping	Scattering Cell Pooling
Zhao et al. [77]	2023/02	Cancer subtyping, Cancer grading	GCMinCut
Ding et al. [78]	2023/02	Cancer subtyping, Cancer grading	Fractal paths
Ding et al. [91]	2023/04	Survival prediction	SAGPool
Li et al. [79]	2023/09	Node classification	Graph V-Net
Gallagher-Syed et al. [59]	2023/09	Rheumatoid arthritis subtyping	SAGPool
Shi et al. [74]	2023/09	Cancer subtyping, mutation prediction	Hierarchical attention mechanism
Wu et al. [92]	2023/10	Survival prediction	SAGPool
Nakhli et al. [50]	2023/10	Survival prediction	SAGPool
Azadi et al. [75]	2023/10	Survival prediction	MinCutPool, Hierarchical attention mechanism
Hou et al. [93]	2023/10	Survival prediction	Matrix multiplication
Abbas et al. [94]	2023/12	Cancer grading	DiffPool
Xu et al. [95]	2023/12	Cancer subtyping	DiffPool
Azher et al. [96]	2024/01	Cancer grading, Survival prediction	SAGPool
Yang et al. [97]	2024/03	Binary classification, Survival prediction	MinCutPool

Table 1: Publications applying GNNs to histopathology which used learned hierarchies

4.1.2 Pre-established hierarchy

In pre-established hierarchy, we encode the hierarchy prior to model training. For example, we can construct multiple graphs at different levels of coarsity in the WSI, and connect them using assignment matrices, which denote how the nodes are connected between the hierarchical levels. During message-passing, the learned representations of the lower hierarchy level are aggregated and used as input for the corresponding nodes at the higher hierarchy level. We differentiate between approaches connecting graphs on different semantic levels (e.g., cells and tissues), and approaches connecting different magnifications of the WSI (e.g., 40x, 20x). An overview of publications using this approach is given in Table 2.

Semantic Hierarchies: Pati et al. [24] were the first to introduce a pre-established hierarchy in the graph to use as input for a GNN model. They constructed a cell graph, $CG$ , using a nuclei segmentation map and a tissue graph, $TG$ , constructed by clustering superpixels into larger tissue areas based on similarity. To model the hierarchy, they introduced an assignment matrix $S_{CG\to TG}$ , such that $S_{CG\to TG}(i,j)=1$ if a cellular node $i$ from the cell graph belongs to tissue node $j$ in the tissue graph.
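A minimal sketch of how such an assignment matrix can connect the two levels is given below. The cell-to-region mapping, the feature values, and the use of a sum as the aggregation function are assumptions for illustration, not details taken from Pati et al. [24].

```python
def build_assignment(cell_to_tissue, n_tissue):
    """S[i][j] = 1 if cell i lies in tissue region j, else 0."""
    return [[1 if region == j else 0 for j in range(n_tissue)]
            for region in cell_to_tissue]

def aggregate_cells(assign, cell_feats):
    """Sum the features of all cells assigned to each tissue node (S^T H_cell)."""
    n_tissue, dim = len(assign[0]), len(cell_feats[0])
    out = [[0.0] * dim for _ in range(n_tissue)]
    for row, feat in zip(assign, cell_feats):
        for j, member in enumerate(row):
            if member:
                out[j] = [a + b for a, b in zip(out[j], feat)]
    return out

cell_to_tissue = [0, 0, 1, 1, 1]  # hypothetical tissue region index per cell
cell_feats = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
S = build_assignment(cell_to_tissue, n_tissue=2)
tissue_in = aggregate_cells(S, cell_feats)  # input features for the tissue nodes
```

The aggregated vectors in `tissue_in` can then be combined with the tissue node features before message passing on the tissue graph.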

Wang et al. [84] introduced hierarchy by applying separate message-passing operations on both a cell graph and a patch graph. As patch-level features, the authors used cellular node representations pooled over the cells located in each patch. The authors combined hierarchy learning with pre-established hierarchy by also applying self-attention graph pooling on both the cell graph and the patch graph.

Sims et al. [98] connected a cell graph with a level-1 and level-2 patch graph, which represent patches of increasing size (400 $\mu$ m, 800 $\mu$ m). They define their message passing for any cellular node $i$ as $CG_{i}\longrightarrow L1_{i}\longrightarrow L2_{i}\longrightarrow L1_{i}\longrightarrow CG_{i}$ , where each $\longrightarrow$ defines a message-passing function, $CG_{i}$ represents the node in the cell graph and $L1_{i}$ , $L2_{i}$ represent the nodes corresponding to the level-1 patch and the level-2 patch in which this cell lies. By applying message-passing in this way, the model can exchange information between distant cells without using many message-passing layers, as the cellular nodes belonging to the same level-2 node can be 800 $\mu$ m apart.

Guan et al. [99] proposed a Node-aligned hierarchical graph-to-local clustering approach, inspired by the Bag-Of-Visual-Words (BOVW) methodology in Computer Vision. Starting with a set of H&E stained WSIs, the authors first clustered the patches of each WSI into visual word bags, where each bag is denoted $B$ . A local clustering approach is used that samples global clusters from each bag $B$ into local subclusters using K-means. These subclusters represent a codebook of ’visual words’ representing tissues with different properties. This codebook can be used to categorize patches in input WSIs into subclusters, from which a graph can be constructed. This is achieved by connecting the patches in each subcluster using inner-sub-bag edges, and the subclusters themselves using outer-sub-bag edges. This graph structure allows hierarchical modeling of WSIs by applying message-passing between the patches in each subcluster to retrieve representations which are pooled on a subgraph level. Subsequently, message-passing is performed between the pooled subcluster representations themselves.

Hou et al. [73] proposed constructing a cell graph along with superpixel-based tissue graphs at two levels ( $CG,TG_{l1},TG_{l2}$ ). They generated features for the cell graph by using a pretrained ResNet on a patch around the nucleus centroid, while generating tissue graph representations by averaging ResNet-embeddings from all crops belonging to a superpixel. The hierarchical information flow is modeled using a Transformer block that calculates the cross-attention between the graphs at different levels.

Shi et al. [100] used graphs at four different levels of hierarchy: a tissue graph on 5x resolution, consisting of superpixels constructed using the SLIC algorithm [43], and 3 patch graphs at 5x, 10x, and 20x resolution, respectively. The 5x resolution patch graph is used to generate features for the tissue graph. Then, after applying message-passing to the 10x- and 20x patch graphs, the interaction between the different hierarchical levels is modeled using a hierarchical attention module. This module produces a tissue graph where the interactions are captured in the node features. Message-passing layers, global attention layers and a fully connected layer are applied subsequently to the tissue graph to come to a final prediction.

Gupta et al. [101] modeled a tissue graph and a cell graph together as a heterogeneous graph with cellular nodes, tissue nodes, cell-cell edges, tissue-tissue edges, and cell-tissue edges: $H=\{C,T,E_{cell\to cell},E_{tissue\to tissue},E_{cell\to tissue}\}$ . After applying message-passing layers, they calculated the cross-attention between the cellular and tissue nodes using the transformer architecture to model the hierarchical relationships.

Abbas et al. [94] established four separate hierarchical levels, where one level is a global image analyzed using a CNN model and the other levels are cell graphs constructed at different semantic hierarchy levels: a global graph spanning the entire WSI ( $G^{(0)}$ ), a graph using patches of size $512\times 512$ px ( $G^{(1)}$ ), and a graph using patches of size $256\times 256$ px ( $G^{(2)}$ ). For each level, a subset of the segmented cells is randomly selected to build a cell graph. After applying message-passing layers on each level separately, the outputs are combined and processed using a fully connected layer. The combined representation and the representations gathered at each cell graph level separately are combined using an entropy weighting strategy, which weights the different representations based on the uncertainty of the model prediction given that representation.

Multiresolution Hierarchies: Xing et al. [102] constructed hierarchical patch graphs at several levels of image resolution, thus aggregating information from multiple resolution levels. Starting with a single patch, they subsampled the same patch at increasingly lower resolution, and connected each lower-resolution patch to the corresponding higher-resolution patch it was sampled from. This input graph was then used for a GNN model.

Bazargani et al. [103] introduced hierarchy into their approach by constructing separate patch graphs on 5x, 10x and 20x resolution and then performing message-passing operations both on each graph separately as well as between graphs with different resolutions.

Bontempo et al. [104] used a knowledge distillation approach combined with two patch graphs at different resolutions (high, low). They performed message-passing both hierarchically between high and low resolution and in each resolution graph itself. They treated the high-resolution graph as a ’teacher’ and the low-resolution graph as a ’student’ network, between which they optimize the KL-divergence for the bag-level predictions at each resolution.
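As a small worked example of the distillation term, the snippet below computes the KL-divergence between two bag-level class distributions. The logits are toy values, and the direction of the divergence (teacher relative to student) is an assumption made for this sketch rather than a detail taken from the original work.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 0.5])  # bag-level logits from the high-resolution graph (toy values)
student = softmax([1.0, 1.0])  # bag-level logits from the low-resolution graph (toy values)
loss = kl_divergence(teacher, student)
```

The divergence is zero only when both resolutions produce the same bag-level distribution, so minimizing it pulls the low-resolution predictions towards the high-resolution ones.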

Mirabadi et al. [23] proposed modeling the pyramidal multi-magnification structure in whole slide images as a multiresolution graph, where information on both the inter-magnification and the intra-magnification levels could be modeled. They extracted patches from three magnification levels (20x, 10x and 5x), such that the patches on the higher resolutions are spatially equivalent to center crops of the patches at the lower resolutions. A RAG was constructed such that nodes on each level were connected to both their adjacent neighbors on the same resolution and the spatially corresponding lower- and higher-level patch nodes. This allowed information to be exchanged between resolutions during message passing. After message passing, a mean pooling operation was applied on each resolution level, resulting in a three-node graph. This three-node graph embedding is then used for the downstream classification task.

Publication	Date	Application	Hierarchy
Pati et al. [24]	2020/07	ROI classification	$CG\to TG$
Xing et al. [102]	2021/08	Cancer subtyping	$PG_{40x}\to PG_{10x}\to PG_{5x}$
Wang et al. [84]	2021/09	Survival prediction	$CG\to PG$
Sims et al. [98]	2022/01	ROI classification	$CG\to PG_{1}\to PG_{2}$
Hou et al. [49]	2022/06	Binary classification	$PG_{10x}\to PG_{5x}\to PG_{thumbnail}$
Guan et al. [99]	2022/06	Cancer subtyping	$S_{k}\to K_{G}\to B$
Hou et al. [73]	2022/09	Cancer subtyping	$CG\to TG_{l1}\to TG_{l2}$
Shi et al. [100]	2023/01	Cancer grading	$PG_{20x}\to PG_{10x}\to TG_{5x}$
Wang et al. [76]	2023/02	Cancer subtyping	$CG\to CCFG$
Gupta et al. [101]	2023/07	Cancer subtyping, binary classification	$CG\to TG$
Bazargani et al. [103]	2023/08	Cancer subtyping	$PG_{20x}\to PG_{10x}\to PG_{5x}$
Bontempo et al. [104]	2023/10	Binary classification	$PG_{high}\to PG_{low}$
Abbas et al. [94]	2023/12	Cancer grading	$CG_{256px}\to CG_{512px}\to CG_{global}\to WSI_{thumbnail}$
Mirabadi et al. [23]	2024/02	Cancer subtyping	$PG_{20x}\to PG_{10x}\to PG_{5x}$

Table 2: Publications applying GNNs to histopathology which used a pre-established hierarchy. All hierarchies are shown small to large, such that when $X\to Y$ , entities in $X$ are subordinate to the entities in $Y$ . CG: Cell Graph, PG: Patch Graph, TG: Tissue Graph.

4.2 Adaptive Graph Structure Learning

Most GNN applications in histopathology use a fixed input graph with fixed edge connectivity. While successful results have been achieved using this approach, we argue that it is suboptimal. Whether connections between nodes should exist is not clearly defined in the histopathology image, leading to the wide range of different approaches for constructing the input graphs, as previously discussed. These approaches are usually not based on biological or medical information, and thus introduce inductive bias which might not reflect the biology in the tissue. To counteract this problem, one can either adjust the message-passing equation such that some edges are given more representative power than others (e.g., using GAT [19]), or one can make the graph construction a learnable transformation. The second approach, Adaptive Graph Structure Learning (AGSL), has gained more popularity recently (Table 3). In GNNs for histopathology, the AGSL strategy employs either a learned transformation that updates the adjacency matrix or learned convolutional filters that dynamically construct the graph.

Learned Transformation: In 2020, Adnan et al. [36] introduced adaptive graph learning for the classification of lung cancer subtypes. The authors modeled the whole slide image as a fully connected graph of representative patches. Then, they used a pre-trained DenseNet for feature extraction. The graph connectivity is learned end-to-end using both global WSI context and local pairwise context between patches. Let us denote WSI $W$ with patches $w_{1},...,w_{n}$ , where for each patch $w_{i}$ we have a feature vector $x_{i}$ . The authors first pooled the patch representations into a global context vector $c$ using a pooling function $\phi$ (e.g., sum):

c=\phi(x_{1},x_{2},...,x_{n})

(30)

The global vector $c$ is concatenated to each patch feature vector $x_{i}$ and jointly processed by MLP layers, which gives a feature vector $x_{i}^{*}$ that contains both local and global context information. Finally, the matrix $X^{*}$ , which holds all feature vectors $x^{*}$ , is processed using a cross-correlation layer which determines the connectivity of the output graph in $A$ , where each element $a_{ij}\in A$ represents the correlation between patches $w_{i}$ and $w_{j}$ and is used as an edge weight in the learned graph structure. The learned graph can be used for any downstream task and has shown better performance than other (graph-based) MIL methods available at the time.
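The pipeline described above can be sketched in a few lines. This is an illustration under stated assumptions: a single ReLU layer with fixed, hypothetical weights stands in for the MLP, and pairwise inner products stand in for the cross-correlation layer; neither is claimed to match the implementation of Adnan et al. [36].

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Apply a weight matrix (list of rows) to vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def learn_adjacency(patch_feats, weights):
    # Global context c: sum pooling over all patch features (Eq. 30 with phi = sum).
    context = [sum(col) for col in zip(*patch_feats)]
    # Concatenate c to every patch vector and apply one ReLU layer (stand-in for the MLP).
    fused = [relu(linear(weights, feat + context)) for feat in patch_feats]
    # Pairwise inner products serve as edge weights a_ij (stand-in for cross-correlation).
    return [[sum(a * b for a, b in zip(u, v)) for v in fused] for u in fused]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy patch features
# Hypothetical 2 x 4 weight matrix; in practice these weights are trained end-to-end.
W = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5]]
A = learn_adjacency(X, W)
```

Because the edge weights are a differentiable function of `W`, gradients from the downstream task can reshape the graph during training, which is the core idea of a learned transformation.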

Hou et al. [73] described a spatial-hierarchical GNN framework that could dynamically learn the graph structure during model training. Their Dynamic Structure Learning module first embeds the representation of both node features $V$ and nuclear centroid coordinates $P$ together into a single representation $J$ , using the following equation:

J=Concat[\sigma(P^{T}W_{1}),\sigma(V^{T}W_{2})]

(31)

Where $W_{1}$ and $W_{2}$ are learned weight matrices and $\sigma$ denotes a non-linear activation function. Next, the authors applied a distance-thresholded k-NN algorithm on the acquired embedding $J$ to determine the edge connectivity. Given a set of nodes $V$ , set of edges $E$ , distance threshold $d_{\text{min}}$ and the number of neighbors $k$ , we use the following equation to determine the edges in $E$ :

e_{uv}\in E\iff\{u,v\in V\mid||v-u||_{2}\leq\min(d_{k},d_{\text{min}})\}

(32)

Here, $d_{k}$ denotes the distance between node $u$ and its $k$ -th closest neighbor.
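The edge rule in Eq. (32) can be illustrated with a short sketch. The toy two-dimensional embeddings and the function name are assumptions for illustration; edges are stored as directed pairs because $d_{k}$ is defined per source node $u$.

```python
import math

def thresholded_knn_edges(points, k, d_min):
    """Directed edges (u, v) with ||v - u||_2 <= min(d_k(u), d_min), as in Eq. (32)."""
    edges = set()
    for u, pu in enumerate(points):
        dists = sorted((math.dist(pu, pv), v) for v, pv in enumerate(points) if v != u)
        d_k = dists[min(k, len(dists)) - 1][0]  # distance to the k-th closest neighbor
        limit = min(d_k, d_min)
        edges.update((u, v) for d, v in dists if d <= limit)
    return edges

# Toy embeddings J (2-D for readability); the last point is an isolated outlier.
J = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
E = thresholded_knn_edges(J, k=2, d_min=2.0)
```

In this example, the outlier receives no edges at all: its two nearest neighbors lie beyond the distance threshold, which shows how the threshold prevents spurious long-range connections that a plain k-NN graph would create.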

Liu et al. [105] propose learning the graph structure based on the cosine similarity between the transformed patch feature vectors. Given an input feature matrix $X$ and a transformation matrix $T$ , a projected matrix $P=XT$ is created. The cosine similarities between each pair of patches in $P$ are then stored in a symmetric adjacency matrix $A_{L}$ , which holds the ’edge strength’ between any two patches in $P$ . The edge strength is then thresholded using a set threshold $\epsilon$ :

e_{uv}\in E\iff\{u,v\in V\mid\frac{{P[u]\cdot P[v]}}{{\|P[u]\|\cdot\|P[v]\|}}\geq\epsilon\}

(33)

where $P[u]$ and $P[v]$ denote the projected feature vectors of nodes $u$ and $v$ , respectively. Note that the transformation matrices are learned, which allows the graph structure to be adapted during model training.
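A minimal sketch of this similarity-based construction is shown below. It assumes that an edge is kept when the cosine similarity is at least $\epsilon$ (the usual reading of thresholding an edge strength), that all projected vectors are non-zero, and it uses toy values for $X$ and $T$ ; in the original method $T$ is learned.

```python
import math

def project(X, T):
    """P = X T for X of shape (n, d) and T of shape (d, d')."""
    return [[sum(x * t for x, t in zip(row, col)) for col in zip(*T)] for row in X]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def similarity_graph(X, T, eps):
    """Undirected edges (u, v), u < v, whose projected cosine similarity is >= eps."""
    P = project(X, T)
    n = len(P)
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if cosine(P[u], P[v]) >= eps}

X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy patch features
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[2.0, 0.0], [0.0, 1.0]]      # hypothetical learned transformation

edges_identity = similarity_graph(X, identity, eps=0.5)
edges_stretched = similarity_graph(X, stretched, eps=0.5)
```

The two calls produce different edge sets from the same features, which illustrates how updating the transformation matrix during training changes the graph structure.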

CNN-filter Based: Gao et al. [106] and Ding et al. [78] both use a very different approach, where the learned feature maps generated by a CNN are used as the basis for the graph construction. More specifically, they treat the units in each feature map as nodes, in which the features are spatially concatenated across channels into a node feature vector. After this concatenation, the k-NN algorithm is used to connect the nodes. By basing the graph structure on learned CNN feature maps, the graph structure is learned by training the CNN and, since each unit in the feature maps corresponds to a spatial region in the input image, the constructed graph can capture spatial dependencies between regions in the WSI. Given the acquired node embedding matrix $X\in\mathbb{R}^{N\times C}$ where $N$ is the number of nodes and $C$ the number of channels, we determine the existence of edges as follows:

e_{uv}\in E\iff\{u,v\in V\mid||u_{f}-v_{f}||_{2}\leq d_{k}\}

(34)

Where $u_{f}$ , $v_{f}$ are the feature vectors of nodes $u$ and $v$ , and $d_{k}$ is the distance between node $u$ and the $k$ -th closest neighbor of $u$ .
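The construction in Eq. (34) can be sketched as follows, assuming a toy feature map stored as nested lists of shape $C\times H\times W$ . The helper names are hypothetical, and ties at distance $d_{k}$ are broken by node index here, whereas the inequality in Eq. (34) would keep all tied neighbors.

```python
import math

def feature_map_to_nodes(fmap):
    """Turn a C x H x W feature map into H*W node vectors of length C."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][i][j] for c in range(channels)]
            for i in range(height) for j in range(width)]

def knn_edges(nodes, k):
    """Directed edges from each node to its k nearest neighbors in feature space."""
    edges = set()
    for u, fu in enumerate(nodes):
        ranked = sorted((math.dist(fu, fv), v) for v, fv in enumerate(nodes) if v != u)
        edges.update((u, v) for _, v in ranked[:k])
    return edges

# Toy feature map with 2 channels and a 2 x 2 spatial grid.
fmap = [[[0.0, 0.1], [5.0, 5.2]],
        [[0.0, 0.0], [1.0, 1.0]]]
nodes = feature_map_to_nodes(fmap)
edges = knn_edges(nodes, k=1)
```

Since the node vectors come directly from the CNN activations, any update to the CNN weights changes the distances and therefore the edges, without a separate structure-learning module.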

Publication	Date	Application	Adaptive learning mechanism
Adnan et al. [36]	2020/05	Binary classification	Learned transformation
Gao et al. [106]	2022/02	Cancer subtyping	CNN-filter based
Hou et al. [73]	2022/09	Cancer subtyping	Learned transformation
Ding et al. [78]	2023/02	Cancer subtyping, Cancer grading	CNN-filter based
Liu et al. [105]	2023/04	Survival prediction	Learned transformation

Table 3: Publications applying GNNs in histopathology and using adaptive graph structure learning strategies.

4.3 Multimodal GNNs

In histopathology diagnostics, different modalities are often combined to assist in clinical decision-making and prognostic predictions. While most applications of GNNs in histopathology focus solely on H&E image data, approaches considering multiple modalities have gained popularity recently. Combining data from multiple modalities can help increase model accuracy and generalization. Graph Neural Networks are especially suitable for multimodal integration, as data from different modalities can be easily combined in the node and edge feature vectors [107]. In the last few years, multiple approaches combined IHC-stained biopsy images with H&E-stained biopsy images, while other approaches incorporated spatial transcriptomics or genetic data in the model input. We differentiate between Stain multimodality, where the same whole slide images with different stainings (e.g., IHC) are combined, and Full multimodality, where the modalities are not based on WSIs (e.g., CT-scans, gene expression data). An overview of the multimodal GNNs in histopathology is given in Table 4.

An important challenge in multimodal integration in Deep Learning models is how and where in the model architecture data from different modalities should be combined, which we call fusion. In a GNN context, we broadly differentiate between early fusion, where data from different modalities is combined prior to message passing, and late fusion, where data is combined after the message passing steps (Figure 7).

We broadly categorize the multimodal GNNs into four groups: Pathomic fusion based, which uses the pathomic fusion strategy, popularized by Chen et al. [25], Early fusion, Late fusion and Modality prediction, encompassing models that predict one modality using another. Models that do not directly fuse modalities but use predictions from one modality to drive how the other modalities are processed are considered Late fusion models.

4.3.1 Full multimodality

Pathomic Fusion: Chen et al. [25] integrated whole slide image information together with RNA-Seq counts and copy number variant (CNV) information. They used this combined information for cancer subtyping and survival analysis on clear cell renal cell carcinoma and glioma TCGA datasets. Their multimodal model fused information from three different modules: a CNN-based image module, a GNN-based cell graph module, and a genomic module, which took CNV and RNA-seq information as input. In the image module, a set of WSI patches was used as input for an ImageNet-pretrained VGG19 CNN model optimized for cancer grading and survival prediction. The cell graph module first segmented the nuclei in the image, constructed a graph using these nuclei, and used message-passing layers to learn a graph representation. Lastly, in the genomic module, a self-normalizing neural network was trained on a vector of CNV and RNA-seq information to learn a genomic representation. Their approach for multimodal fusion, which they call Pathomic fusion, models interactions between modalities via the Kronecker product of attention-gated representations. The attention gating is applied to the hidden representation of modality $m$ , $h_{m}$ , by learning a transformation $W_{ign\to m}$ which assigns an importance score to each modality, denoted $z_{m}$ :

\begin{aligned}h_{m,\text{gated}}&=z_{m}\ast h_{m},\quad\forall m\in\{i,g,n\}\\ \text{where}\quad h_{m}&=\text{ReLU}(W_{m}\cdot h_{m})\\ z_{m}&=\sigma(W_{\text{ign}\rightarrow m}\cdot[h_{i},h_{g},h_{n}])\end{aligned}

(35)

Where $h_{i}$ , $h_{g}$ , and $h_{n}$ are the gated representation vectors of the image module, graph module, and genomic module, respectively. The authors calculated the Kronecker product of these vectors to get a combined representation $h_{fusion}$ :

h_{fusion}=\begin{pmatrix}h_{i}\\ 1\\ \end{pmatrix}\otimes\begin{pmatrix}h_{g}\\ 1\\ \end{pmatrix}\otimes\begin{pmatrix}h_{n}\\ 1\\ \end{pmatrix}

(36)

where $\otimes$ denotes the outer product. The result, $h_{fusion}$ , is a three-dimensional tensor that can then be passed to a fully connected layer for classification tasks or survival prediction.
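The fusion in Eq. (36) can be worked through on small vectors. The numeric values below are arbitrary and chosen only so that the individual products are easy to recognize; attention gating is assumed to have been applied already.

```python
def fuse(h_i, h_g, h_n):
    """Outer product of the three vectors, each extended with a constant 1 (Eq. 36)."""
    a, b, c = h_i + [1.0], h_g + [1.0], h_n + [1.0]
    return [[[x * y * z for z in c] for y in b] for x in a]

h_i, h_g, h_n = [2.0, 3.0], [5.0], [7.0, 11.0]  # toy gated representations
tensor = fuse(h_i, h_g, h_n)

trimodal = tensor[0][0][0]  # h_i[0] * h_g[0] * h_n[0]
bimodal = tensor[0][0][2]   # h_i[0] * h_g[0] * 1
unimodal = tensor[0][1][2]  # h_i[0] * 1 * 1
```

Appending a 1 to each vector means that the resulting tensor contains not only the trimodal products but also all bimodal and unimodal terms, so no single-modality information is lost in the fusion.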

Jiang et al. [108] predicted EGFR gene mutations in lung cancer by augmenting the approach used by Chen et al. [25]. The authors' approach differs from that of Chen et al. by using clinical information (e.g., gender, age) instead of genomic data as the third modality, next to a spatial cell graph and whole slide image. Compared with a previous model from the same group [109], which used a cell graph module and an image module but no clinical features, the authors found considerable performance increases for the multimodal model.

Early Fusion: Azher et al. [96] integrated spatial transcriptomics data with accompanying H&E WSI data to predict survival and grade cancer in colorectal cancer. The authors first constructed an embedding model that used an ImageNet-pretrained CNN to encode H&E patches and fully connected layers to encode spatial gene expression data at the same location. They then optimized a projection layer to merge the data from these modalities into a single vector using a combination of unimodal and cross-modal SimCLR loss functions. This effectively trained the model to encode a cross-modal embedding vector. The acquired embeddings were used as node vectors in a GNN model for downstream tasks. The authors showed that using expression-aware embeddings improved model performance on all tasks, indicating that pretraining using coupled H&E WSIs and spatial transcriptomics datasets can help retrieve more discriminative embeddings for downstream tasks.

Late Fusion: Zuo et al. [89] integrated H&E stained WSIs with genomic biomarker information. Specifically, they constructed a graph of patches containing Tumor-Infiltrating Lymphocytes (TILs) and analyzed this graph using a GNN. Genomic data consisted of mRNA gene counts, which were transformed to a gene co-expression module matrix using the lmQCM algorithm. They then applied a concrete autoencoder model to the co-expression matrix to identify survival-associated features. The GNN and autoencoder outputs were then fused using a self-attention layer.

De et al. [110] combined MRI scans and H&E-stained WSIs of brain tumors to predict the type of brain cancer. The modalities were not directly fused; instead, the authors first used a 3D-CNN model to detect whether the cancer was one of the possible cancer types (Glioblastoma). If this was the case, the model simply output glioblastoma as its prediction. When this was not the case, a patch graph was constructed from the H&E image and used as input for a GNN model. Finally, this GNN model predicted one of the remaining classes (Normal, Astrocytoma, or Oligodendroglioma).

Xie et al. [111] combined gene expression with H&E whole slide image data for survival prediction in gastric cancer. Here, the authors first processed ResNet-based WSI tile features and a gene expression matrix separately using MLP layers. Then the interaction between each WSI patch and each gene feature vector was calculated using a cross-modal attention layer. After this processing, the data from both modalities was aggregated using a MIL-aggregation module and finally fused using concatenation. The fused embeddings were used to construct a patient graph, based on the similarity of the fused embeddings between the patients. A GNN was used to process this graph, which produced a survival prediction.

Zheng et al. [112] fused gene-expression signatures with a WSI-patch graph using their Genomic Attention Module approach. After message-passing on the patch graph, the pairwise interactions between each patch and each individual gene signature were modeled using a self-attention mechanism. This enables the model to learn the interactions between spatial tissue regions and gene signatures, which allowed the authors to visualize which gene signatures were associated with certain regions in the WSI.

Modality Prediction: Fatemi et al. [113] integrated spatial transcriptomic data with co-localized H&E WSI data to characterize spatial tumor heterogeneity in colorectal cancer. They achieved this by training a model to predict the spatial gene expression from the H&E WSI. The authors tried to predict the spatial gene expression using both a CNN- and GNN-based network and showed that for this task, the CNN-based methods worked better.

Gao et al. [114] predicted spatial transcriptomic data using H&E images by integrating image and cell graph data using CNN- and GNN-based models. The authors showed that integrating the graph-based and image-based information significantly improved performance over using either one alone.

4.3.2 Stain multimodality

Early fusion: Li et al. [41] fused information from Second-Harmonic Generation (SHG) microscopy images and H&E WSIs to differentiate between pancreatic ductal adenocarcinoma and chronic pancreatitis in pancreatic cancer. The images from both modalities were registered and for each modality a separate graph was constructed. The features from each modality were combined into node features for the input graph, where nodes represented registered patches in both modalities. An ImageNet-pretrained ResNet model was used to retrieve features from the H&E patches, while collagen fiber-specific handcrafted features were extracted for each SHG patch. An H&E-SHG graph was constructed where the node vectors contained the concatenation of the patch features from both modalities. This graph was used in a GNN model which discriminated between the two classes.

Gallagher-Syed et al. [59] integrated data from IHC- (CD138, CD68, CD20) and H&E stained synovial biopsy samples to predict a Rheumatoid Arthritis subtype using a GNN model. Information between the staining modalities was exchanged by modeling each patch, from each staining, as a node and connecting the nodes based on their feature similarity to get a single multistain graph. The authors showed that the features across stains were similar enough to cause nodes from different stainings to mix in the graph and, thus, enable information exchange between the modalities in the message passing layers of the GNN. The authors used the multimodal graph as input for a GNN model whose output was used to predict the Rheumatoid Arthritis subtype.

Late fusion: Dwivedi et al. [87] combined trichrome- (TC) and H&E stainings of liver biopsies to predict liver fibrosis. The authors experimented with different modality fusion techniques. Their experiments showed that their late concatenation or addition and the pathomic fusion strategy proposed by Chen et al. [25] performed the best for fibrosis prediction. In the late and pathomic fusion strategies, they separately processed both the H&E and TC tissues as graphs using a GNN and then fused the features from both modalities together.

Qiu et al. [115] combined information from H&E stainings, multiphoton microscopy (MP), and two-photon excited fluorescence (TPEF) applied to the same breast cancer biopsies. Instead of fusing the modalities in the model itself, the authors determined tumor-associated collagen signatures from the three modalities in different regions to calculate an 8-bit binary vector for each region. The sampled regions were treated as graph nodes with the binary vector as node attributes. Using these nodes, a fully connected graph was constructed and used as input for a GNN model. The model's output could be used for survival prediction.

Modality prediction: Pati et al. [116] used a generative approach to virtually predict IHC-stained tissue images from H&E WSIs, and then used a multimodal GNN Transformer model to perform survival prediction and cancer grading tasks in prostate cancer, breast cancer, and colorectal cancer. The authors compared three strategies (no fusion, early fusion, late fusion) and found that early fusion worked best for both tasks. In early fusion, the authors combined ImageNet-pretrained ResNet features from the same patch in all modalities to form the node features in the input graph. In late fusion, each modality was assigned a separate input graph, which was processed separately using the GNN Transformer model. Subsequently, the output features were combined. The authors hypothesized that early fusion allowed the model to learn multimodal spatial interactions during message passing, causing a performance gain compared to late fusion.
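The difference between the early and late fusion strategies described above can be illustrated with a minimal NumPy sketch. It is not the implementation of any of the cited works: the layer `gnn_layer` (mean aggregation, linear map, ReLU), the random weights, the feature sizes and the chain-shaped toy graph are all assumptions chosen for brevity.

```python
import numpy as np

def gnn_layer(adj, x, w):
    """Mean-neighbour aggregation with self-loops, a linear map, and ReLU."""
    a = adj + np.eye(adj.shape[0])            # add self-loops
    a = a / a.sum(axis=1, keepdims=True)      # row-normalise
    return np.maximum(a @ x @ w, 0.0)

def early_fusion(adj, x_a, x_b, w):
    """Concatenate per-patch features first, then pass messages on one graph."""
    x = np.concatenate([x_a, x_b], axis=1)    # joint node features
    return gnn_layer(adj, x, w).mean(axis=0)  # mean readout over nodes

def late_fusion(adj_a, x_a, w_a, adj_b, x_b, w_b):
    """Process each modality on its own graph, then concatenate the readouts."""
    g_a = gnn_layer(adj_a, x_a, w_a).mean(axis=0)
    g_b = gnn_layer(adj_b, x_b, w_b).mean(axis=0)
    return np.concatenate([g_a, g_b])

rng = np.random.default_rng(0)
n = 5
# Hypothetical toy graph: a chain of 5 registered patches.
adj = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
x_he, x_ihc = rng.normal(size=(n, 8)), rng.normal(size=(n, 4))

z_early = early_fusion(adj, x_he, x_ihc, rng.normal(size=(12, 6)))
z_late = late_fusion(adj, x_he, rng.normal(size=(8, 3)),
                     adj, x_ihc, rng.normal(size=(4, 3)))
```

In the early variant the weight matrix mixes both modalities before and during message passing, whereas in the late variant the modalities only meet after the readout, which mirrors the hypothesis reported above.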

Publication	Date	Application	Fusion	Modalities
Chen et al. [25]	2020/09	Survival prediction, Cancer subtyping	Late (Pathomic fusion)	H&E WSI, Gene expression, CNV
Dwivedi et al. [87]	2022/04	Cancer grading	Late	H&E WSI, TC WSI
Qiu et al. [115]	2022/07	Survival prediction	Early	H&E WSI, MP, TPEF
Zuo et al. [89]	2022/09	Survival prediction	Late (Self-attention)	H&E WSI, Gene expression
De et al. [110]	2022/10	Cancer subtyping	None	H&E WSI, MRI
Li et al. [41]	2022/11	Cancer subtyping	Early	H&E WSI, SHG
Xie et al. [111]	2022/12	Survival prediction	Late	H&E WSI, Gene expression
Fatemi et al. [113]	2023/03	ST-prediction	None	H&E WSI, ST
Jiang et al. [108]	2023/03	Mutation prediction	Late (Pathomic fusion)	H&E WSI, clinical data
Gao et al. [114]	2023/07	ST-prediction, survival prediction	None	H&E WSI, ST
Gallagher-Syed et al. [59]	2023/09	Rheumatoid arthritis subtyping	Early	H&E WSI, IHC WSI
Pati et al. [116]	2023/12	Survival prediction, Cancer grading	Early, Late	H&E, virtual IHC
Azher et al. [96]	2024/01	Survival prediction, Cancer grading	Early	H&E WSI, ST
Zheng et al. [112]	2024/01	Survival prediction	Late	WSI, Gene expression

Table 4: Applications of Multimodal GNNs in histopathology. CNV: Copy Number Variation, TC: Trichrome, MP: MultiPhoton microscopy, TPEF: two-photon excited fluorescence microscopy, MRI: Magnetic Resonance Imaging, SHG: Second-Harmonic Generation microscopy, ST: Spatial Transcriptomics, IHC: Immunohistochemistry

4.4 Higher-order graphs

While graphs have been shown to be an adequate format for the representation of histopathology slides, they are limited by the fact that only pairwise relations can be modeled. Furthermore, the entities in a graph can solely be modeled as nodes and edges. These limitations have inspired extensions to the graph modeling framework, which are collectively known as higher-order graphs. Examples of higher-order graphs are hypergraphs, cellular complexes, and combinatorial complexes. To allow learning from these higher-order graph structures, message-passing frameworks called topological neural networks (TNNs) have been developed [117].

In histopathology, TNNs have not yet been widely adopted, but there has been a steadily increasing number of publications that model WSIs as hypergraphs. Hypergraphs extend the graph modeling framework with hyperedges, which can connect sets containing an arbitrary number of nodes in the graph. This allows hypergraphs to model relations that involve more than two entities. Deep learning on hypergraphs can be done using hypergraph neural network architectures, such as HGNN [118] and HyperGAT [119]. We provide an overview of publications using higher-order graphs in histopathology in Table 5.

Let us denote a hypergraph as $G=(V,E_{\text{hyp}})$, which consists of a set of nodes $V$ and a set of hyperedges $E_{\text{hyp}}$. Each hyperedge in $E_{\text{hyp}}$ is a non-empty subset of $V$, allowing connections between any number of vertices. For example, a hypergraph with vertices $V=\{v_{1},v_{2},v_{3},v_{4}\}$ can have hyperedges $E_{\text{hyp}}=\{\{v_{1},v_{2}\},\{v_{2},v_{3},v_{4}\},\{v_{1},v_{3},v_{4}\}\}$, each expressing a relationship between multiple nodes simultaneously. We denote the connectivity of a hypergraph using an incidence matrix $H\in\{0,1\}^{|V|\times|E_{\text{hyp}}|}$ whose entries are defined as:

h(v,e)=\begin{cases}1,&\text{if }v\in e\\ 0,&\text{if }v\notin e\\ \end{cases}

(37)

for nodes $v\in V$ and edges $e\in E_{hyp}$. For any node $v$, its degree is defined as $d(v)=\sum_{e\in E_{hyp}}h(v,e)$. Similarly, for any edge $e\in E_{hyp}$, its degree is defined as $d(e)=\sum_{v\in V}h(v,e)$. These degrees are stored in diagonal matrices $D_{e}$ and $D_{v}$, which contain the edge degrees and node degrees, respectively. Lastly, we denote the matrix of node features as $X$. The decision on which nodes to connect with a hyperedge is usually based on feature similarity or spatial distance (i.e., closely related nodes are connected by a single hyperedge). Feng et al. [118] introduced the hypergraph neural network (visualized in Figure 8), which defined a message passing operation on hypergraphs as follows:

\mathbf{X}^{(l+1)}=\sigma\left(\mathbf{D}_{v}^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_{e}^{-1}\mathbf{H}^{\top}\mathbf{D}_{v}^{-1/2}\mathbf{X}^{(l)}\mathbf{\Theta}^{(l)}\right)

(38)

where $W$ is a diagonal matrix of hyperedge weights (which may be learned), $\sigma$ denotes a nonlinear activation function, and $\Theta^{(l)}$ is a learnable filter matrix used for feature extraction. After applying message passing, we have an updated node feature matrix $X^{(l+1)}$. This can then be used to obtain features on the hyperedge level using the equation $X^{(l+1)}_{he}=H^{\top}X^{(l+1)}$. Finally, the updated node-level embeddings are acquired by multiplying the incidence matrix with the hyperedge features: $X^{(l+1)\prime}=HX^{(l+1)}_{he}$.
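To make the notation concrete, the following NumPy sketch builds the incidence matrix of the four-node example hypergraph given above and applies one propagation step in the form of Eq. 38. It is a minimal illustration under stated assumptions: the hyperedge weights $W$ are set to the identity, ReLU is used for $\sigma$, and the features `X` and filter `Theta` are random placeholders rather than learned quantities.

```python
import numpy as np

# Rows are nodes v1..v4; columns are the hyperedges
# {v1,v2}, {v2,v3,v4}, {v1,v3,v4} from the example in the text.
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 1]], dtype=float)

d_v = H.sum(axis=1)   # node degrees: number of hyperedges containing each node
d_e = H.sum(axis=0)   # hyperedge degrees: number of nodes in each hyperedge

def hgnn_propagation(H, W=None):
    """D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, with W = I if not given."""
    W = np.eye(H.shape[1]) if W is None else W
    dv_inv_sqrt = np.diag(1.0 / np.sqrt(H.sum(axis=1)))
    de_inv = np.diag(1.0 / H.sum(axis=0))
    return dv_inv_sqrt @ H @ W @ de_inv @ H.T @ dv_inv_sqrt

def hgnn_layer(X, H, Theta):
    """One hypergraph convolution step with ReLU as the nonlinearity."""
    return np.maximum(hgnn_propagation(H) @ X @ Theta, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))        # placeholder node features
Theta = rng.normal(size=(5, 2))    # placeholder filter matrix
X_next = hgnn_layer(X, H, Theta)   # updated node features
X_he = H.T @ X_next                # hyperedge-level features
```

The propagation matrix is symmetric by construction, and the shapes make the two-stage flow visible: node features are gathered onto the three hyperedges and scattered back to the four nodes.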

Di et al. [120] were the first to model WSIs as hypergraphs. They used their hypergraph approach for survival prediction in lung and brain cancer datasets. The authors started by constructing sets of $K$ similar patches based on the Euclidean distance between the feature vectors, which were retrieved using an ImageNet-pretrained ResNet model. $N$ hyperedges were then used to connect the patches in each of the sets. The authors then used the node feature matrix $X$ with the defined hypergraph, captured in $H$, and updated the features using a series of hypergraph convolutional (HGNN) layers. The representations acquired after message passing were then used for the downstream survival prediction task. The authors showed that their hypergraph-based method outperformed other CNN- and GNN-based frameworks for survival prediction.
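A feature-based hypergraph of this kind can be sketched in a few lines of NumPy. The sketch below is an illustration, not the authors' code: it assumes one hyperedge per patch, and that each hyperedge contains the patch itself plus its $k-1$ nearest patches in feature space, so that every hyperedge has exactly $k$ members. The function name `knn_hypergraph` and the random features are hypothetical.

```python
import numpy as np

def knn_hypergraph(features, k):
    """Incidence matrix (patches x hyperedges) with one hyperedge per patch.

    Hyperedge e contains patch e and its k-1 nearest patches under
    Euclidean distance (an assumed convention).
    """
    n = features.shape[0]
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    H = np.zeros((n, n))
    for e in range(n):
        d = dist[e].copy()
        d[e] = -np.inf                          # force the centre patch to rank first
        members = np.argsort(d)[:k]
        H[members, e] = 1.0
    return H

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 16))   # stand-in for pretrained patch embeddings
H_knn = knn_hypergraph(feats, k=3)
```

Computing all pairwise distances costs memory quadratic in the number of patches, so for a full WSI a nearest-neighbour index would likely be needed in place of the dense distance matrix used here.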

Bakht et al. [121] followed with the construction of a patch-based hypergraph for the classification of patches in colorectal cancer. They used an ImageNet-pretrained VGG-19 model to extract features for each WSI patch. Given a fixed neighbor parameter $k$, their hypergraph construction strategy starts by defining the distance between any two patches $i$, $j$ as:

d_{k}(i,j)=\exp\left(-\frac{||x_{i}-x_{j}||_{2}^{2}}{2\sigma^{2}}\right)

(39)

where $x_{i}$, $x_{j}$ represent the feature vectors of patches $i$ and $j$, respectively, and $\sigma$ is a bandwidth parameter. Then, the authors calculated the vertex-edge probabilistic incidence matrix, which determines the probability that a node $v$ is connected by hyperedge $e$:

h(v,e)=\begin{cases}\exp\left(-\frac{d}{p_{\text{max}}d_{\text{avg}}}\right),&\text{if }v\in e\\ 0,&\text{if }v\notin e\end{cases}

(40)

Here $d$ denotes the distance between the current node $v$ and the neighboring node, $p_{\text{max}}$ denotes the maximum probability, and $d_{\text{avg}}$ is the average distance between all $k$ nearest neighbors. Finally, they used this incidence matrix to calculate the node and edge degrees:

d(v)=\sum_{e\in E}h(v,e),\quad d(e)=\sum_{v\in V}h(v,e)

(41)

The degrees are combined into matrices $D_{v}$ and $D_{e}$, which are used, together with the incidence matrix $H$ and node feature matrix $X$, in three HGNN message passing layers. The output of these layers was used to predict the label of patches in the WSI.
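A soft incidence matrix in the spirit of Eq. 40 can be sketched as follows. Several choices here are assumptions that the description above does not pin down: $d$ is taken to be the Euclidean distance between a member patch and the patch on which the hyperedge is centred, $d_{\text{avg}}$ is the mean of those distances over the $k$ neighbours of that hyperedge, and the centre patch receives weight 1. The function name is hypothetical.

```python
import numpy as np

def probabilistic_incidence(features, k, p_max=1.0):
    """Soft incidence matrix: entry (v, e) is exp(-d / (p_max * d_avg)) for members."""
    n = features.shape[0]
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    H = np.zeros((n, n))
    for e in range(n):
        d = dist[e].copy()
        d[e] = np.inf                     # exclude the centre when ranking neighbours
        nbrs = np.argsort(d)[:k]          # k nearest neighbours of patch e
        d_avg = dist[e, nbrs].mean()      # assumes the neighbours are not all duplicates of e
        H[nbrs, e] = np.exp(-dist[e, nbrs] / (p_max * d_avg))
        H[e, e] = 1.0                     # distance 0 to itself gives exp(0) = 1
    return H

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 3))
H_prob = probabilistic_incidence(feats, k=2)

# Degrees are row and column sums of the soft incidence matrix.
d_v = H_prob.sum(axis=1)   # node degrees
d_e = H_prob.sum(axis=0)   # hyperedge degrees
```

Because the entries are weights rather than binary indicators, the resulting degrees are real-valued, and closer neighbours contribute more to both the node and the hyperedge degree.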

Di et al. [122] then expanded on their previous work by using multiple hypergraphs that are fused to serve as input for message passing layers. Specifically, they constructed a topological hypergraph and a phenotype (feature-based) hypergraph. The authors sampled patches sequentially from the tissue boundary to the tissue center and grouped the patches from the same sequence step into the same topological area. The topological hypergraph is constructed by connecting neighboring patches with a hyperedge if they belong to the same topological area. The phenotype hypergraph, meanwhile, is constructed using K-NN based on the similarity between the patch feature vectors. The two hypergraphs are then concatenated to form a total incidence matrix $H$. For processing the constructed hypergraph, the authors use max-mask convolutional layers, which are defined in 4 iterative steps:

1. Hyperedge Feature Gathering: First, hyperedge-level features are formed by multiplying the hypergraph incidence matrix ( $H$ ) and the node feature matrix ( $F_{v}^{(l)}$ ). This step aggregates the information from nodes connected by each hyperedge, resulting in hyperedge-level features ( $F_{e}^{(l)}$ ).
2. Max-Mask Operation: After gathering hyperedge-level features, a max-mask operation is performed on each dimensionality of $F_{e}^{(l)}$ . This operation aims to avoid overfitting by disregarding the contribution of dominant hyperedges that take the largest values.
3. Node Feature Aggregating: By multiplying the hyperedge features with the transposed incidence matrix ( $H^{T}\times F_{e}^{(l)}$ ), we can calculate the node features ( $F_{v}^{(l+1)}$ ).
4. Node Feature Reweighting: Finally, the output node features are further weighted using learnable parameters ( $\iota^{(l)}$ ), which are represented as a diagonal matrix. This reweighting is followed by a non-linear activation function ( $\sigma$ ). The reweighting step allows the model to learn the importance of different node features and adaptively adjust them.

Mathematically, the max-mask convolutional layer is defined as follows:

\begin{aligned}X^{(l+1)}&=\sigma\left((I-L)X^{(l)}+H^{-1}(I-L)X^{(\lambda)}\cdot\iota^{(l)}\right)\\ F^{(l+1)}_{e}&=H^{-1}(I-L)X^{(l)}+X^{(\lambda)}\end{aligned}

(42)

Here, $L$ is the multigraph Laplacian matrix, and $I$ denotes the identity matrix. $H^{-1}(I-L)X^{(\lambda)}$ functionally ensures that the top $\lambda$ attribute feature dimensionalities are ignored during gradient calculation.

Benkirane et al. [123] used adaptive agglomerative clustering to construct a patch hypergraph, which was then processed using a combination of HGNN and HGAT layers. The authors used self-supervised learning to learn patch-level representations. For agglomerative clustering, a similarity kernel was used that took into account both spatial locality and feature similarity between patches. This kernel calculated similarity scores between all pairs of patches. If the similarity score was higher than a fixed threshold $\delta$, the patches were assigned to the same cluster $C_{k}$. For each cluster, the representations of the patches in the cluster were averaged to obtain cluster-level representations. Each cluster is treated as a node of a hypergraph. The hyperedges connected all nodes with a feature similarity higher than a fixed threshold $\delta_{h}$. We denote the neighborhood of a clustered node $c_{i}$ as $\gamma(c_{i})=\{c_{j}\in C;\kappa_{h}(c_{i},c_{j})\geq\delta_{h}\}$. Here, $C$ denotes the set of all clusters and $\kappa_{h}(c_{i},c_{j})$ denotes the output of the feature similarity kernel $\kappa_{h}$. Having determined the neighborhood, we can calculate the incidence matrix $H$ where:

h_{i,j}=\begin{cases}1,&\text{if }c_{i}\in\gamma(c_{j})\\ 0,&\text{else}\end{cases}

(43)

The authors then used the incidence matrix $H$ and node feature matrix $X$ as input for a series of HGNN-HGAT layers, whose outputs were pooled into a hypergraph-level representation. This representation was finally used as input for an MLP layer which predicted the hazard score for survival prediction.
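The threshold-based hyperedge construction can be sketched as below. The exact form of the feature similarity kernel $\kappa_{h}$ is not specified above, so a Gaussian kernel with a hypothetical `bandwidth` parameter is assumed here, and the cluster features are toy values. Entry $(i,j)$ of the incidence matrix is set to 1 when the similarity between clusters $i$ and $j$ reaches $\delta_{h}$, so each cluster defines one hyperedge made up of its neighborhood.

```python
import numpy as np

def threshold_hypergraph(cluster_feats, delta_h, bandwidth=1.0):
    """Binary incidence matrix from a thresholded similarity kernel.

    The Gaussian kernel is an assumption for illustration only.
    """
    diff = cluster_feats[:, None, :] - cluster_feats[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    kappa = np.exp(-sq_dist / (2.0 * bandwidth ** 2))   # similarity in (0, 1]
    return (kappa >= delta_h).astype(float)

# Two nearly identical clusters and one distant cluster (toy values).
cluster_feats = np.array([[0.0, 0.0],
                          [0.1, 0.0],
                          [5.0, 5.0]])
H_thr = threshold_hypergraph(cluster_feats, delta_h=0.5)
```

Unlike K-NN construction, thresholding yields hyperedges of varying size: dense regions of feature space produce large hyperedges, while isolated clusters end up in singleton hyperedges.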

Most recently, Liang et al. [124] introduced the adaptive HGNN to histopathology, for the classification of sentinel node metastases and the differentiation between lung squamous cell carcinoma and lung adenocarcinoma. Here, the authors used the K-NN algorithm on patch-level ImageNet-pretrained ResNet features to construct a hypergraph of patches, where the $k$ most similar patches were connected using a hyperedge. Their main innovation comes in the form of adaptive HGNN, which can adjust the correlation strength between nodes and hyperedges on the graph during model training. They first denote a matrix of edge strength in layer $l$ as $T^{(l)}$ . Each element $t_{i}^{(l)}\in T^{(l)}$ , which denotes the attention score of the node $i$ and its associated hyperedge $e_{i,i^{\prime}}$ in the $l$ -th layer, is defined as:

t_{i}^{(l)}=\frac{\exp(\sigma(\operatorname{sim}(f_{i}M^{(l)},e_{i,i^{\prime}}M^{(l)})))}{\sum\nolimits_{k\in N_{j}}\exp(\sigma(\operatorname{sim}(f_{i}M^{(l)},e_{i,k}M^{(l)})))}

(44)

Here, $M^{(l)}$ denotes a feature transformation matrix and $e_{i,i^{\prime}}$ denotes the hyperedge connecting nodes $i$ and $i^{\prime}$. By calculating these edge strength scores, the incidence matrix can be updated as follows:

\tilde{H}^{i\prime\prime(l)}=D_{v}^{-1/2}(T_{i}^{(l)}\odot H^{i})WD_{e}^{-1}(T_{i}^{(l)}\odot H^{i})^{\top}D_{v}^{-1/2}

(45)

where $D_{v}$ , $D_{e}$ denote the node degree and edge degree matrices. $T^{(l)}$ denotes the edge strength matrix and $W$ is a learnable weight matrix. This function essentially adapts the node interconnection in $H$ using the calculated edge strengths in $T^{(l)}$ . Note that the edge strength changes depending on the layer $l$ , as the feature similarities also change between layer embeddings. The feature matrix is updated as follows:

\tilde{F}_{i}^{(l+1)}=\{\tilde{f}_{i,j}\}_{j=1}^{P}=\sigma\left(\tilde{H}^{i\prime\prime(l)}F_{i}^{(l)}P_{i}^{(l)}\right)

(46)

where $\sigma$ is a nonlinear activation function and $P_{i}^{(l)}$ denotes a learned projection matrix.

Publication	Date	Application	Hypergraph type	Message-Passing
Di et al. [120]	2020/09	Survival prediction	Patch Hypergraph	HGNN
Bakht et al. [121]	2021/05	Patch classification	Patch Hypergraph	HGNN
Di et al. [122]	2022/09	Survival prediction	Patch Hypergraph	HGMConv
Benkirane et al. [123]	2022/11	Survival prediction	Patch Hypergraph	HGCN, HGAT
Liang et al. [124]	2024/02	Binary classification	Patch Hypergraph	Adaptive HGNN

Table 5: Publications which utilized hypergraph neural networks for histopathology WSI analysis.

5 Future Prospects and Directions

5.1 Topological Deep Learning

In our review, we highlighted the application of deep learning on hypergraphs in histopathology. Interestingly, this approach has only been applied on a patch level, whereas we argue that hypergraph-based modeling might be very well suited for cell-level modeling. For example, cells can be organized in clusters that can have an important diagnostic context [125]. Such cell clusters could be modeled using hypergraphs, where homogeneous clusters are connected using a single hyperedge. Furthermore, there exist many other higher-order graph types such as cellular complexes and combinatorial complexes. We anticipate that these approaches will also be tested in a histopathological context. For example, using cellular complexes, different semantic tissue structures (e.g., tertiary lymphoid structures) can be modeled jointly with cells, but as separate graph entities.

5.2 Graph transformers

In the last few years, GNNs have been combined with transformer architectures, giving rise to the Graph Transformer modeling paradigm. Graph transformers either use a positional embedding of the graph in the input to the transformer module, use the graph structure as a prior to build an attention mask for each input, or directly combine message passing layers with transformer blocks in the model architecture [126]. Graph transformers are especially suited for modeling long-distance relations in graphs, as they do not suffer from oversmoothing, where node representations become almost identical across the graph as GNN depth increases, or from oversquashing, where information from an exponentially growing receptive field is compressed into fixed-size node representations as GNN layers are added [127]. Graph transformers have also been used in histopathology. One major challenge in their application is scalability, as the time and memory complexity of the attention mechanism in Transformers grows quadratically with the number of nodes ( $O(|V|^{2})$ , where $|V|$ is the number of nodes). This is especially a problem for cell graphs in histopathology, as these graphs often exceed 10,000 nodes. Recently, efforts have been made to mitigate this scalability challenge [128] [129] [130], which leads us to believe the popularity of graph transformers in histopathology will continue to grow.
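The scalability concern can be made tangible with a short calculation and a toy attention layer. The sketch below assumes dense attention stored in 32-bit floats and shows the second of the three design options listed above (the graph used as an attention mask). For brevity it omits the learned query, key and value projections of a real transformer block, and all names in it are hypothetical.

```python
import numpy as np

def attention_matrix_bytes(num_nodes, bytes_per_value=4):
    """Memory needed to store one dense |V| x |V| attention matrix."""
    return num_nodes ** 2 * bytes_per_value

def masked_self_attention(x, adj):
    """Single-head self-attention restricted to graph neighbours and self."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # dense |V| x |V| score matrix
    mask = (adj + np.eye(n)) > 0                    # graph structure as attention mask
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

# A cell graph with 10,000 nodes: 10^8 entries, i.e. 400 MB per head and layer.
mem = attention_matrix_bytes(10_000)

rng = np.random.default_rng(0)
adj = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)   # toy chain of 4 nodes
out, weights = masked_self_attention(rng.normal(size=(4, 8)), adj)
```

Note that masking alone does not remove the quadratic cost in this dense formulation, since the full score matrix is still materialised; sparse or linearised attention is needed to actually reduce it.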

5.3 Graph-based multimodality

In our review, we highlighted the use of graph-based modeling in multimodal approaches, but we argue that graphs themselves should be utilized more for the multimodal integration itself. For example, several researchers have used the concept of a patient graph, where nodes represent (aggregated) datapoints from different medical modalities corresponding to the same patient [131] or multiple patients [132] [133]. Some approaches use graphs to model time series data, where, for example, medical information on the same patient gathered at different timepoints can be effectively utilized [134] [135]. Zheng et al. proposed a framework in which adaptive graph structure learning and GNNs are combined to integrate data from different medical modalities for disease prediction [136]. One major problem in the application of multimodal approaches in histopathology is that, often, not every modality is available for each patient. This creates a missing modality problem. Ma et al. proposed a Bayesian meta-learning framework which mitigates this problem, allowing effective multimodal learning and prediction even when a large number of modalities are missing in the data [137]. We argue that these approaches should be combined to effectively model the relationships between modalities, based on the task at hand, even in settings where modalities are missing from the patient data.

5.4 SSL using GNNs

Due to the high costs of annotation in histopathology, adoption of self-supervised learning (SSL) has been steadily growing in histopathology applications, particularly for feature extraction. In GNN approaches, SSL has likewise been used primarily for extracting the node features. Although there have been a handful of approaches that used graph-based SSL [138] [139], we argue that more can still be gained from this approach. For example, only contrastive approaches have been tried, which leaves room for other schemes (e.g., autoregressive, generative). We propose using an approach similar to that of Deep Graph Infomax [140], where the aim is to maximize the mutual information between a global summary of the graph and its local (node-level) representations. This effectively makes the node features mindful of the global graph structure. This idea can be extended to hierarchical histopathology graphs, where the agreement between intermediate graphs (e.g., cell graphs and tissue graphs) can be maximized to obtain more context-aware embeddings, similar to work by Yan et al. [141].
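The objective behind Deep Graph Infomax can be sketched in NumPy as a forward pass only. This is a simplified illustration rather than the reference implementation: the encoder is a single mean-aggregation layer with ReLU, the weights are random instead of trained, and the toy ring graph is hypothetical. What it keeps from the method is the structure of the loss: node embeddings of the real graph are scored against a global summary, and embeddings of a corrupted graph (node features shuffled across nodes) serve as negatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dgi_loss(adj, x, w_enc, w_disc, rng):
    """Binary cross-entropy between real and corrupted node-summary pairs."""
    n = adj.shape[0]
    a = adj + np.eye(n)
    a_norm = a / a.sum(axis=1, keepdims=True)

    def encode(feats):
        return np.maximum(a_norm @ feats @ w_enc, 0.0)   # simplified encoder

    h_pos = encode(x)                        # embeddings of the real graph
    h_neg = encode(x[rng.permutation(n)])    # corruption: shuffle node features
    s = sigmoid(h_pos.mean(axis=0))          # global summary (readout)
    p_pos = sigmoid(h_pos @ w_disc @ s)      # bilinear discriminator scores
    p_neg = sigmoid(h_neg @ w_disc @ s)
    eps = 1e-12
    return -(np.log(p_pos + eps).mean() + np.log(1.0 - p_neg + eps).mean()) / 2.0

rng = np.random.default_rng(0)
n, d_in, d_hid = 6, 8, 4
ring = np.roll(np.eye(n), 1, axis=1)
adj = ring + ring.T                          # hypothetical ring of 6 nodes
x = rng.normal(size=(n, d_in))
loss = dgi_loss(adj, x,
                0.1 * rng.normal(size=(d_in, d_hid)),
                0.1 * rng.normal(size=(d_hid, d_hid)), rng)
```

In a hierarchical setting, the summary vector could be replaced by the embedding of the corresponding tissue-graph node, so that cell-level embeddings are trained to agree with their tissue-level context; this is one possible reading of the extension proposed above, not an established recipe.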

5.5 Hierarchical modeling in GNNs

As explained in our review, hierarchical GNNs are an increasingly popular modeling technique for histopathology WSIs, because the information in WSIs exists at different levels of coarseness. We believe that this trend will continue and extend to hypergraphs and other higher-order graph structures, for which hierarchical pooling frameworks are currently being established [142] [143]. Another future approach will be to learn, end-to-end, the level of coarsening needed to establish an effective hierarchical structure, which is currently controlled using a pooling ratio hyperparameter. We argue that different levels of graph coarseness might be optimal for different problems, as some problems in histopathology rely more on cell-level information, while others rely more on larger tissue structures. Lastly, in most current approaches, message passing occurs on each level of the hierarchy separately, not directly between hierarchies. We argue that the field could move to message passing schemes that are more effective at taking into account the hierarchical graph structure [144].

5.6 Foundation models in computational histopathology

The rise of self-supervised learning, as well as the increased availability of histopathology datasets, has allowed the construction of very large deep neural networks, termed foundation models, trained on huge amounts of (unlabeled) histopathology images [145]. These models can be used for effective feature extraction in a wide variety of tissue types. Recent approaches have introduced medical texts in addition to image data [146] [147], which allows associating image data with medical texts and is thus very suitable for CBHIR applications. In both natural language processing and computer vision, there has been a move to foundation models that incorporate an even broader spectrum of modalities (video [148], audio [149], knowledge graphs [150]). We argue that in histopathology and medical imaging in general, there will also be a move towards broader multimodality, especially given the number of different modalities available in the medical domain (WSI, IHC, MRI, CT, EHR, etc.). Graph models of WSIs can also be used as input in these models, to encode the topological information present in WSIs and correlate that with the image data.

5.7 Adaptive graph structure learning

We have seen that adaptive graph structure learning is currently based either on learned projections or CNN filters. Outside of histopathology, most adaptive graph structure learning methods assume graph homophily [151], where similar nodes are likely to be connected. In histopathology, this is not always the case, as some structures might be composed of different cell types which can vary widely in morphology. Furthermore, most applications focus on homogeneous graphs, where a single type of node and edge exists. Work by Zhao et al. [152] showed that we can learn a heterogeneous graph optimized for downstream tasks, which is suitable for graphs showing heterophily, as can be the case in histopathology. Therefore, we argue that heterogeneous graph learning will be a useful approach for histopathology, if we model WSIs as heterogeneous graphs.
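Whether the homophily assumption holds for a given cell graph can be checked with the edge homophily ratio, a commonly used measure defined as the fraction of edges that connect nodes of the same class. The sketch below is a minimal illustration; the edge list and the cell-type labels are hypothetical toy values.

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    same = labels[edges[:, 0]] == labels[edges[:, 1]]
    return float(same.mean())

# Toy cell graph: nodes 0 and 1 are one cell type, nodes 2 and 3 another.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
ratio = edge_homophily(edges, labels)   # two of the three edges join same-type cells
```

A ratio close to 1 indicates a homophilous graph, while low values indicate heterophily, for instance in structures where different cell types are interleaved, which is the setting where methods that assume homophily may be a poor fit.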

6 Conclusion

In this review, we provided a comprehensive overview of the recent developments in the applications of GNNs in histopathology, which can be used for guiding new research in the field. We quantified the growth of different modeling paradigms in the use of GNNs in histopathology. Based on our quantification, we provided a comprehensive overview of several emerging subfields, including hierarchical graph models, adaptive graph structure learning, multimodality using GNNs, and higher-order graph models. We then provided future directions in the field, including the use of topological deep learning, graph transformer models, self-supervised learning using GNNs, the use of foundation models and expanding adaptive graph structure learning to heterogeneous graphs.

References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[3] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International conference on machine learning, pages 2127–2136. PMLR, 2018.
[4] PJ Sudharshan, Caroline Petitjean, Fabio Spanhol, Luiz Eduardo Oliveira, Laurent Heutte, and Paul Honeine. Multiple instance learning for histopathological breast cancer image classification. Expert Systems with Applications, 117:103–111, 2019.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[6] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Junzhou Huang, Wei Yang, and Xiao Han. Transpath: Transformer-based self-supervised learning for histopathological image classification. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24, pages 186–195. Springer, 2021.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[8] Ozan Ciga, Tony Xu, and Anne Louise Martel. Self supervised contrastive learning for digital histopathology. Machine Learning with Applications, 7:100198, 2022.
[9] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
[10] Ruoyu Li, Jiawen Yao, Xinliang Zhu, Yeqing Li, and Junzhou Huang. Graph cnn for survival analysis on whole slide pathological images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 174–182. Springer, 2018.
[11] Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. Schnet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24), 2018.
[12] Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong. Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 975–985, 2021.
[13] Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems, 21(11):4883–4894, 2019.
[14] Di Zhu, Fan Zhang, Shengyin Wang, Yaoli Wang, Ximeng Cheng, Zhou Huang, and Yu Liu. Understanding place characteristics in geographic contexts through graph convolutional neural networks. Annals of the American Association of Geographers, 110(2):408–420, 2020.
[15] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 974–983, 2018.
[16] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457–i466, 2018.
[17] Benjamin Sanchez-Lengeling, Emily Reif, Adam Pearce, and Alexander B Wiltschko. A gentle introduction to graph neural networks. Distill, 6(9):e33, 2021.
[18] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[19] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[20] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[21] David Ahmedt-Aristizabal, Mohammad Ali Armin, Simon Denman, Clinton Fookes, and Lars Petersson. A survey on graph-based deep learning for computational histopathology. Computerized Medical Imaging and Graphics, 95:102027, 2022.
[22] Xiangyan Meng and Tonghui Zou. Clinical applications of graph neural networks in computational histopathology: A review. Computers in Biology and Medicine, 164:107201, 2023.
[23] Ali Khajegili Mirabadi, Graham Archibald, Amirali Darbandsari, Alberto Contreras-Sanz, Ramin Ebrahim Nakhli, Maryam Asadi, Allen Zhang, C Blake Gilks, Peter Black, Gang Wang, et al. Grasp: Graph-structured pyramidal whole slide image representation. arXiv preprint arXiv:2402.03592, 2024.
[24] Pushpak Pati, Guillaume Jaume, Lauren Alisha Fernandes, Antonio Foncubierta-Rodríguez, Florinda Feroce, Anna Maria Anniciello, Giosue Scognamiglio, Nadia Brancati, Daniel Riccio, Maurizio Di Bonito, et al. Hact-net: A hierarchical cell-to-tissue graph neural network for histopathological image classification. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis: Second International Workshop, UNSURE 2020, and Third International Workshop, GRAIL 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 8, 2020, Proceedings 2, pages 208–219. Springer, 2020.
[25] Richard J Chen, Ming Y Lu, Jingwen Wang, Drew FK Williamson, Scott J Rodig, Neal I Lindeman, and Faisal Mahmood. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Transactions on Medical Imaging, 41(4):757–770, 2020.
[26] Donglin Di, Jun Zhang, Fuqiang Lei, Qi Tian, and Yue Gao. Big-hypergraph factorization neural network for survival prediction from whole slide image. IEEE Transactions on Image Processing, 31:1149–1160, 2022.
[27] Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036, 14, 2021.
[28] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
[29] Shanshan Tang, Bo Li, and Haijun Yu. Chebnet: Efficient and stable constructions of deep neural networks with rectified power units using chebyshev approximations. arXiv preprint arXiv:1911.05467, 2019.
[30] Harshita Sharma, Norman Zerbe, Sebastian Lohmann, Klaus Kayser, Olaf Hellwich, and Peter Hufnagl. A review of graph-based methods for image analysis in digital histopathology. Diagnostic pathology, 1(1), 2015.
[31] Cagatay Bilgin, Cigdem Demir, Chandandeep Nagi, and Bulent Yener. Cell-graph mining for breast tissue modeling and classification. In 2007 29th Annual international conference of the IEEE Engineering in Medicine and Biology Society, pages 5311–5314. IEEE, 2007.
[32] Phillip P Santoiemma and Daniel J Powell Jr. Tumor infiltrating lymphocytes in ovarian cancer. Cancer biology & therapy, 16(6):807–820, 2015.
[33] Sahirzeeshan Ali, Robert Veltri, Jonathan A Epstein, Christhunesa Christudass, and Anant Madabhushi. Cell cluster graph for prediction of biochemical recurrence in prostate cancer patients from tissue microarrays. In Medical Imaging 2013: Digital Pathology, volume 8676, pages 164–174. SPIE, 2013.
[34] Hayley M Reynolds, Scott Williams, Alan M Zhang, Cheng Soon Ong, David Rawlinson, Rajib Chakravorty, Catherine Mitchell, and Annette Haworth. Cell density in prostate histopathology images as a measure of tumor distribution. In Medical Imaging 2014: Digital Pathology, volume 9041, pages 180–187. SPIE, 2014.
[35] Le Hou, Dimitris Samaras, Tahsin M Kurc, Yi Gao, James E Davis, and Joel H Saltz. Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2424–2433, 2016.
[36] Mohammed Adnan, Shivam Kalra, and Hamid R Tizhoosh. Representation learning of histopathology images using graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 988–989, 2020.
[37] Yushan Zheng, Bonan Jiang, Jun Shi, Haopeng Zhang, and Fengying Xie. Encoding histopathological wsis using gnn for scalable diagnostically relevant regions retrieval. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22, pages 550–558. Springer, 2019.
[38] Mookund Sureka, Abhijeet Patil, Deepak Anand, and Amit Sethi. Visualization for histopathology images using graph convolutional neural networks. In 2020 IEEE 20th international conference on bioinformatics and bioengineering (BIBE), pages 331–335. IEEE, 2020.
[39] Tai Hasegawa, Helena Arvidsson, Nikolce Tudzarovski, Karl Meinke, Rachael V Sugars, and Aravind Ashok Nair. Edge-based graph neural networks for cell-graph modeling and prediction. In International Conference on Information Processing in Medical Imaging, pages 265–277. Springer, 2023.
[40] Yasha Ektefaie, George Dasoulas, Ayush Noori, Maha Farhat, and Marinka Zitnik. Multimodal learning with graphs. Nature Machine Intelligence, 5(4):340–350, 2023.
[41] Bin Li, Michael S Nelson, Omid Savari, Agnes G Loeffler, and Kevin W Eliceiri. Differentiation of pancreatic ductal adenocarcinoma and chronic pancreatitis using graph neural networks on histopathology and collagen fiber features. Journal of Pathology Informatics, 13:100158, 2022.
[42] Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot. Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical image analysis, 58:101563, 2019.
[43] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
[44] Babak Ehteshami Bejnordi, Geert Litjens, Meyke Hermsen, Nico Karssemeijer, and Jeroen AWM van der Laak. A multi-scale superpixel classification approach to the detection of regions of interest in whole slide histopathology images. In Medical Imaging 2015: Digital Pathology, volume 9420, pages 99–104. SPIE, 2015.
[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[46] Atharva Tendle and Mohammad Rashedul Hasan. A study of the generalizability of self-supervised representations. Machine Learning with Applications, 6:100124, 2021.
[47] Zhiyang Gao, Jun Shi, and Jun Wang. Gq-gcn: Group quadratic graph convolutional network for classification of histopathological images. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24, pages 121–131. Springer, 2021.
[48] Mo Zhang, Bin Dong, and Quanzheng Li. Ms-gwnn: multi-scale graph wavelet neural network for breast cancer diagnosis. In 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2022.
[49] Wentai Hou, Lequan Yu, Chengxuan Lin, Helong Huang, Rongshan Yu, Jing Qin, and Liansheng Wang. H2-mil: Exploring hierarchical representation with heterogeneous multiple instance learning for whole slide image analysis. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 933–941, 2022.
[50] Ramin Nakhli, Allen Zhang, Ali Mirabadi, Katherine Rich, Maryam Asadi, Blake Gilks, Hossein Farahani, and Ali Bashashati. Co-pilot: Dynamic top-down point cloud with conditional neighborhood aggregation for multi-gigapixel histopathology image representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21063–21073, 2023.
[51] Shidan Wang, Ruichen Rong, Qin Zhou, Donghan M Yang, Xinyi Zhang, Xiaowei Zhan, Justin Bishop, Zhikai Chi, Clare J Wilhelm, Siyuan Zhang, et al. Deep learning of cell spatial organizations identifies clinically relevant insights in tissue images. Nature communications, 14(1):7872, 2023.
[52] Valentin Anklin, Pushpak Pati, Guillaume Jaume, Behzad Bozorgtabar, Antonio Foncubierta-Rodríguez, Jean-Philippe Thiran, Mathilde Sibony, Maria Gabrani, and Orcun Goksel. Learning whole-slide segmentation from inexact and incomplete labels using tissue graphs. arXiv preprint arXiv:2103.03129, 2021.
[53] Jun Zhang, Zhiyuan Hua, Kezhou Yan, Kuan Tian, Jianhua Yao, Eryun Liu, Mingxia Liu, and Xiao Han. Joint fully convolutional and graph convolutional networks for weakly-supervised segmentation of pathology images. Medical image analysis, 73:102183, 2021.
[54] PengHui He, AiPing Qu, ShuoMin Xiao, and MeiDan Ding. A gnn-based network for tissue semantic segmentation in histopathology image. In Journal of Physics: Conference Series, volume 2504, page 012047. IOP Publishing, 2023.
[55] Sachin S Bahade, Michael Edwards, and Xianghua Xie. Cascaded graph convolution approach for nuclei detection in histopathology images. Journal of Image and Graphics, 11(1), 2023.
[56] Zhi Wang, Kai Fan, Xiaoya Zhu, Honglei Liu, Gang Meng, Minghui Wang, and Ao Li. Cross-domain nuclei detection in histopathology images using graph-based nuclei feature alignment. IEEE Journal of Biomedical and Health Informatics, 2023.
[57] Marta Wojciechowska, Stefano Malacrino, Natalia Garcia Martin, Hamid Fehri, and Jens Rittscher. Early detection of liver fibrosis using graph convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 217–226. Springer, 2021.
[58] Aravind Nair, Helena Arvidsson, Jorge E Gatica V, Nikolce Tudzarovski, Karl Meinke, and Rachael V Sugars. A graph neural network framework for mapping histological topology in oral mucosal tissue. BMC bioinformatics, 23(1):506, 2022.
[59] Amaya Gallagher-Syed, Luca Rossi, Felice Rivellese, Costantino Pitzalis, Myles Lewis, Michael Barnes, and Gregory Slabaugh. Multi-stain self-attention graph multiple instance learning pipeline for histopathology whole slide images. arXiv preprint arXiv:2309.10650, 2023.
[60] Joonsang Lee, Elisa Warner, Salma Shaikhouni, Markus Bitzer, Matthias Kretzler, Debbie Gipson, Subramaniam Pennathur, Keith Bellovich, Zeenat Bhat, Crystal Gadegbeku, et al. Clustering-based spatial analysis (clusa) framework through graph neural network for chronic kidney disease prediction using histopathology images. Scientific Reports, 13(1):12701, 2023.
[61] Ran Su, Hao He, Changming Sun, Xiaomin Wang, and Xiaofeng Liu. Prediction of drug-induced hepatotoxicity based on histopathological whole slide images. Methods, 212:31–38, 2023.
[62] Vasundhara Acharya, Diana Choi, Bülent Yener, and Gillian Beamer. Prediction of tuberculosis from lung tissue images of diversity outbred mice using jump knowledge based cell graph neural network. IEEE Access, 2024.
[63] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnnexplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32, 2019.
[64] Lucie Charlotte Magister, Dmitry Kazhdan, Vikash Singh, and Pietro Liò. Gcexplainer: Human-in-the-loop concept-based explanations for graph neural networks. arXiv preprint arXiv:2107.11889, 2021.
[65] Guillaume Jaume, Pushpak Pati, Antonio Foncubierta-Rodriguez, Florinda Feroce, Giosue Scognamiglio, Anna Maria Anniciello, Jean-Philippe Thiran, Orcun Goksel, and Maria Gabrani. Towards explainable graph representations in digital pathology. arXiv preprint arXiv:2007.00311, 2020.
[66] Junchi Yu, Tingyang Xu, and Ran He. Towards the explanation of graph neural networks in digital pathology with information flows. arXiv preprint arXiv:2112.09895, 2021.
[67] Sina Abdous, Reza Abdollahzadeh, and Mohammad Hossein Rohban. Ks-gnnexplainer: Global model interpretation through instance explanations on histopathology images. arXiv preprint arXiv:2304.08240, 2023.
[68] Alessandro Farace di Villaforesta, Lucie Charlotte Magister, Pietro Barbiero, and Pietro Liò. Digital histopathology with graph neural networks: Concepts and explanations for clinicians. arXiv preprint arXiv:2312.02225, 2023.
[69] Pushpak Pati, Guillaume Jaume, Antonio Foncubierta-Rodriguez, Florinda Feroce, Anna Maria Anniciello, Giosue Scognamiglio, Nadia Brancati, Maryse Fiche, Estelle Dubruc, Daniel Riccio, et al. Hierarchical graph representations in digital pathology. Medical image analysis, 75:102264, 2022.
[70] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31, 2018.
[71] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In International conference on machine learning, pages 3734–3743. PMLR, 2019.
[72] Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. Spectral clustering with graph neural networks for graph pooling. In International conference on machine learning, pages 874–883. PMLR, 2020.
[73] Wentai Hou, Helong Huang, Qiong Peng, Rongshan Yu, Lequan Yu, and Liansheng Wang. Spatial-hierarchical graph neural network with dynamic structure learning for histological image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 181–191. Springer, 2022.
[74] Jiangbo Shi, Lufei Tang, Zeyu Gao, Yang Li, Chunbao Wang, Tieliang Gong, Chen Li, and Huazhu Fu. Mg-trans: Multi-scale graph transformer with information bottleneck for whole slide image classification. IEEE Transactions on Medical Imaging, 2023.
[75] Puria Azadi, Jonathan Suderman, Ramin Nakhli, Katherine Rich, Maryam Asadi, Sonia Kung, Htoo Oo, Mira Keyes, Hossein Farahani, Calum MacAulay, et al. All-in: A local global graph-based distillation model for representation learning of gigapixel histopathology images with application in cancer risk assessment. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 765–775. Springer, 2023.
[76] Hongxiao Wang, Gang Huang, Zhuo Zhao, Liang Cheng, Anna Juncker-Jensen, Máté Levente Nagy, Xin Lu, Xiangliang Zhang, and Danny Z Chen. Ccf-gnn: A unified model aggregating appearance, microenvironment, and topology for pathology image classification. IEEE Transactions on Medical Imaging, 2023.
[77] Weiqin Zhao, Shujun Wang, Maximus Yeung, Tianye Niu, and Lequan Yu. Mulgt: Multi-task graph-transformer with task-aware knowledge injection and domain knowledge-driven pooling for whole slide image analysis. arXiv preprint arXiv:2302.10574, 2023.
[78] Saisai Ding, Zhiyang Gao, Jun Wang, Minhua Lu, and Jun Shi. Fractal graph convolutional network with mlp-mixer based multi-path feature fusion for classification of histopathological images. Expert Systems with Applications, 212:118793, 2023.
[79] Yonghao Li, Yiqing Shen, Jiadong Zhang, Shujie Song, Zhenhui Li, Jing Ke, and Dinggang Shen. A hierarchical graph v-net with semi-supervised pre-training for histological image based breast cancer classification. IEEE Transactions on Medical Imaging, 2023.
[80] Yanning Zhou, Simon Graham, Navid Alemi Koohbanani, Muhammad Shaban, Pheng-Ann Heng, and Nasir Rajpoot. Cgc-net: Cell graph convolutional network for grading of colorectal cancer histology images. In Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019.
[81] Yushan Zheng, Zhiguo Jiang, Fengying Xie, Jun Shi, Haopeng Zhang, Jianguo Huai, Ming Cao, and Xiaomiao Yang. Diagnostic regions attention network (dra-net) for histopathology wsi recommendation and retrieval. IEEE transactions on medical imaging, 40(3):1090–1103, 2020.
[82] Nan Jiang, Yaqing Hou, Dongsheng Zhou, Pengfei Wang, Jianxin Zhang, and Qiang Zhang. Weakly supervised gleason grading of prostate cancer slides using graph neural network. In ICPRAM, pages 426–434, 2021.
[83] Yushan Zheng, Zhiguo Jiang, Haopeng Zhang, Fengying Xie, Jun Shi, and Chenghai Xue. Histopathology wsi encoding based on gcns for scalable and efficient retrieval of diagnostically relevant regions. arXiv preprint arXiv:2104.07878, 2021.
[84] Zichen Wang, Jiayun Li, Zhufeng Pan, Wenyuan Li, Anthony Sisk, Huihui Ye, William Speier, and Corey W Arnold. Hierarchical graph pathomic network for progression free survival prediction. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24, pages 227–237. Springer, 2021.
[85] Xu Xiang and Xiaofeng Wu. Multiple instance classification for gastric cancer pathological images based on implicit spatial topological structure representation. Applied Sciences, 11(21):10368, 2021.
[86] Chensu Xie, Chad Vanderbilt, Chao Feng, David Ho, Gabrielle Campanella, Jacklynn Egger, Andrew Plodkowski, Jeffrey Girshman, Peter Sawan, Kathryn Arbour, et al. Computational biomarker predicts lung ici response via deep learning-driven hierarchical spatial modelling from h&e. 2022.
[87] Chaitanya Dwivedi, Shima Nofallah, Maryam Pouryahya, Janani Iyer, Kenneth Leidal, Chuhan Chung, Timothy Watkins, Andrew Billin, Robert Myers, John Abel, et al. Multi stain graph fusion for multimodal integration in pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1835–1845, 2022.
[88] Yu Bai, Yue Mi, Yihan Su, Bo Zhang, Zheng Zhang, Jingyun Wu, Haiwen Huang, Yongping Xiong, Xiangyang Gong, and Wendong Wang. A scalable graph-based framework for multi-organ histology image classification. IEEE Journal of Biomedical and Health Informatics, 26(11):5506–5517, 2022.
[89] Yingli Zuo, Yawen Wu, Zixiao Lu, Qi Zhu, Kun Huang, Daoqiang Zhang, and Wei Shao. Identify consistent imaging genomic biomarkers for characterizing the survival-associated interactions between tumor-infiltrating lymphocytes and tumors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 222–231. Springer, 2022.
[90] Seohoon Lim and Seung-Won Jung. A comparative study on graph construction methods for survival prediction using histopathology images. In 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), pages 1–4. IEEE, 2022.
[91] Ruiwen Ding, Erika Rodriguez, Ana Cristina Araujo Lemos Da Silva, and William Hsu. Using graph neural networks to capture tumor spatial relationships for lung adenocarcinoma recurrence prediction. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2023.
[92] Yawen Wu, Yingli Zuo, Qi Zhu, Jianpeng Sheng, Daoqiang Zhang, and Wei Shao. Transfer learning-assisted survival analysis of breast cancer relying on the spatial interaction between tumor-infiltrating lymphocytes and tumors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 612–621. Springer, 2023.
[93] Wentai Hou, Yan He, Bingjian Yao, Lequan Yu, Rongshan Yu, Feng Gao, and Liansheng Wang. Multi-scope analysis driven hierarchical graph transformer for whole slide image based cancer survival prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 745–754. Springer, 2023.
[94] Syed Farhan Abbas, Trinh Thi Le Vuong, Kyungeun Kim, Boram Song, and Jin Tae Kwak. Multi-cell type and multi-level graph aggregation network for cancer grading in pathology images. Medical Image Analysis, 90:102936, 2023.
[95] Jichen Xu, Jingmin Xin, Peiwen Shi, Jiayi Wu, Zheng Cao, Xiaoli Feng, and Nanning Zheng. Lymphoma recognition in histology image of gastric mucosal biopsy with prototype learning. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 1–4. IEEE, 2023.
[96] Zarif L Azher, Michael Fatemi, Yunrui Lu, Gokul Srinivasan, Alos B Diallo, Brock C Christensen, Lucas A Salas, Fred W Kolling IV, Laurent Perreard, Scott M Palisoul, et al. Spatial omics driven crossmodal pretraining applied to graph-based deep learning for cancer pathology analysis. In Pacific Symposium on Biocomputing 2024, pages 464–476. World Scientific, 2023.
[97] Zijian Yang, Yibo Zhang, Lili Zhuo, Kaidi Sun, Fanling Meng, Meng Zhou, and Jie Sun. Prediction of prognosis and treatment response in ovarian cancer patients from histopathology images using graph deep learning: a multicenter retrospective study. European Journal of Cancer, 199:113532, 2024.
[98] Joe Sims, Heike I Grabsch, and Derek Magee. Using hierarchically connected nodes and multiple gnn message passing steps to increase the contextual information in cell-graph classification. In MICCAI Workshop on Imaging Systems for GI Endoscopy, pages 99–107. Springer, 2022.
[99] Yonghang Guan, Jun Zhang, Kuan Tian, Sen Yang, Pei Dong, Jinxi Xiang, Wei Yang, Junzhou Huang, Yuyao Zhang, and Xiao Han. Node-aligned graph convolutional network for whole-slide image representation and classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18813–18823, 2022.
[100] Jiangbo Shi, Lufei Tang, Yang Li, Xianli Zhang, Zeyu Gao, Yefeng Zheng, Chunbao Wang, Tieliang Gong, and Chen Li. A structure-aware hierarchical graph-based multiple instance learning framework for pt staging in histopathological image. IEEE Transactions on Medical Imaging, 2023.
[101] Ravi Kant Gupta, Nikhil Cherian Kurian, Pranav Jeevan, Amit Sethi, et al. Heterogeneous graphs model spatial relationships between biological entities for breast cancer diagnosis. arXiv preprint arXiv:2307.08132, 2023.
[102] Xiaodan Xing, Yixin Ma, Lei Jin, Tianyang Sun, Zhong Xue, Feng Shi, Jinsong Wu, and Dinggang Shen. A multi-scale graph network with multi-head attention for histopathology image diagnosis. In COMPAY 2021: The third MICCAI workshop on Computational Pathology, 2021.
[103] Roozbeh Bazargani, Ladan Fazli, Larry Goldenberg, Martin Gleave, Ali Bashashati, and Septimiu Salcudean. Multi-scale relational graph convolutional network for multiple instance learning in histopathology images. arXiv preprint arXiv:2212.08781, 2022.
[104] Gianpaolo Bontempo, Angelo Porrello, Federico Bolelli, Simone Calderara, and Elisa Ficarra. Das-mil: Distilling across scales for mil classification of histological wsis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 248–258. Springer, 2023.
[105] Pei Liu, Luping Ji, Feng Ye, and Bo Fu. Graphlsurv: A scalable survival prediction network with adaptive and sparse structure learning for histopathological whole-slide images. Computer Methods and Programs in Biomedicine, 231:107433, 2023.
[106] Zhiyang Gao, Zhiyang Lu, Jun Wang, Shihui Ying, and Jun Shi. A convolutional neural network and graph convolutional network based framework for classification of breast histopathological images. IEEE Journal of Biomedical and Health Informatics, 26(7):3163–3173, 2022.
[107] Kexin Ding, Mu Zhou, Zichen Wang, Qiao Liu, Corey W Arnold, Shaoting Zhang, and Dimitri N Metaxas. Graph convolutional networks for multi-modality medical imaging: Methods, architectures, and clinical applications. arXiv preprint arXiv:2202.08916, 2022.
[108] Yanyun Jiang, Shuai Ma, Wei Xiao, Jing Wang, Yanhui Ding, Yuanjie Zheng, and Xiaodan Sui. Predicting egfr gene mutation status in lung adenocarcinoma based on multifeature fusion. Biomedical Signal Processing and Control, 84:104786, 2023.
[109] Wei Xiao, Yanyun Jiang, Zhigang Yao, Xiaoming Zhou, Xiaodan Sui, and Yuanjie Zheng. Lad-gcn: Automatic diagnostic framework for quantitative estimation of growth patterns during clinical evaluation of lung adenocarcinoma. Frontiers in Physiology, 13:946099, 2022.
[110] Arijit De, Radhika Mhatre, Mona Tiwari, and Ananda S Chowdhury. Brain tumor classification from radiology and histopathology using deep features and graph convolutional network. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 4420–4426. IEEE, 2022.
[111] Yuzhang Xie, Guoshuai Niu, Qian Da, Wentao Dai, and Yang Yang. Survival prediction for gastric cancer via multimodal learning of whole slide images and gene expression. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1311–1316. IEEE, 2022.
[112] Yi Zheng, Regan D Conrad, Emily J Green, Eric J Burks, Margrit Betke, Jennifer E Beane, and Vijaya B Kolachalama. Graph attention-based fusion of pathology images and gene expression for prediction of cancer survival. bioRxiv, 2023.
[113] Michael Fatemi, Eric Feng, Cyril Sharma, Zarif Azher, Tarushii Goel, Ojas Ramwala, Scott M Palisoul, Rachael E Barney, Laurent Perreard, Fred W Kolling, et al. Inferring spatial transcriptomics markers from whole slide images to characterize metastasis-related spatial heterogeneity of colorectal tumors: A pilot study. Journal of Pathology Informatics, 14:100308, 2023.
[114] Ruitian Gao, Xin Yuan, Yanran Ma, Ting Wei, Luke Johnston, Yanfei Shao, Wenwen Lv, Tengteng Zhu, Yue Zhang, Junke Zheng, et al. Predicting gene spatial expression and cancer prognosis: An integrated graph and image deep learning approach based on he slides. bioRxiv, 2023.
[115] Lida Qiu, Deyong Kang, Chuan Wang, Wenhui Guo, Fangmeng Fu, Qingxiang Wu, Gangqin Xi, Jiajia He, Liqin Zheng, Qingyuan Zhang, et al. Intratumor graph neural network recovers hidden prognostic value of multi-biomarker spatial heterogeneity. Nature communications, 13(1):4250, 2022.
[116] Pushpak Pati, Sofia Karkampouna, Francesco Bonollo, Eva Comperat, Martina Radic, Martin Spahn, Adriano Martinelli, Martin Wartenberg, Marianna Kruithof-de Julio, and Maria Anna Rapsomaniki. Multiplexed tumor profiling with generative ai accelerates histopathology workflows and improves clinical predictions. bioRxiv, 2023.
[117] Mathilde Papillon, Sophia Sanborn, Mustafa Hajij, and Nina Miolane. Architectures of topological deep learning: A survey on topological neural networks. arXiv preprint arXiv:2304.10031, 2023.
[118] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3558–3565, 2019.
[119] Kaize Ding, Jianling Wang, Jundong Li, Dingcheng Li, and Huan Liu. Be more with less: Hypergraph attention networks for inductive text classification. arXiv preprint arXiv:2011.00387, 2020.
[120] Donglin Di, Shengrui Li, Jun Zhang, and Yue Gao. Ranking-based survival prediction on histopathological whole-slide images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 428–438. Springer, 2020.
[121] Ahsan Baidar Bakht, Sajid Javed, Hasan AlMarzouqi, Ahsan Khandoker, and Naoufel Werghi. Colorectal cancer tissue classification using semi-supervised hypergraph convolutional network. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1306–1309. IEEE, 2021.
[122] Donglin Di, Changqing Zou, Yifan Feng, Haiyan Zhou, Rongrong Ji, Qionghai Dai, and Yue Gao. Generating hypergraph-based high-order representations of whole-slide histopathological images for survival prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5800–5815, 2022.
[123] Hakim Benkirane, Maria Vakalopoulou, Stergios Christodoulidis, Ingrid-Judith Garberis, Stefan Michiels, and Paul-Henry Cournède. Hyper-adac: Adaptive clustering-based hypergraph representation of whole slide images for survival analysis. In Machine Learning for Health, pages 405–418. PMLR, 2022.
[124] Meiyan Liang, Xing Jiang, Jie Cao, Bo Li, Lin Wang, Qinghui Chen, Cunlin Zhang, and Yuejin Zhao. Caf-ahgcn: context-aware attention fusion adaptive hypergraph convolutional network for human-interpretable prediction of gigapixel whole-slide image. The Visual Computer, pages 1–19, 2024.
[125] P. S. Chandran, N. B. Byju, R. Deepak, R. Rajesh Kumar, S. Sudhamony, P. Malm, and E. Bengtsson. Cluster detection in cytology images using the cellgraph method. 2012 International Symposium on Information Technologies in Medicine and Education, 2:923–927, 2012.
[126] Erxue Min, Runfa Chen, Yatao Bian, Tingyang Xu, Kangfei Zhao, Wenbing Huang, Peilin Zhao, Junzhou Huang, Sophia Ananiadou, and Yu Rong. Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455, 2022.
[127] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
[128] Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.
[129] Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J Sutherland, and Ali Kemal Sinop. Exphormer: Sparse transformers for graphs. arXiv preprint arXiv:2303.06147, 2023.
[130] Qitian Wu, Wentao Zhao, Chenxiao Yang, Hengrui Zhang, Fan Nie, Haitian Jiang, Yatao Bian, and Junchi Yan. Simplifying and empowering transformers for large-graph representations. Advances in Neural Information Processing Systems, 36, 2024.
[131] So Yeon Kim. Gnn-surv: Discrete-time survival prediction using graph neural networks. Bioengineering, 10(9):1046, 2023.
[132] Jianliang Gao, Tengfei Lyu, Fan Xiong, Jianxin Wang, Weimao Ke, and Zhao Li. Mgnn: A multimodal graph neural network for predicting the survival of cancer patients. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1697–1700, 2020.
[133] Juan G Diaz Ochoa and Faizan E Mustafa. Graph neural network modelling as a potentially effective method for predicting and analyzing procedures based on patients’ diagnoses. Artificial Intelligence in Medicine, 131:102359, 2022.
[134] Emma Rocheteau, Catherine Tong, Petar Veličković, Nicholas Lane, and Pietro Liò. Predicting patient outcomes with graph representation learning. arXiv preprint arXiv:2101.03940, 2021.
[135] Hirad Daneshvar and Reza Samavi. Heterogeneous patient graph embedding in readmission prediction. In AI, 2022.
[136] Shuai Zheng, Zhenfeng Zhu, Zhizhe Liu, Zhenyu Guo, Yang Liu, Yuchen Yang, and Yao Zhao. Multi-modal graph learning for disease prediction. IEEE Transactions on Medical Imaging, 41(9):2207–2216, 2022.
[137] Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. Smil: Multimodal learning with severely missing modality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2302–2310, 2021.
[138] Yigit Ozen, Selim Aksoy, Kemal Kösemehmetoğlu, Sevgen Önder, and Ayşegül Üner. Self-supervised learning with graph neural networks for region of interest retrieval in histopathology. In 2020 25th International conference on pattern recognition (ICPR), pages 6329–6334. IEEE, 2021.
[139] Oscar Pina and Verónica Vilaplana. Self-supervised graph representations of wsis. In Geometric Deep Learning in Medical Image Analysis, pages 107–117. PMLR, 2022.
[140] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
[141] Hao Yan, Senzhang Wang, Jun Yin, Chaozhuo Li, Junxing Zhu, and Jianxin Wang. Hierarchical graph contrastive learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 700–715. Springer, 2023.
[142] Yingfu Zhao, Fusheng Jin, Ronghua Li, Hongchao Qin, Peng Cui, and Guoren Wang. Self-attention hypergraph pooling network. International Journal of Software & Informatics, 13(4), 2023.
[143] Domenico Mattia Cinque, Claudio Battiloro, and Paolo Di Lorenzo. Pooling strategies for simplicial convolutional networks. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[144] Zhiqiang Zhong, Cheng-Te Li, and Jun Pang. Hierarchical message-passing graph neural networks. Data Mining and Knowledge Discovery, 37(1):381–408, 2023.
[145] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, et al. Virchow: A million-slide digital pathology foundation model. arXiv preprint arXiv:2309.07778, 2023.
[146] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Andrew Zhang, Long Phi Le, et al. Towards a visual-language foundation model for computational pathology. arXiv preprint arXiv:2307.12914, 2023.
[147] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316, 2023.
[148] Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang. Multimodal foundation models for echocardiogram interpretation. arXiv preprint arXiv:2308.15670, 2023.
[149] Josh Gardner, Simon Durand, Daniel Stoller, and Rachel M Bittner. Llark: A multimodal foundation model for music. arXiv preprint arXiv:2310.07160, 2023.
[150] Yizhen Luo, Kai Yang, Massimo Hong, Xingyi Liu, and Zaiqing Nie. Molfm: A multimodal molecular foundation model. arXiv preprint arXiv:2307.09484, 2023.
[151] Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Yuanqi Du, Jieyu Zhang, Qiang Liu, Carl Yang, and Shu Wu. A survey on graph structure learning: Progress and opportunities. arXiv preprint arXiv:2103.03036, 2021.
[152] Jianan Zhao, Xiao Wang, Chuan Shi, Binbin Hu, Guojie Song, and Yanfang Ye. Heterogeneous graph structure learning for graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 4697–4705, 2021.