¹ University of Sydney, Camperdown, NSW 2050, Australia
² Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
³ University of Maryland at College Park, College Park, MD 20742, USA
⁴ Microsoft, Redmond, WA 98052, USA

Revisiting Adaptive Cellular Recognition Under
Domain Shifts: A Contextual Correspondence View

Jianan Fan¹, Dongnan Liu¹, Canran Li¹, Hang Chang², Heng Huang³, Filip Braet¹, Mei Chen⁴, Weidong Cai¹

Abstract

Cellular nuclei recognition serves as a fundamental and essential step in the workflow of digital pathology. However, with disparate source organs and staining procedures among histology image clusters, the scanned tiles inherently conform to a non-uniform data distribution, which degrades model reliability in general cross-cohort use. Despite the latest efforts leveraging domain adaptation to mitigate distributional discrepancy, those methods are limited to modeling the morphological characteristics of each cell individually, disregarding the hierarchical latent structure and intrinsic contextual correspondences across the tumor micro-environment. In this work, we identify the importance of implicit correspondences across biological contexts for exploiting domain-invariant pathological composition and thereby propose to exploit the dependence over various biological structures for domain adaptive cellular recognition. We discover those high-level correspondences via unsupervised contextual modeling and use them as bridges to facilitate adaptation over diverse organs and stains. In addition, to further exploit the rich spatial contexts embedded amongst nuclear communities, we propose self-adaptive dynamic distillation to secure instance-aware trade-offs across different model constituents. The proposed method is extensively evaluated on a broad spectrum of cross-domain settings under miscellaneous data distribution shifts and outperforms the state-of-the-art methods by a substantial margin. Code is available at https://github.com/camwew/Cellular-Recognition_DA_CC.

1 Introduction

In the routine of pathological examination, cellular recognition, which aims to identify the specific type of each cell nucleus and segment it, contributes fundamentally to the success of various downstream clinical tasks [28, 39]. Despite the numerous learning-based approaches proposed in this context [36, 21, 15], those efforts are founded upon an ill-considered assumption that histopathology imaging data is uniformly distributed across cohorts, which dismisses the potential collapse incurred by cross-organ/stain variances [9, 11]. In digital pathology practice, query samples could be acquired from divergent organs with inconsistent staining procedures [23, 37]. The inherent data distribution shifts are detrimental to the robustness and general applicability of learned recognition models. Quantitative evidence is shown in Fig. 1(a): the precision of cellular subtyping degrades drastically when evaluated across domains, though the performance of class-agnostic segmentation is relatively robust.

Figure 1: (a) Illustrative results for cellular recognition under domain shifts. $X \rightarrow Y$ denotes that the model is trained with data from organ $X$ and then evaluated on organ $Y$. bDice and bPQ measure the accuracy of class-agnostic segmentation. $F^{*}$ denotes the $F$ score for different nuclear types, indicating subtyping accuracy. (b) Schematic diagram of the hierarchical nature of latent variables. The pathological composition principle of tumor micro-environment inherits fundamental invariance regardless of the underlying confounding factors such as sampling organs and staining protocols, holding great promises to formalize domain-agnostic biomarkers.

A promising solution is to leverage well-annotated data from one source cohort and then perform unsupervised domain adaptation (UDA) [26, 19, 8] to transfer the learned domain-agnostic knowledge to another data collection with disparate distribution. However, the efficacy of existing UDA methods in the context of cross-domain cellular recognition is intrinsically undermined by the following limitations: i) Most of these efforts are limited to aligning low-level visual attributes in a category-agnostic manner [40, 41, 7]. Without exploiting discriminative contextual characteristics during adaptation, there is no guarantee that cells of different subtypes can be well separated under domain shifts. ii) In spite of the recent endeavors to perform class-conditioned alignment [45, 24, 25], a major obstacle is that they either depend on feature representations from the source domain or require pseudo-labeling to categorize objects in the target domain. In digital pathology, the large discrepancy between the source and target domain at the feature distribution level and the unreliability of pseudo-labels on cellular species would inevitably lead to error accumulation and biased alignment.

Figure 2: Exemplary H&E-stained histology tiles. In each sub-figure, red, yellow, blue, and green rectangles correspond to the nuclei of neoplastic, epithelial, connective, and inflammatory cells, respectively.

iii) Most importantly, those approaches seek to devise instance-level alignment strategies to transfer object semantics, i.e., the morphological characteristics (e.g., appearance, texture, and shape) of each individual object, to facilitate adaptation across domains [4, 46, 2]. However, cellular recognition differs from general perception tasks in computer vision in its strong reliance on context. Specifically, other than object semantics, the underpinning pathological composition principle and the resultant relational correspondences across biological structures are also indispensable for distinguishing visually ambiguous cells. As illustrated in Fig. 1(b), we characterize the composing process of tumor micro-environment with a hierarchical formulation between latent variables and representations. With conjoint parents in the directed acyclic graph [31], the representations for nuclear clusters and adjacent tissues are statistically dependent. The implicit correspondences among latent representations also have observable implications for biological structure traits. As Fig. 2 shows, the tissue structures surrounding neoplastic and epithelial cells demonstrate significant visual differences, despite the analogous morphological properties of the nuclei themselves. This suggests that jointly modeling cells and surrounding tissues and exploiting their correspondences are advantageous for recognizing cells with ambiguous semantics [33]. Furthermore, the same biological correlations imply that the spatial contexts among neighboring nuclei in nuclear clusters are informative and could behave as discriminative markers for cell identification [1].

In this work, we propose a novel framework harnessing the inherent biological correspondences across the pathological environment for cross-domain nuclei recognition. Our insight lies in leveraging the correspondences between observable biological structures to formulate a suite of surrogate tasks, from which the underlying pathological composition principle can be implicitly learned and exploited. The unveiled pathogenesis principle inherits coherence regardless of the confounding factors such as sampling organs and staining procedures, and can therefore behave as domain-invariant knowledge. By feeding neural models with the explicit invariance, we present an elegant solution to overcome shortcut learning biased from spurious correlations, which has been regarded as the primary determinant for model collapse under domain shifts [34]. Specifically, we devise a multifaceted self-discovery scheme to uncover the correspondences across biological structures over complementary levels and intrinsically learn the coherent data-genesis principle to endow the adapted model with general serviceability. This is achieved by devising pretext tasks that perform counterfactual nuclei masking to explore the implicit dependencies in an unsupervised manner. Moreover, we propose self-adaptive dynamic distillation to further leverage the correspondences within nuclear clusters via exploiting the spatial contexts of neighbouring nuclei and accordingly conduct instance-adaptive rectifications on model outputs.

In a nutshell, our contributions are three-fold: i) We develop a novel framework for cross-domain cellular nuclei recognition which, for the first time, goes beyond ambiguous object semantics and proposes to leverage the latent hierarchy in the pathological composing process and the implicit biological correspondences as bridges to foster model adaptation. ii) We propose the hierarchical self-discovery and instance-adaptive rectification methodologies to exploit the multifaceted biological correspondences for learning high-level composition principle via surrogate tasks, without the need for prior pathological knowledge. iii) We comprehensively evaluate our proposed method and demonstrate its effectiveness on diverse cross-domain settings, attaining remarkable improvements over state-of-the-art UDA methods.

2 Related Work

UDA in Biomedical Images.  Aiming at mitigating the distributional bias and learning transferable knowledge across data cohorts, unsupervised domain adaptation (UDA) [47, 35] is raised as a popular line of research. Representative works for UDA propose to mitigate the discrepancy across domains from the image appearance and feature representation level. To accomplish appearance-level alignment, a generative or style hallucinative model is introduced to transform source domain images to synthetic target-like ones [29, 26, 43]. With respect to content representation-level alignment, distributional disparity regularization and adversarial optimization are commonly adopted approaches to derive domain-invariant representation [3, 38, 42]. Those methodologies have shown impressive efficacy for alleviating cross-domain gap in various scenarios such as anatomical structure segmentation [14] and nucleus detection [40].

Domain Adaptive Object Recognition.  In the context of cross-domain object recognition, the prevailing approach is employing region-of-interest (ROI) based domain discriminators to align instance-level features from different domains [4]. Following a similar spirit, multi-branch architectures are further devised to compensate for the model collapse issue and resort to securing the best trade-offs between domain-specific and domain-invariant attributes [5, 17]. Style hallucination techniques are also utilized to mitigate domain discrepancy from the image appearance level [20]. However, the aforementioned approaches stand without consideration of category-related characteristics along domain alignment and therefore tend to incur negative transfer across irrelevant entity types. In recent literature, several efforts [45, 46, 22] have posited the beneficial practice of performing class-wise alignment and harnessing category pseudo-labels from the target domain. For instance, [17] proposes a multi-scale adversarial training paradigm which jointly minimizes image-level domain discrepancy and aligns semantic representation across domains in a class-aware manner. Nonetheless, considering the model vulnerability in cross-domain scenarios for the nuclei recognition task and the resulting biased pseudo-labels, securing precise class-wise alignment across distinct nuclei types is deemed impractical. Analogous to our work, [41, 24] study category-aware nuclei recognition under domain shifts. However, they adopt approaches similar to [45] and are therefore suboptimal due to the lack of consideration for domain-invariant biological composition and the unstable pseudo-label learning process.

3 Methodology

3.1 A Hierarchical View on Pathology Data Genesis

As a starting point, we characterize the underpinning pathological genesis process with a hierarchical latent variable model [18], as depicted in Fig. 1(b). The hierarchical dependence structure can be formally described as follows:

Proposition 1 (Hierarchical Formulation of Latent Variables)

Let $\mathcal{G}^{*}$ be the directed acyclic graph describing the causal structure of latent variables, with the sets of nodes and edges denoted as $\mathcal{V}$ and $\mathcal{E}$. $\mathcal{V}$ is composed of the measurable representations $\mathbf{R}$ for biological structures as well as the high-level latent variables $\mathbf{S}$. Then, the underlying data genesis procedure can be characterized with the following structural formulations: $\mathbf{R}_{i}=\sum_{\mathbf{S}_{j}\in Pa(\mathbf{R}_{i})}p_{ij}(\mathbf{S}_{j})+Q(\epsilon_{i})$, where $Pa(\cdot)$ represents the set of parents for a certain node, $p_{(\cdot,\cdot)}$ denotes the causal dependence function, and $Q(\epsilon)$ is the probability distribution over exogenous random variables $\{\epsilon_{i}\}$.

Based upon the principle, with the high-level pathological composition mechanism denoted as a latent variable $\mathbf{S}_{c}$, we have $\mathbf{S}_{c}$ as the conjoint ancestor for different biological structures, namely $\mathbf{R}_{c}\leftarrow\mathbf{S}_{c}\rightarrow\mathbf{R}_{t}$, where $\mathbf{R}_{c}$ and $\mathbf{R}_{t}$ stand for the representations of cells and tissues.
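
As an illustration of why a shared parent makes the cell and tissue representations statistically dependent, the following toy simulation draws scalar samples from a linear version of the common-cause structure. The linear dependence functions, the coefficients, the Gaussian noise, and the function name `simulate_common_cause` are all assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def simulate_common_cause(n=10_000, seed=0):
    """Draw samples from a toy linear version of R_c <- S_c -> R_t."""
    rng = np.random.default_rng(seed)
    s_c = rng.normal(size=n)               # shared latent parent S_c
    eps_c = rng.normal(scale=0.5, size=n)  # exogenous noise for the cell representation
    eps_t = rng.normal(scale=0.5, size=n)  # exogenous noise for the tissue representation
    r_c = 2.0 * s_c + eps_c                # assumed linear dependence of R_c on S_c
    r_t = -1.5 * s_c + eps_t               # assumed linear dependence of R_t on S_c
    return s_c, r_c, r_t

s_c, r_c, r_t = simulate_common_cause()

# Marginally, R_c and R_t are strongly correlated through their shared parent.
marginal_corr = np.corrcoef(r_c, r_t)[0, 1]

# Subtracting the contribution of S_c (known in this toy setting) leaves only
# the independent noise terms, so the correlation largely disappears.
residual_corr = np.corrcoef(r_c - 2.0 * s_c, r_t + 1.5 * s_c)[0, 1]

print(f"corr(R_c, R_t) = {marginal_corr:.2f}; after removing S_c = {residual_corr:.2f}")
```

In this toy setting the dependence between the two observed quantities is carried entirely by the latent parent, which is the property the correspondence-discovery tasks aim to exploit.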

Figure 3: Conceptual illustration of the insight. To exploit the intrinsic pathological composition principle which inherits cross-domain coherence, we propose to devise self-supervised surrogate tasks to discover multifaceted biological correspondences, from which the high-level principle variables can be implicitly learned to endow the model with strengthened generalizability.

It implies that the correspondences between those biological structures can be traced back to the high-level latent factor $\mathbf{S}_{c}$, which holds fundamental invariance across domains. As Fig. 3 illustrates, $\mathbf{S}_{c}$ serves as an intermediate node in the correspondence discovery chain (e.g., $\mathbf{R}_{t}\xmapsto{\mathbf{S}_{c}}\mathbf{R}_{c}$), such that the knowledge can be implicitly learned and injected to the model along the path. In this regard, we propose to implicitly learn the domain-coherent principle to foster model generalization by exploring the biological correspondences via counterfactual nuclei masking and restoration.

A schematic illustration of the proposed method is shown in Fig. 4. Specifically, we propose multifaceted correspondence self-discovery and instance-adaptive dynamic distillation, which aim to capture the inherent nuclei-tissue and nuclei-nuclei relationships and exploit relational contexts within nuclear clusters to rectify model predictions dominated by the characteristics of individual nucleus, respectively. Those surrogate tasks share a backbone network with the primary nuclei recognition branch for seamless transfer of complementary knowledge.

3.2 Multifaceted Correspondence Discovery

Correspondence across Nucleus and Tissue.  As illustrated in Fig. 2, capturing the underlying correlations across nuclei and tissue structures offers discriminative contextual information and could benefit the precise identification of nuclear subtypes in spite of their ambiguous visual attributes. Those intrinsic biological correspondences are rooted in the fundamental composition mechanism of pathology data and thus inherit stronger robustness against domain shifts compared with vanilla pixel or object-level semantics. To this end, we devise a self-regulated surrogate task to exploit such domain-robust correspondences, by first masking nuclei pixels and then restoring the concealed biological contexts. In the restoration step, the only information available is the characteristics of background tissue structures, as nuclear properties have been covered up. On that account, the task becomes predicting the attributes of nuclei according to their surrounding tissue structures, which provides an elegant solution to learn the relationships between nuclei and tissue.

Figure 4: Overview of the proposed approach. We aim to learn the implicit correspondences across various biological structures via self-regulated surrogate tasks. Specifically, we first perform nuclei masking and then learn to restore the obscured contextual details based on the characteristics of tissue and neighbouring nuclei. For correspondence discovery within nuclear cluster, the dotted bounding box and features indicate the location and mask token of the masked nucleus.

Specifically, given an image $\mathbf{I}$ and its nuclei binary mask $\mathbf{\hat{M}}$, we mask the inner details of nuclei regions by replacing all nuclei pixel values with their average value to perform counterfactual intervention: $\mathbf{\hat{I}_{mask}}=\mathbf{I}\odot(\mathbbm{1}-\mathbf{\hat{M}})+\mathtt{avg}(\mathbf{\hat{M}}\odot\mathbf{I})\odot\mathbf{\hat{M}}$, where $\odot$ denotes the element-wise product operation. Then, we restore the pixel values of masked nuclei based on the surrounding tissue characteristics and accordingly model the nuclei-tissue relationships. Considering that the functional types of nuclei are mainly dependent on the locally surrounding tissue structures [33], we construct the pixel-wise restoration decoder with convolutional filters to leverage their strong focus on local contextual information. Concretely, the masked images $\mathbf{\hat{I}_{mask}}$ are forwarded to backbone $\bm{B}$ for encoding tissue structures, followed by the restoration block $\bm{\tilde{G}}_{\mathbf{INC}}$ to re-fill those masked regions per pixel. The final restoration results can be represented with $\mathbf{\hat{I}_{rec}}=\bm{\tilde{G}}_{\mathbf{INC}}(\bm{B}(\mathbf{\hat{I}_{mask}}))+\mathbf{\hat{I}_{mask}}$. As the optimization target of the surrogate restoration task, we impose two training objectives towards disparate yet complementary regularization endpoints:

$$\mathcal{L}_{\rm{TCD}}=\mathbb{E}_{(\mathbf{I},\mathbf{\hat{I}_{rec}})\,\sim\mathcal{X}_{s/t}}[\mathcal{H}(\mathbf{I},\mathbf{\hat{I}_{rec}})+\mathcal{L}_{\rm{perpt}}(\mathbf{I},\mathbf{\hat{I}_{rec}};\mathbf{\check{D}})]. \tag{1}$$

Here $\mathcal{X}_{s/t}$ denote the data distributions of source and target domains. The first term aims to ensure the pixel-level context coherence after restoration with a matching loss term $\mathcal{H}$. The other perceptual regularization term enforces the generated image to not only locally approach the raw image but also exhibit harmonization and consistency from the global perspective. An adversarial discriminator $\mathbf{\check{D}}$ is trained together with the restoration network for estimating perceptual disparity and deriving $\mathcal{L}_{\rm{perpt}}$ [10].
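
The counterfactual masking step that produces the masked image from an image and its binary nuclei mask can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_nuclei` is hypothetical, and the sketch assumes that the average is taken per colour channel over all nuclei pixels in the tile, which the text does not state explicitly.

```python
import numpy as np

def mask_nuclei(image, nuclei_mask):
    """Replace nuclei pixels with their mean value; keep tissue pixels unchanged.

    image: float array of shape (H, W, C).
    nuclei_mask: binary array of shape (H, W), 1 inside nuclei.
    """
    m = nuclei_mask.astype(image.dtype)[..., None]    # (H, W, 1), broadcast over channels
    n_pixels = max(float(m.sum()), 1.0)               # guard against an empty mask
    avg = (image * m).sum(axis=(0, 1)) / n_pixels     # assumed: per-channel mean over nuclei pixels
    return image * (1.0 - m) + avg * m                # I * (1 - M) + avg(M * I) * M

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                         # synthetic stand-in for a histology tile
nuclei_mask = np.zeros((8, 8), dtype=np.uint8)
nuclei_mask[2:5, 3:6] = 1                             # one square "nucleus"
masked = mask_nuclei(image, nuclei_mask)
```

After masking, only the tissue pixels carry image-specific detail, so a restoration network fed with `masked` has to infer nuclear appearance from the surrounding tissue.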

Correspondence within Nuclear Cluster.  Recently, several works [27, 1] have underlined the value of considering the community nature of cells, yet they disregard the particular value of this property under cross-domain scenarios: the correspondences within nuclei clusters are rooted in the pathogenesis of tumor [32] and could deliver implicit modeling of the invariant factor. For example, in all types of organs, epithelial cells tend to cluster together in a ring-like shape [9]. We thereupon devise an instance-level nuclei restoration task to explore the inter-nuclei correspondences. The aim is to predict the attributes of a masked nucleus according to its neighbouring nuclei, formulating a neat recipe to learn the implicit object adjacency relationships.

As indicated in Fig. 4, in each image tile $\mathbf{I}$, we first perform ROI align and thereafter mask the instance-wise feature of the central nucleus. Here, masking means that we do not forward the feature of the masked nucleus to the following network. We denote the instance-wise feature maps for the masked nucleus and the neighbouring nuclei as $\mathbf{F_{mask}}$ and $\{\mathbf{F_{nbr}}^{i}\}_{i=1}^{N-1}$, respectively, where $N$ is the total number of nuclei proposals. The restoration scheme intends to retrieve the masked feature maps via contextual modeling of the neighboring nuclear community. It is essential to capture the long-range dependencies and spatial relationships among the group of adjacent nuclei so that the attributes of the masked nucleus can be reasonably restored. In this regard, we develop a ViT [6]-based framework where the high-level correlations across nuclei are exploited with the self-attention mechanism. The ROI features of unmasked nuclei are first passed through a ViT encoder $\bm{Z}_{\mathbf{FNR}}$. A token $\mathbf{TK}$ is then concatenated with the encoded unmasked nuclei to represent the nucleus that is masked and required to be reconstructed. Positional embeddings $\mathbf{PE}$ are added to take the spatial information of each nucleus into consideration. Subsequently, those representations are forwarded to the feature reconstruction head $\bm{\tilde{G}}_{\mathbf{FNC}}$, which is also constructed with a series of transformer blocks, to predict the feature values of the masked nucleus:

$$\{\mathbf{\hat{F}_{rec}}^{i}\}_{i=1}^{N}=\bm{\tilde{G}}_{\mathbf{FNC}}(\mathtt{concat}[\bm{Z}_{\mathbf{FNR}}(\{\mathbf{F_{nbr}}^{i}\}_{i=1}^{N-1}),\,\mathbf{TK}]+\mathbf{PE}). \tag{2}$$

Then, the restored map for the masked nucleus $\mathbf{\hat{F}_{rec}^{mask}}=\{\mathbf{\hat{F}_{rec}}^{i}\}_{i=1}^{N}[N]$. The training objective of the surrogate task for inter-nuclei correspondence discovery is to restore the original feature maps of the masked nucleus based on neighbouring nuclei characteristics. Therefore, we apply the matching loss $\mathcal{H}$ to penalize the inconsistency between raw feature maps and the restoration results:

$$\mathcal{L}_{\rm{NCD}}=\mathbb{E}_{(\mathbf{F}|\mathbf{I})\,\sim\mathcal{X}_{s/t}}[\mathcal{H}(\mathbf{F_{mask}},\mathbf{\hat{F}_{rec}^{mask}})]. \tag{3}$$
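
The data flow of the masked-nucleus restoration (append a mask token to the encoded neighbours, add positional embeddings, run a reconstruction head, read out the last position, and compare it with the held-out feature) can be sketched as below. This is only a shape-level illustration under several assumptions: features are flattened to vectors, the reconstruction head is replaced by a single projection-free self-attention layer, the matching loss is taken to be an L1 distance, and all function names are hypothetical.

```python
import numpy as np

def assemble_tokens(encoded_neighbours, mask_token, positional_embeddings):
    """concat[encoded neighbours, TK] + PE; the mask token becomes the last (N-th) entry."""
    tokens = np.concatenate([encoded_neighbours, mask_token[None, :]], axis=0)  # (N, D)
    return tokens + positional_embeddings

def self_attention(tokens):
    """Single-head self-attention without learned projections (stand-in for the head)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens                                 # (N, D)

def l1_matching_loss(target, prediction):
    """Assumed form of the matching loss: mean absolute difference."""
    return float(np.abs(target - prediction).mean())

rng = np.random.default_rng(0)
N, D = 6, 16
encoded_neighbours = rng.normal(size=(N - 1, D))            # stands in for the encoder output
mask_token = np.zeros(D)                                    # TK; learnable in a real model
positional_embeddings = rng.normal(scale=0.1, size=(N, D))  # PE
f_mask = rng.normal(size=D)                                 # held-out feature of the masked nucleus

restored = self_attention(
    assemble_tokens(encoded_neighbours, mask_token, positional_embeddings)
)
f_rec_mask = restored[-1]                                   # the [N] selection: last position
loss_ncd = l1_matching_loss(f_mask, f_rec_mask)
```

In a trained model the encoder, the mask token, and the head would be learned jointly so that this loss decreases; here the value is only meaningful as a shape and data-flow check.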

3.3 Self-adaptive Dynamic Distillation

In a standard object recognition pipeline, identification of instance type is dominated by the appearance and texture attributes of each individual object, which are vulnerable to the data distribution biases incurred by variations in imaging protocol and staining procedure [16, 44]. To further leverage the domain-invariant contextual information and spatial relationships of nuclear clusters, we propose to take advantage of the transformer's capability to capture high-level implicit correspondences among the group of input nuclei feature representations [13] and provide adaptive guidance.

Specifically, we reuse the ViT previously deployed to characterize and embed the inter-nuclei correlations with an appended classification head. Here, the transformer-based branch operates in parallel with the basic convolution-based one. Given that the two architectures focus on different hierarchies of contextual information (i.e., low-level instance-wise attributes and high-level inter-object correspondences, respectively), performing mutual distillation could strengthen the overall results to transcend object semantics-dominated prediction. Then, we propose to adaptively adjust the trade-offs across two branches for each nucleus according to instance-wise ambiguity. First, we estimate the uncertainty of model inference with predictive entropy, which measures the quantity of information included in the model’s predictive density function [30]. With the input pairs of neural network denoted as ($\mathbf{x}^{*},\mathbf{y}^{*}$) and its weights denoted by $\mathbf{W}$, the approximate predictive distribution is parameterized as:

$$q(\mathbf{y}^{*}|\mathbf{x}^{*})=\int p(\mathbf{y}^{*}|\mathbf{x}^{*},\mathbf{W})q(\mathbf{W})\mathbf{dW}, \tag{4}$$

where $p(\mathbf{y}^{*}|\mathbf{x}^{*},\mathbf{W})$ represents the predictive distribution, $q(\mathbf{W})$ denotes the approximate variational distribution of $\mathbf{W}$. The Monte Carlo estimate $\hat{y}$ can be thereafter derived:

$$\hat{y}=\mathbb{E}_{q(\mathbf{y}^{*}|\mathbf{x}^{*})}(\mathbf{y}^{*})\approx\frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}^{*}}(\mathbf{x}^{*},\mathbf{W}^{t}), \tag{5}$$

where $\hat{\mathbf{y}^{*}}$ corresponds to the label predictions and $T$ is the number of stochastic forward passes. Then, the final predictive entropy is obtained by aggregating entropy over all classes: $\mathbf{U}=-\sum_{c=1}^{C}\mathbb{P}(\hat{y}=c)\log\mathbb{P}(\hat{y}=c)$. Here $C$ denotes the number of classes and $\mathbb{P}(\hat{y}=c)$ denotes the probability of $\hat{y}$ belonging to class $c$. For nuclei recognition, we find that (see Fig. 1) the cross-domain degradation for connective and inflammatory cells, which are sparsely distributed and possess distinct morphological attributes [9], is relatively moderate. This indicates that the proposals conditioned on object semantics could be more reliable for a nucleus that is distant from other nuclei and possesses unambiguous individual characteristics. We thereby consider $\mathbf{U}^{M}/\mathbf{U}^{T}$ as the trade-off factor over the convolution- and transformer-based classification modules, with different model tendency assigned to nuclei under divergent spatial distributions. The trade-offs are thereafter adopted to regulate the overall loss function to deliver self-adaptive dynamic guidance:

$$\mathcal{L}_{\rm{SDD}}=\mathbb{E}_{\mathbf{I}\,\sim\mathcal{X}_{s}/\mathcal{X}_{t}}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{U}_{i}^{M}}{\mathbf{U}_{i}^{T}}\lVert\mathcal{S}_{i}^{M}-\mathcal{S}_{i}^{T}\rVert;\mathbf{I}\right], \tag{6}$$

where $N$ is the total number of nuclei in image $\mathbf{I}$, and $\mathcal{S}^{M}$ and $\mathcal{S}^{T}$ are the classification scores of the two model constituents.
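As a minimal sketch of Eqs. (4) to (6), the following pure-Python snippet averages the stochastic forward passes, computes the predictive entropy of the averaged distribution, and weights the per-nucleus discrepancy between the two branches by the entropy ratio. Several details are assumptions made for illustration and are not confirmed by the text: the function names, the use of the Monte Carlo averaged probabilities as the classification scores $\mathcal{S}$, the $L_1$ norm for $\lVert\cdot\rVert$, the small `eps` guarding the division, and the reading of the superscripts $M$ and $T$ as the convolution- and transformer-based branches.

```python
import math


def mc_mean(prob_samples):
    """Monte Carlo estimate (Eq. 5): average class probabilities over T passes.

    prob_samples: list of T lists, each holding C class probabilities.
    """
    num_passes = len(prob_samples)
    num_classes = len(prob_samples[0])
    return [sum(p[c] for p in prob_samples) / num_passes for c in range(num_classes)]


def predictive_entropy(probs):
    """U = -sum_c P(c) log P(c); zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sdd_loss(conv_samples, vit_samples, eps=1e-8):
    """Sketch of Eq. 6 for a single image.

    conv_samples[i] / vit_samples[i]: T stochastic probability vectors for
    nucleus i from the convolution- and transformer-based branch.
    """
    total = 0.0
    for s_m, s_t in zip(conv_samples, vit_samples):
        mean_m, mean_t = mc_mean(s_m), mc_mean(s_t)
        u_m = predictive_entropy(mean_m)
        u_t = predictive_entropy(mean_t)
        weight = u_m / (u_t + eps)  # eps is an added assumption, not in Eq. 6
        # L1 distance between the branch scores (the norm type is assumed)
        dist = sum(abs(a - b) for a, b in zip(mean_m, mean_t))
        total += weight * dist
    return total / len(conv_samples)


# Hypothetical single-nucleus example with T = 1 pass per branch.
conv = [[[0.5, 0.5]]]   # convolutional branch is maximally uncertain
vit = [[[0.9, 0.1]]]    # transformer branch is confident
loss = sdd_loss(conv, vit)
```

Under these assumptions, a nucleus on which the convolutional branch is uncertain while the transformer branch is confident receives a weight above 1, so its discrepancy term contributes more to the loss than it would with the roles of the two branches swapped.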

Table 1: Comparison results of our proposed method against other state-of-the-art methods for nuclei classification under three cross-organ settings. X$\rightarrow$Y denotes that the model is trained on data acquired from organ X and then evaluated on samples from organ Y. $F$ scores for each class and the class-averaged overall score are reported. Neo., Epi., Con., and Inf. denote neoplastic, epithelial, connective, and inflammatory cells, respectively. Best and second best results are highlighted in bold and underlined, respectively.

| Methods | Breast$\rightarrow$Testis ($F$ score) | | | | | Breast$\rightarrow$Thyroid ($F$ score) | | | | | Breast$\rightarrow$Bile-duct ($F$ score) | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Neo. | Epi. | Con. | Inf. | Avg. | Neo. | Epi. | Con. | Inf. | Avg. | Neo. | Epi. | Con. | Inf. | Avg. |
| Source-only | 0.428 | 0.070 | 0.529 | 0.607 | 0.409 | 0.311 | 0.036 | 0.445 | 0.368 | 0.290 | 0.553 | 0.000 | 0.498 | 0.522 | 0.393 |
| DA-RCNN [4] | 0.527 | 0.357 | 0.579 | 0.460 | 0.481 | 0.224 | 0.303 | 0.381 | 0.417 | 0.331 | 0.548 | 0.024 | 0.466 | 0.568 | 0.401 |
| PSA [45] | 0.576 | 0.256 | 0.580 | 0.254 | 0.417 | 0.386 | 0.284 | 0.392 | 0.401 | 0.366 | 0.566 | 0.000 | 0.454 | 0.462 | 0.371 |
| MGA [46] | 0.540 | 0.302 | 0.556 | 0.358 | 0.439 | 0.306 | 0.116 | 0.448 | 0.413 | 0.321 | 0.535 | 0.000 | 0.473 | 0.526 | 0.384 |
| PT-MAF [17] | 0.452 | 0.000 | 0.547 | 0.554 | 0.388 | 0.354 | 0.000 | 0.410 | 0.351 | 0.279 | 0.516 | 0.000 | 0.460 | 0.446 | 0.355 |
| HT [5] | 0.518 | 0.373 | 0.585 | 0.579 | 0.514 | 0.391 | 0.264 | 0.401 | 0.330 | 0.347 | 0.572 | 0.056 | 0.501 | 0.595 | 0.431 |
| BAFA [41] | 0.535 | 0.215 | 0.510 | 0.572 | 0.458 | 0.293 | 0.228 | 0.477 | 0.434 | 0.358 | 0.558 | 0.049 | 0.465 | 0.590 | 0.416 |
| CAPL-Net [24] | 0.551 | 0.252 | 0.602 | 0.364 | 0.442 | 0.401 | 0.172 | 0.403 | 0.389 | 0.341 | 0.522 | 0.000 | 0.481 | 0.576 | 0.394 |
| Ours | 0.596 | 0.521 | 0.594 | 0.645 | 0.589 | 0.460 | 0.359 | 0.482 | 0.452 | 0.438 | 0.627 | 0.203 | 0.509 | 0.615 | 0.488 |

Training Pipeline.  The overall optimization objective is to minimize the aggregated loss: $\mathcal{L}_{\rm{total}}=\mathcal{L}_{\rm{rec}}+\lambda^{*}(\mathcal{L}_{\rm{TCD}}+\mathcal{L}_{\rm{NCD}}+\mathcal{L}_{\rm{SDD}})$, where $\mathcal{L}_{\rm{rec}}$ is the base loss of the recognition model on the source domain, and $\lambda^{*}$ controls the scaling factors of surrogate tasks. The supplementary loss terms are aggregated from both the source and target domains.
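The aggregation can be written out as a short sketch. Since the sensitivity analysis in Section 4.3 treats $\lambda^{*}$ as a set of three per-task factors, the snippet below takes one weight per surrogate loss; the function name, argument names, and the default weights of 1.0 are placeholders assumed for illustration, not values reported in the paper.

```python
def total_loss(l_rec, l_tcd, l_ncd, l_sdd, lambdas=(1.0, 1.0, 1.0)):
    """Combine the base recognition loss with the three surrogate losses.

    lambdas: (weight for TCD, weight for NCD, weight for SDD); the defaults
    are illustrative placeholders, not the paper's settings.
    """
    lam_tcd, lam_ncd, lam_sdd = lambdas
    return l_rec + lam_tcd * l_tcd + lam_ncd * l_ncd + lam_sdd * l_sdd


# Hypothetical scalar loss values for a single training step.
step_loss = total_loss(l_rec=1.0, l_tcd=0.5, l_ncd=0.25, l_sdd=0.25)  # -> 2.0
```

Passing the same value three times recovers the shared-factor form of the equation above.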

4 Experiments and Results

4.1 Experimental Setup

Datasets.  To verify the effectiveness and general applicability of the proposed method, we perform extensive experiments under various domain shifts incurred by the discrepancy across organs and stains. For cross-organ adaptation, we leverage four datasets sampled from different organs, i.e., breast, testis, thyroid, and bile-duct, which are retrieved from The Cancer Genome Atlas (TCGA) and [9]. Nuclei are categorized as neoplastic, epithelial, connective, or inflammatory cells. The statistics of the data used are presented in Table 2. We further evaluate our method on scenarios where both organ and stain shifts are present. Following previous work [41], we adopt CoNSep [12] and PanNuke [9] as the source and target domains. CoNSep contains histology tiles from one single organ type (colon), whilst PanNuke is collected from 19 different organs. The staining procedures for those datasets are also inconsistent due to different clinical purposes and regulatory requirements across cohorts and countries. The neoplastic and epithelial classes in PanNuke are merged into one class for label space coherence.
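The class merging described above amounts to a simple label remapping, sketched below. The label strings and the merged name `neo_epi` are hypothetical identifiers chosen for illustration; the actual encoding used in the datasets or the released code may differ.

```python
# Hypothetical label names; the merged class corresponds to "Neo-Epi." in Table 4.
MERGE_MAP = {
    "neoplastic": "neo_epi",
    "epithelial": "neo_epi",
    "connective": "connective",
    "inflammatory": "inflammatory",
}


def harmonize_labels(labels):
    """Map per-nucleus labels into the shared three-class label space."""
    return [MERGE_MAP[label] for label in labels]


merged = harmonize_labels(["neoplastic", "epithelial", "connective", "inflammatory"])
```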

Table 2: Class-wise statistics of nuclei in the used histology image data across different organs.

| Organ Type | Neoplastic | Epithelial | Connective | Inflammatory | Total |
| --- | --- | --- | --- | --- | --- |
| Breast | 162,780 | 109,758 | 91,053 | 51,597 | 415,188 |
| Testis | 12,021 | 6,252 | 10,845 | 11,160 | 40,278 |
| Thyroid | 10,152 | 13,692 | 12,828 | 5,280 | 41,952 |
| Bile-duct | 26,460 | 2,352 | 23,316 | 13,995 | 66,123 |

Table 3: Comparison results on cross-organ nuclei instance segmentation. $PQ$ metrics over each class and averaged score are reported.

| Methods | Breast$\rightarrow$Testis ($PQ$ score) | | | | | Breast$\rightarrow$Thyroid ($PQ$ score) | | | | | Breast$\rightarrow$Bile-duct ($PQ$ score) | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Neo. | Epi. | Con. | Inf. | Avg. | Neo. | Epi. | Con. | Inf. | Avg. | Neo. | Epi. | Con. | Inf. | Avg. |
| Source-only | 0.207 | 0.029 | 0.344 | 0.309 | 0.222 | 0.144 | 0.032 | 0.320 | 0.196 | 0.173 | 0.295 | 0.000 | 0.288 | 0.299 | 0.220 |
| DA-RCNN [4] | 0.235 | 0.179 | 0.323 | 0.340 | 0.269 | 0.095 | 0.138 | 0.237 | 0.271 | 0.185 | 0.282 | 0.011 | 0.273 | 0.270 | 0.209 |
| MGA [46] | 0.243 | 0.150 | 0.289 | 0.266 | 0.237 | 0.121 | 0.045 | 0.282 | 0.257 | 0.176 | 0.294 | 0.000 | 0.263 | 0.295 | 0.212 |
| HT [5] | 0.214 | 0.207 | 0.306 | 0.313 | 0.260 | 0.137 | 0.084 | 0.262 | 0.177 | 0.165 | 0.309 | 0.032 | 0.295 | 0.283 | 0.230 |
| BAFA [41] | 0.236 | 0.145 | 0.311 | 0.332 | 0.256 | 0.127 | 0.103 | 0.302 | 0.264 | 0.199 | 0.297 | 0.015 | 0.257 | 0.301 | 0.218 |
| Ours | 0.252 | 0.390 | 0.324 | 0.385 | 0.338 | 0.159 | 0.141 | 0.297 | 0.284 | 0.220 | 0.330 | 0.102 | 0.286 | 0.304 | 0.256 |

Table 4: Comparison results for nuclei recognition under both organ and stain shifts. Neo-Epi. denotes the united class for neoplastic and epithelial cells.

| Methods | CoNSep$\rightarrow$PanNuke ($F$ score) | | | | CoNSep$\rightarrow$PanNuke ($PQ$ score) | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Neo-Epi. | Con. | Inf. | Avg. | Neo-Epi. | Con. | Inf. | Avg. |
| Source-only | 0.285 | 0.278 | 0.356 | 0.306 | 0.216 | 0.130 | 0.149 | 0.165 |
| DA-RCNN [4] | 0.316 | 0.254 | 0.411 | 0.327 | 0.227 | 0.122 | 0.165 | 0.171 |
| MGA [46] | 0.370 | 0.272 | 0.359 | 0.334 | 0.240 | 0.117 | 0.158 | 0.172 |
| HT [5] | 0.348 | 0.251 | 0.366 | 0.322 | 0.195 | 0.102 | 0.144 | 0.147 |
| BAFA [41] | 0.261 | 0.280 | 0.403 | 0.314 | 0.183 | 0.120 | 0.174 | 0.159 |
| CAPL-Net [24] | 0.335 | 0.264 | 0.332 | 0.310 | 0.231 | 0.113 | 0.155 | 0.167 |
| Ours | 0.404 | 0.300 | 0.401 | 0.368 | 0.258 | 0.149 | 0.166 | 0.191 |

| Methods | PanNuke$\rightarrow$CoNSep ($F$ score) | | | | PanNuke$\rightarrow$CoNSep ($PQ$ score) | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Neo-Epi. | Con. | Inf. | Avg. | Neo-Epi. | Con. | Inf. | Avg. |
| Source-only | 0.796 | 0.639 | 0.590 | 0.675 | 0.305 | 0.213 | 0.312 | 0.276 |
| DA-RCNN [4] | 0.819 | 0.588 | 0.603 | 0.669 | 0.289 | 0.174 | 0.335 | 0.265 |
| MGA [46] | 0.774 | 0.609 | 0.602 | 0.661 | 0.266 | 0.185 | 0.302 | 0.251 |
| HT [5] | 0.830 | 0.647 | 0.611 | 0.696 | 0.301 | 0.195 | 0.318 | 0.271 |
| BAFA [41] | 0.793 | 0.665 | 0.626 | 0.695 | 0.298 | 0.207 | 0.349 | 0.285 |
| CAPL-Net [24] | 0.768 | 0.574 | 0.606 | 0.649 | 0.273 | 0.178 | 0.320 | 0.257 |
| Ours | 0.856 | 0.696 | 0.618 | 0.723 | 0.356 | 0.252 | 0.352 | 0.320 |

Implementation Details and Evaluation Metrics.  Following previous works [19], we adopt Mask R-CNN [16] as the base model. The matching loss $\mathcal{H}$ between a pair of inputs is implemented with an $L_{1}$ regularization term. For nuclei masking on the target domain, we utilize the box proposals and masks generated with a model trained on the source domain, since Fig. 1 shows that deep models are robust to domain shifts for class-agnostic segmentation.

For evaluation on the classification task, we follow previous works [12] and adopt the $F$ score to measure the performance of nuclei classification: $F=\frac{TP}{TP+FP+FN}$. Compared with the standard $F_{1}$ score, more weight is assigned to FP and FN to place emphasis on false classification results. For instance segmentation, we use the panoptic quality (PQ) score [9] for quantitative evaluation. Both $F$ and $PQ$ scores are computed for each class and then averaged to demonstrate the overall performance.
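The difference from the standard $F_{1}$ score, which can be written as $F_{1}=\frac{2TP}{2TP+FP+FN}$, can be made concrete with a short sketch. The function names and the counts in the example are hypothetical, and returning 0.0 for an empty denominator is an assumption rather than a convention stated in the paper. The last lines reproduce the class averaging using the "Ours" row of Table 1 for the Breast$\rightarrow$Testis setting.

```python
def f_score(tp, fp, fn):
    """F = TP / (TP + FP + FN), as defined in the text."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0  # zero-denominator handling is assumed


def f1_score(tp, fp, fn):
    """Standard F1 = 2TP / (2TP + FP + FN), shown for comparison."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def class_average(scores):
    """Unweighted mean of per-class scores."""
    return sum(scores) / len(scores)


# Hypothetical counts for one class: the same errors cost more under F than F1.
print(f_score(60, 20, 20))   # -> 0.6
print(f1_score(60, 20, 20))  # -> 0.75

# Per-class F scores of "Ours" on Breast->Testis (Neo., Epi., Con., Inf.).
print(round(class_average([0.596, 0.521, 0.594, 0.645]), 3))  # -> 0.589
```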

4.2 Comparison with State-of-the-Art Methods

We compare our proposed method against the state-of-the-art UDA object recognition methods, including DA-RCNN [4], PSA [45], MGA [46], PT-MAF [17], HT [5], BAFA [41], and CAPL-Net [24], to justify its effectiveness. The results obtained when the source domain-trained model is directly evaluated on the target domain (Source-only) are also presented for reference. For fair comparisons, those methods are implemented with the same backbone architecture and training settings (batch size, learning rate, etc.) as ours. We report the quantitative comparison results for each type of nuclei and the averaged value over all classes under a 3-fold cross-validation setting. Paired t-tests are also conducted between our method and the others on class-averaged overall scores. The resulting p-values for all tests are below 0.05, indicating that the proposed method significantly outperforms the compared approaches.

In Table 1, we present the results for the classification task under cross-organ domain shifts. Our method achieves significant improvements in terms of class-averaged scores in all three adaptation scenarios. The advancement can be attributed to our proposed method's capacity to identify nuclei with ambiguous semantics and to bypass the performance degradation incurred by inaccurate category pseudo-labels. The results are further discussed in Section 4.4. In Table 3, the quantitative performance of category-wise instance segmentation is presented. With the attained high-quality nuclei type identification results, our method concurrently yields appealing accuracy regarding class-wise and class-averaged PQ. Additionally, as depicted in Fig. 5, most competing methods fail to recognize epithelial cells and undesirably categorize them as neoplastic ones. In contrast, our proposed method successfully distinguishes the two types of nuclei despite their very similar visual attributes.

To further verify the generalizability of our method, we evaluate its efficacy under both organ and stain shifts. Following [41], we adopt CoNSep and PanNuke to construct the adaptation benchmark and perform bi-directional experiments. We adopt the evaluation metrics in [41] to jointly consider the performance of nuclei detection and classification tasks. PQ scores are also presented for segmentation quality evaluation. As shown in Table 4, we compare our method with the state-of-the-art UDA object recognition approaches. The empirical improvements of our method in this more complex cross-domain setting are consistent with previous experiments and observations, which substantiates the effectiveness and robustness of our method against various types of domain shifts in histology data.

Figure 5: Qualitative comparison of nuclei recognition results. Images in the top two rows are from the testis, whereas images in the third and fourth rows are from the thyroid and bile-duct, respectively. In each sub-figure, red, yellow, blue, and green contours correspond to the nuclei of neoplastic, epithelial, connective, and inflammatory cells, respectively.

4.3 Ablation Study

To validate the efficacy of key components in the proposed method, we perform ablation studies on the classification and category-wise instance segmentation tasks by evaluating several variants of the method. The corresponding quantitative comparison results are reported in Table 5, where MD denotes the mutual distillation across architectures and SA is the estimated trade-offs for self-adaptive dynamic distillation. All the components have positive impacts on the overall classification accuracy. Specifically, solely employing TCD or the combination of NCD and MD already leads to competitive results, which justifies the importance of exploring the implicit biological correspondences for cross-domain nuclei recognition. By integrating those constituents together, we reach peak performance. Moreover, the employment of instance-adaptive guidance further boosts the $F$ score by around $3\%$. For instance segmentation, we observe that NCD tends to have a negative impact on the overall accuracy. The reason could be that NCD uses a global pooling strategy, which inevitably discards fine-grained spatial information. In contrast, TCD exhibits beneficial effects on segmentation. This stems from the design that, when performing image-level nuclei masking, we keep all the spatial details and introduce reliable masks, which subsequently serve as guidance for segmentation.

Table 5: Ablation study to verify the efficacy of key components in our method. The class-averaged overall $F$ and $PQ$ scores are presented for each case. Tes., Thyr., and Bile. are abbreviations for testis, thyroid, and bile-duct, respectively. $\checkmark$ marks indicate the utilized modules. The best results are highlighted in bold.

| Settings (TCD, NCD, MD, SA) | Tes. ($F$) | Tes. ($PQ$) | Thyr. ($F$) | Thyr. ($PQ$) | Bile. ($F$) | Bile. ($PQ$) |
| --- | --- | --- | --- | --- | --- | --- |
| | 0.409 | 0.222 | 0.290 | 0.173 | 0.393 | 0.220 |
| $\checkmark$ | 0.446 | 0.232 | 0.326 | 0.189 | 0.427 | 0.227 |
| $\checkmark$ $\checkmark$ | 0.472 | 0.240 | 0.346 | 0.180 | 0.431 | 0.214 |
| $\checkmark$ $\checkmark$ | 0.508 | 0.256 | 0.382 | 0.197 | 0.465 | 0.234 |
| $\checkmark$ $\checkmark$ $\checkmark$ | 0.557 | 0.294 | 0.401 | 0.215 | 0.473 | 0.250 |
| $\checkmark$ $\checkmark$ $\checkmark$ | 0.554 | 0.327 | 0.417 | 0.231 | 0.448 | 0.247 |
| $\checkmark$ $\checkmark$ $\checkmark$ $\checkmark$ | 0.589 | 0.338 | 0.438 | 0.220 | 0.488 | 0.256 |

To investigate the beneficial impact of our proposed method in greater detail, we perform sensitivity analysis on the set of loss weighting terms $\lambda^{*}=\{\lambda_{1},\lambda_{2},\lambda_{3}\}$, which correspond to the scaling factors of tissue correspondence discovery, nuclear correspondence discovery, and self-adaptive dynamic distillation, respectively. The results are depicted in Fig. 6.

Figure 6: Sensitivity analysis of loss weighting items on the Breast$\rightarrow$Testis setting.

For each weighting item, we scale it by a factor of 0.1, 0.3, 1.0, 3.0, and 10.0, respectively, and keep the other items fixed. The corresponding quantitative evaluation results for classification and category-aware instance segmentation are reported. All the experiments are conducted on the Breast$\rightarrow$Testis setting. It can be observed that our main setting (i.e., when all scale factors are equal to 1.0) achieves superior performance compared with most adjusted settings. In addition, decreasing the influence of $\lambda_{3}$ leads to a noticeable performance drop for both tasks. This finding is consistent with the results of the ablation studies and substantiates the importance of the proposed instance-adaptive dynamic distillation strategy. On the other hand, increasing the influence of $\lambda_{2}$ has detrimental effects on instance segmentation, which is mainly attributed to the introduced global pooling strategy.
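The one-at-a-time protocol described above can be sketched as a small configuration generator. The base weights of 1.0 and the key names are placeholders assumed for illustration; the base values of the weighting items are not stated in this section.

```python
SCALES = (0.1, 0.3, 1.0, 3.0, 10.0)


def sweep_configs(base_weights, scales=SCALES):
    """Return (name, scale, weights) triples, scaling one item at a time."""
    configs = []
    for name in base_weights:
        for scale in scales:
            weights = dict(base_weights)  # copy so other items stay fixed
            weights[name] = base_weights[name] * scale
            configs.append((name, scale, weights))
    return configs


# Placeholder base weights for TCD, NCD, and SDD (hypothetical values).
base = {"lambda_1": 1.0, "lambda_2": 1.0, "lambda_3": 1.0}
configs = sweep_configs(base)  # 3 items x 5 scales = 15 settings
```

Each returned setting differs from the base configuration in at most one weight, which keeps the effect of each surrogate task separable when the resulting scores are compared.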

4.4 Discussions

Identification of Nuclei with Ambiguous Semantics.  Regarding the results in Table 1, the advancements of our method can be mainly attributed to its capability for precisely identifying neoplastic and epithelial cells.

Figure 7: Examples of H&E-stained histology image regions. In each sub-figure, blue and green rectangles correspond to the nuclei of connective and inflammatory cells, respectively.

Unlike connective and inflammatory cells, which possess distinct individual shape and texture characteristics (i.e., the connective cell typically exhibits a flat polygon pattern in terms of geometric shape and the inflammatory cell is much darker than others in color space, as showcased in Fig. 7), neoplastic and epithelial cells are ambiguous and hard to tell apart for an object semantics-conditioned model under cross-domain scenarios. It is challenging to distinguish those two types of cells solely based on their appearance attributes. Consequently, existing methods [4, 45] that perform UDA with object-wise alignment struggle to find the decision boundary that separates neoplastic and epithelial cells. In contrast, by exploiting the informative correlations across biological structures, which demonstrate stronger visual contrast, our proposed method significantly improves the ability to distinguish those two types of cells with analogous morphological traits.

Sidestepping Reliance on Biased Pseudo-labels.  Methods built upon category pseudo-labels [45, 41, 46] bring relatively limited empirical gains for cross-domain nuclei recognition. For example, in the Breast$\rightarrow$Testis setting, the overall average $F$ scores of those methods are exceeded by those of methods that do not depend on category pseudo-labels by almost $5\%$. The degradation stems from the successive error accumulation caused by biased category pseudo-labels, which is inevitable considering the drastic model collapse across sampling organs and staining protocols. In this regard, with specifically devised surrogate tasks as bridges for model transfer across domains, our proposed method dispenses with the self-training scheme and the reliance on category pseudo-labels, which contributes to the remarkable improvements.

Statistical Analysis.  The key advancement of our method, as previously discussed, is its capability to distinguish nuclei with ambiguous morphological characteristics. We therefore conduct detailed statistical analysis on the nuclear classification results regarding epithelial cells. The resulting box plots and p-values shown in Fig. 8 substantiate the statistical significance of our achieved improvements.

Figure 8: Statistical analysis on nuclear type recognition results for morphologically ambiguous epithelial cells with paired t-tests.

5 Conclusion

In this work, we propose a holistic framework to facilitate cross-domain cellular nuclei recognition via exploitation of implicit biological relationships at image and instance feature levels. Additionally, we devise self-adaptive dynamic distillation to further leverage the rich relational contexts inherently present in nuclear communities with instance-aware trade-offs across model architectures. Experiments on several cross-domain settings with organ and stain shifts demonstrate that our method addresses the common issues existing in the state-of-the-art UDA object recognition approaches and achieves compelling performance. In future work, we will investigate more challenging yet practical domain adaptation scenarios in which the cross-domain shifts also affect class distributions.

References

  • [1] Abousamra, S., Belinsky, D., Van Arnam, J., Allard, F., Yee, E., Gupta, R., Kurc, T., Samaras, D., Saltz, J., Chen, C.: Multi-class cell detection using spatial context representation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4005–4014 (2021)
  • [2] Cao, S., Joshi, D., Gui, L.Y., Wang, Y.X.: Contrastive mean teacher for domain adaptive object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23839–23848 (2023)
  • [3] Chen, C., Dou, Q., Chen, H., Qin, J., Heng, P.A.: Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE Transactions on Medical Imaging 39(7), 2494–2505 (2020)
  • [4] Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Domain adaptive faster r-cnn for object detection in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3339–3348 (2018)
  • [5] Deng, J., Xu, D., Li, W., Duan, L.: Harmonious teacher for cross-domain object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23829–23838 (2023)
  • [6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021)
  • [7] Fan, J., Liu, D., Chang, H., Cai, W.: Learning to generalize over subpartitions for heterogeneity-aware domain adaptive nuclei segmentation. International Journal of Computer Vision pp. 1–24 (2024)
  • [8] Fan, J., Liu, D., Chang, H., Huang, H., Chen, M., Cai, W.: Taxonomy adaptive cross-domain adaptation in medical imaging via optimization trajectory distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21174–21184 (2023)
  • [9] Gamper, J., Koohbanani, N.A., Benet, K., Khuram, A., Rajpoot, N.: Pannuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification. In: European Congress on Digital Pathology. pp. 11–19. Springer (2019)
  • [10] Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning. pp. 1180–1189. PMLR (2015)
  • [11] Graham, S., Jahanifar, M., Azam, A., Nimir, M., Tsang, Y.W., Dodd, K., Hero, E., Sahota, H., Tank, A., Benes, K., et al.: Lizard: A large-scale dataset for colonic nuclear instance segmentation and classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 684–693 (2021)
  • [12] Graham, S., Vu, Q.D., Raza, S.E.A., Azam, A., Tsang, Y.W., Kwak, J.T., Rajpoot, N.: Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis 58, 101563 (2019)
  • [13] Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al.: A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), 87–110 (2022)
  • [14] Han, X., Qi, L., Yu, Q., Zhou, Z., Zheng, Y., Shi, Y., Gao, Y.: Deep symmetric adaptation network for cross-modality medical image segmentation. IEEE Transactions on Medical Imaging 41(1), 121–132 (2021)
  • [15] He, H., Wang, J., Wei, P., Xu, F., Ji, X., Liu, C., Chen, J.: Toposeg: Topology-aware nuclear instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21307–21316 (2023)
  • [16] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969 (2017)
  • [17] He, Z., Zhang, L., Gao, X., Zhang, D.: Multi-adversarial faster-rcnn with paradigm teacher for unrestricted object detection. International Journal of Computer Vision 131(3), 680–700 (2023)
  • [18] Hinton, G.: How to represent part-whole hierarchies in a neural network. Neural Computation pp. 1–40 (2022)
  • [19] Hsu, J., Chiu, W., Yeung, S.: Darcnn: Domain adaptive region-based convolutional neural network for unsupervised instance segmentation in biomedical images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1003–1012 (2021)
  • [20] Huang, J., Guan, D., Xiao, A., Lu, S.: Rda: Robust domain adaptation via fourier adversarial attacking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8988–8999 (2021)
  • [21] Huang, J., Li, H., Wan, X., Li, G.: Affine-consistent transformer for multi-class cell nuclei detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21384–21393 (2023)
  • [22] Kennerley, M., Wang, J.G., Veeravalli, B., Tan, R.T.: 2pcnet: Two-phase consistency training for day-to-night unsupervised domain adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11484–11493 (2023)
  • [23] Kumar, N., Verma, R., Anand, D., Zhou, Y., Onder, O.F., Tsougenis, E., Chen, H., Heng, P.A., Li, J., Hu, Z., et al.: A multi-organ nucleus segmentation challenge. IEEE Transactions on Medical Imaging 39(5), 1380–1391 (2019)
  • [24] Li, C., Liu, D., Li, H., Zhang, Z., Lu, G., Chang, X., Cai, W.: Domain adaptive nuclei instance segmentation and classification via category-aware feature alignment and pseudo-labelling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 715–724. Springer (2022)
  • [25] Li, W., Liu, X., Yuan, Y.: Sigma: Semantic-complete graph matching for domain adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5291–5300 (2022)
  • [26] Liu, D., Zhang, D., Song, Y., Zhang, F., O’Donnell, L., Huang, H., Chen, M., Cai, W.: Unsupervised instance segmentation in microscopy images via panoptic domain adaptation and task re-weighting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4243–4252 (2020)
  • [27] Liu, P., Bilgic, M.: Relational classification of biological cells in microscopy images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 344–352 (2021)
  • [28] Lu, C., Romo-Bucheli, D., Wang, X., Janowczyk, A., Ganesan, S., Gilmore, H., Rimm, D., Madabhushi, A.: Nuclear shape and orientation features from h&e images predict survival in early-stage estrogen receptor-positive breast cancers. Laboratory investigation 98(11), 1438–1448 (2018)
  • [29] Murez, Z., Kolouri, S., Kriegman, D., Ramamoorthi, R., Kim, K.: Image to image translation for domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4500–4509 (2018)
  • [30] Nair, T., Precup, D., Arnold, D.L., Arbel, T.: Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Medical image analysis 59, 101557 (2020)
  • [31] Peters, J., Janzing, D., Schölkopf, B.: Elements of causal inference: foundations and learning algorithms. The MIT Press (2017)
  • [32] Rendeiro, A.F., Ravichandran, H., Bram, Y., Chandar, V., Kim, J., Meydan, C., Park, J., Foox, J., Hether, T., Warren, S., et al.: The spatial landscape of lung pathology during covid-19 progression. Nature 593(7860), 564–569 (2021)
  • [33] Ryu, J., Puche, A.V., Shin, J., Park, S., Brattoli, B., Lee, J., Jung, W., Cho, S.I., Paeng, K., Ock, C.Y., et al.: Ocelot: Overlapped cell on tissue dataset for histopathology. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23902–23912 (2023)
  • [34] Saranrittichai, P., Mummadi, C.K., Blaiotta, C., Munoz, M., Fischer, V.: Overcoming shortcut learning in a target domain by generalizing basic visual factors from a source domain. In: European Conference on Computer Vision. pp. 294–309. Springer (2022)
  • [35] Shin, H., Kim, H., Kim, S., Jun, Y., Eo, T., Hwang, D.: Sdc-uda: Volumetric unsupervised domain adaptation framework for slice-direction continuous cross-modality medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7412–7421 (2023)
  • [36] Tyagi, A.K., Mohapatra, C., Das, P., Makharia, G., Mehra, L., AP, P., et al.: Degpr: Deep guided posterior regularization for multi-class cell detection and counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23913–23923 (2023)
  • [37] Verma, R., Kumar, N., Patil, A., Kurian, N.C., Rane, S., Graham, S., Vu, Q.D., Zwager, M., Raza, S.E.A., Rajpoot, N., et al.: Monusac2020: A multi-organ nuclei segmentation and classification challenge. IEEE Transactions on Medical Imaging 40(12), 3413–3423 (2021)
  • [38] Wu, F., Zhuang, X.: Unsupervised domain adaptation with variational approximation for cardiac segmentation. IEEE Transactions on Medical Imaging 40(12), 3555–3567 (2021)
  • [39] Wu, H., Wang, Z., Song, Y., Yang, L., Qin, J.: Cross-patch dense contrastive learning for semi-supervised segmentation of cellular nuclei in histopathologic images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11666–11675 (2022)
  • [40] Xing, F., Cornish, T.C., Bennett, T.D., Ghosh, D.: Bidirectional mapping-based domain adaptation for nucleus detection in cross-modality microscopy images. IEEE Transactions on Medical Imaging 40(10), 2880–2896 (2020)
  • [41] Yang, S., Zhang, J., Huang, J., Lovell, B.C., Han, X.: Minimizing labeling cost for nuclei instance segmentation and classification with cross-domain images and weak labels. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 697–705 (2021)
  • [42] Zhang, H., Zhang, Y.F., Liu, W., Weller, A., Schölkopf, B., Xing, E.P.: Towards principled disentanglement for domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8024–8034 (2022)
  • [43] Zhang, Y., Li, M., Li, R., Jia, K., Zhang, L.: Exact feature distribution matching for arbitrary style transfer and domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8035–8045 (2022)
  • [44] Zhao, Z.Q., Zheng, P., Xu, S.t., Wu, X.: Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30(11), 3212–3232 (2019)
  • [45] Zheng, Y., Huang, D., Liu, S., Wang, Y.: Cross-domain object detection through coarse-to-fine feature adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13766–13775 (2020)
  • [46] Zhou, W., Du, D., Zhang, L., Luo, T., Wu, Y.: Multi-granularity alignment domain adaptation for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9581–9590 (2022)
  • [47] Zhou, Z., Qi, L., Yang, X., Ni, D., Shi, Y.: Generalizable cross-modality medical image segmentation via style augmentation and dual normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20856–20865 (2022)