1 School of Electronic Information and Communications, Huazhong University of Science and Technology
{xyy_2001,wnn2000,hustlyu,xinyang2014,z_yan}@hust.edu.cn
2 School of Engineering, Hong Kong University of Science and Technology
timcheng@ust.hk

FedIA: Federated Medical Image Segmentation with Heterogeneous Annotation Completeness

Yangyang Xiang 1,⋆, Nannan Wu 1,⋆, Li Yu 1, Xin Yang 1, Kwang-Ting Cheng 2, Zengqiang Yan 1 (🖂)
⋆ Equal contribution.

Abstract

Federated learning has emerged as a compelling paradigm for medical image segmentation, particularly in light of increasing privacy concerns. However, most of the existing research relies on relatively stringent assumptions regarding the uniformity and completeness of annotations across clients. Contrary to this, this paper highlights a prevalent challenge in medical practice: incomplete annotations. Such annotations can introduce incorrectly labeled pixels, potentially undermining the performance of neural networks in supervised learning. To tackle this issue, we introduce a novel solution, named FedIA. Our insight is to conceptualize incomplete annotations as noisy data (i.e., low-quality data), with a focus on mitigating their adverse effects. We begin by evaluating the completeness of annotations at the client level using a designed indicator. Subsequently, we enhance the influence of clients with more comprehensive annotations and implement corrections for incomplete ones, thereby ensuring that models are trained on accurate data. Our method’s effectiveness is validated through its superior performance on two extensively used medical image segmentation datasets, outperforming existing solutions. The code is available at https://github.com/HUSTxyy/FedIA.

Keywords:
Federated learning · Incomplete annotation · Noisy label learning · Segmentation.

1 Introduction

Recent progress in federated learning (FL) [13] has facilitated the collaborative training of unified models across multiple decentralized entities in a privacy-preserving manner [2, 6, 19, 22]. In medical domains, FL has seen extensive application in training segmentation models for distinct lesions and organs [18, 8, 20]. Nevertheless, an essential limitation in current research is the insufficient consideration of the diversity in annotation completeness among clients.

This issue primarily stems from the varying annotation standards adopted by different clients. As depicted in Fig. 1, certain clients (e.g., client k) provide complete annotations for comprehensive diagnosis and analysis. Conversely, others (e.g., clients i and j) may possess incomplete annotations in which only some regions are marked to minimize labeling costs; such annotations are adequate only for basic image-level assessments (e.g., rapid screening).

Figure 1: Heterogeneity in annotation completeness among clients. Red and blue solid lines represent the boundaries of marked lesions and unmarked lesions, respectively.

Given this heterogeneity in annotation completeness, training a segmentation model via FL poses significant challenges. The inclusion of clients with incomplete annotations creates a situation where these clients are considered to be of lower quality since partial positive regions are mislabeled as background. Such imperfect annotations can negatively affect the overall performance of the model due to the memory effect of neural networks [12, 11]. To tackle this, in this paper, we focus on the important yet under-explored problem: How to pursue better FL under heterogeneity in annotation completeness?

Within the realm of FL, there has been some work focusing on data heterogeneity [4, 5, 7], but the heterogeneity in annotation completeness has been often overlooked. As for strategies to diminish the negative impact of clients with low-quality labels, these solutions predominantly focus on the classification task [23, 21], which is suboptimal when applied to the segmentation task. Although FedA3I [20] has recently addressed the heterogeneity in annotation quality specific to segmentation, its underlying assumption, where mislabeled pixels mainly distribute near objects’ boundaries, renders it ineffective against the challenge of incomplete annotations. Consequently, developing an effective approach to address this critical issue remains an area in need of further exploration and insight.

In this study, we tackle the pressing problem of heterogeneity in annotation completeness by introducing FedIA, an FL solution that is cognizant of and adaptively corrects for client annotation completeness. Our foundational insight is to perceive incomplete annotations as akin to noisy data. We commence by developing an early model robust against the noise associated with incomplete annotations, which then serves as a basis for evaluating each client’s level of annotation completeness. Subsequently, our aggregation process prioritizes clients with higher annotation completeness, and clients correct their incomplete annotations before using them to supervise local model updating. Our approach has been tested on two real-world medical datasets: a brain multiple sclerosis MRI dataset and a COVID-19 lesion CT dataset. Results show that FedIA outperforms other SOTA methods designed to address noisy/imperfect annotations.

The contributions of this paper are three-fold: (1) A new FL problem concentrating on heterogeneity in annotation completeness; (2) A novel solution named FedIA to tackle incomplete annotations; (3) Extensive evaluation to demonstrate the superiority of the proposed solution.

2 Methodology

2.1 Preliminaries and Overview

This paper focuses on a single-class multi-lesion segmentation problem in a federated scenario, where multi-lesion means that an image can contain multiple lesions, each forming a connected region. Given $K$ clients, each client $k$ possesses its private dataset $D_k=\{(x_i\in\mathcal{X}\subseteq\mathbb{R}^{\mathrm{H}\times\mathrm{W}\times\mathrm{C}},\ \tilde{y}_i\in\mathcal{Y}=\{0,1\}^{\mathrm{H}\times\mathrm{W}})\}_{i=1}^{n_k}$, where $n_k$ is the size of $D_k$ and $(x_i,\tilde{y}_i)$ represents the image-annotation pair characterized by dimensions: height $\mathrm{H}$, width $\mathrm{W}$, and channel $\mathrm{C}$. Contrary to an ideal situation, the annotations in our case are considered imperfect due to incompleteness, with not every lesion being marked.
The completeness ratio $a_k=c^{n}_{k,i}/c^{g}_{k,i}$ indicates the proportion of marked lesions to the total actual lesions within $D_k$; it remains identical among samples in $D_k$ but differs across clients.

Our objective is to devise a robust algorithm capable of diminishing the negative effects of incomplete annotations on the global model’s accuracy. The cornerstone of our approach involves deriving an initial model that is minimally affected by noise through the utilization of extensive noisy data, followed by assessing each client’s annotation completeness ratio based on this initial model. The strategy prioritizes learning from clients with higher completeness rates (i.e., higher-quality data), thereby enhancing the model’s performance. Furthermore, a mechanism is incorporated to correct incomplete annotations at a certain stage of the learning process, using a specially designed metric based on Intersection over Union (IoU). The overview of FedIA is illustrated in Fig. 2.

Figure 2: Overview of the proposed FedIA. The first stage is the early learning phase, in which the global model is updated by FedAvg [13]. The second is the modification stage, which re-weights each client by calculating its annotation completeness rate and corrects incomplete annotations synchronously. In the last stage, local models are trained with the corrected labels and aggregated for federated updating through FedAvg [13].

2.2 Annotation Completeness Estimation

Assessing the level of annotation completeness across clients is imperative, as it directly influences the tailored handling of each client’s data. To accomplish this, obtaining a model that is unaffected by imperfect annotations becomes essential. Our approach begins by interpreting incomplete annotations as noisy labels, with unmarked lesions contributing noise by altering pixel-level labels from 1 to 0. We draw inspiration from the early learning phenomenon in noisy label learning [11, 10, 12], which posits that neural networks initially fit clean labels in the early stages before progressively accommodating noisy labels. Accordingly, we can place confidence in the early training process despite the prevalence of noisy labels and utilize the early model to gauge annotation completeness across clients. Specifically, we develop an early-stage global model parameterized by $\theta_T$, capable of basic segmentation, through a warm-up period of $T$ communication rounds employing FedAvg [13]. In this process, the server aggregates client models based on their respective data contributions to formulate the global model, defined as:

$$\theta_t=\sum_{k=1}^{K}\frac{n_k}{n}\,\theta_{t,k}, \tag{1}$$

where $t$ denotes the current training round under the constraint $1\leq t\leq T$, and $n$ represents the amount of data from all clients, i.e., $n=\sum_{k=1}^{K}n_k$. In this stage, local optimization of each client is established on:

$$\min_{\theta_{t,k}}\sum_{(x_i,\tilde{y}_i)\in D_k}\ell_{dc}\left(f\left(x_i;\theta_{t,k}\right),\tilde{y}_i\right), \tag{2}$$

where $f:\mathcal{X}\rightarrow[0,1]^{\mathrm{H}\times\mathrm{W}}$ is the neural network and $\ell_{dc}:[0,1]^{\mathrm{H}\times\mathrm{W}}\times\mathcal{Y}\rightarrow\mathbb{R}_{+}$ is the Dice loss [14]. After obtaining the warm-up model parameterized by $\theta_T$ that is relatively unaffected by noise, the annotation completeness rate $\hat{a}_k$ of each client $k$ is estimated by the following formula:

$$\hat{a}_k=\frac{\sum_{i=1}^{n_k}c_{k,i}^{n}}{\sum_{i=1}^{n_k}c_{k,i}^{p}}, \tag{3}$$

where $c_{k,i}^{n}$ and $c_{k,i}^{p}$ denote the number of lesions in the noisy label $\tilde{y}_i$ and in the predicted map $y_i^{p}=f(x_i;\theta_T)$, respectively.
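As a minimal illustration of Eq. (3), the sketch below counts lesions as connected regions in the noisy labels and in binarized warm-up predictions, then takes the ratio of the two totals. The function names, the 4-connectivity, the 0.5 binarization threshold, and the fallback value when no lesion is predicted are all assumptions made for this example; they are not taken from the paper or its released code.

```python
from collections import deque


def count_lesions(mask):
    """Count 4-connected foreground regions in a binary 2D mask (list of lists)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                count += 1  # new region found; flood-fill it
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def estimate_completeness(noisy_labels, pred_probs, threshold=0.5):
    """Eq. (3): lesions marked in the noisy labels over lesions found in predictions."""
    marked = sum(count_lesions(y) for y in noisy_labels)
    predicted = 0
    for prob in pred_probs:
        # Assumption: probabilities are binarized at `threshold` before counting.
        binary = [[1 if v > threshold else 0 for v in row] for row in prob]
        predicted += count_lesions(binary)
    # Assumption: with no predicted lesion, treat the client as complete.
    return marked / predicted if predicted > 0 else 1.0


# Toy client with one image: one marked lesion, three predicted lesions.
label = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
prob = [[0.9, 0.1, 0.0, 0.8], [0.1, 0.0, 0.0, 0.7], [0.0, 0.0, 0.0, 0.0], [0.6, 0.6, 0.0, 0.0]]
a_hat = estimate_completeness([label], [prob])
```

In this toy case the estimate is one third, since only one of the three lesions found by the warm-up model is marked in the label.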

2.3 Annotation Completeness-Aware Aggregation

Quantity-based aggregation (i.e., FedAvg) is susceptible to noise caused by incomplete annotations, especially when such annotations are numerous [20]. To mitigate this, clients with higher annotation quality should dominate FL more. Therefore, we calculate a completeness-aware aggregation weight $w_k^t$ at round $t$ for each client $k$, defined as

$$w_k^t=\frac{\exp\left(\hat{a}_k/\ell_k^t\right)}{\sum_{j=1}^{K}\exp\left(\hat{a}_j/\ell_j^t\right)}, \tag{4}$$

where $\ell_k^t$ denotes the average loss of client $k$ at round $t$, calculated by

$$\ell_k^t=\frac{\sum_{(x_i,\tilde{y}_i)\in D_k}\ell_{dc}\left(f\left(x_i;\theta_{t,k}\right),\tilde{y}_i\right)}{n_k}. \tag{5}$$

Generally, the observed loss is lower when annotation completeness is higher. Consequently, the server prioritizes clients exhibiting lower losses, which reduces the negative effects that an imprecise estimation of $a_k$, potentially arising from an inappropriate selection of $T$, would have on the weighting process.
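The difference between the quantity-based weights of Eq. (1) and the completeness-aware weights of Eq. (4) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, model parameters are represented as flat lists of floats, the average losses are assumed to be strictly positive, and the subtraction of the maximum score is a standard numerical-stability step that leaves the softmax unchanged.

```python
import math


def fedavg_weights(sizes):
    """Eq. (1): weight each client by its share of the total data."""
    total = sum(sizes)
    return [n_k / total for n_k in sizes]


def completeness_aware_weights(a_hat, losses):
    """Eq. (4): softmax over a_hat_k / loss_k (losses assumed strictly positive)."""
    scores = [a / loss for a, loss in zip(a_hat, losses)]
    top = max(scores)  # shift by the max for stability; the softmax is unchanged
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def aggregate(client_params, weights):
    """Weighted average of client parameter vectors (flat lists of floats)."""
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(dim)]


# Four clients: higher estimated completeness and lower loss give a larger weight.
a_hat = [0.4, 0.6, 0.8, 1.0]
losses = [0.5, 0.4, 0.3, 0.25]
weights = completeness_aware_weights(a_hat, losses)
global_params = aggregate([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]], weights)
```

Because the score divides $\hat{a}_k$ by the loss, a client whose completeness is overestimated but whose loss is high is still down-weighted, which matches the motivation given above.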

2.4 Client-wise Adaptive Correction

The volume of data significantly influences the performance of neural networks, and datasets characterized by low annotation completeness represent valuable resources that should not be overlooked. Hence, rectifying incomplete annotations to acquire cleaner data for further training of the model is essential. Given that different clients possess datasets with varying levels of annotation completeness, the onset of noise impact and their robustness to noise vary accordingly.

To capture this information, in the early learning phase (i.e., $1\leq t\leq T$), we compute the IoU value of each client $k$ every round and fit these values with a first-order polynomial function:

$$\overline{\mathrm{IoU}}_k(t)=l_k\cdot t+b_k, \tag{6}$$

where $l_k$ and $b_k$ are the two parameters of the polynomial function. After early learning and in sync with the completeness-aware aggregation, we monitor the change of $\mathrm{IoU}_k$ every round. Any client satisfying the following condition will correct its annotations in the next round:

$$\overline{\mathrm{IoU}}_k(t)-\mathrm{IoU}_k(t)>\lambda, \tag{7}$$

where $\mathrm{IoU}_k(t)$ denotes the actual $\mathrm{IoU}$ value of client $k$ at round $t$, and $\lambda$ is an adjustable hyperparameter, set to 0.03 by default. In addition, the client only corrects pixels whose predicted probability exceeds a confidence threshold of 0.8. It is worth noting that we only correct pixels with a value of 0, because only false-negative lesions and no false-positive lesions are present in this setting.
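The correction trigger of Eqs. (6) and (7) and the pixel-level correction rule can be sketched as follows. This is a hedged example rather than the authors' implementation: the helper names are hypothetical, the line is fitted by ordinary least squares (the text only states that a first-order polynomial is fitted), and the per-round IoU values are taken as given inputs.

```python
def fit_line(rounds, ious):
    """Least-squares fit of IoU(t) ~ l * t + b over the early rounds (Eq. 6)."""
    n = len(rounds)
    mean_t = sum(rounds) / n
    mean_y = sum(ious) / n
    var_t = sum((t - mean_t) ** 2 for t in rounds)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(rounds, ious))
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def should_correct(slope, intercept, t, iou_t, lam=0.03):
    """Eq. (7): trigger when the fitted IoU exceeds the actual IoU by more than lam."""
    return (slope * t + intercept) - iou_t > lam


def correct_annotation(label, prob, conf=0.8):
    """Flip 0 -> 1 where the predicted probability exceeds conf; never flip 1 -> 0."""
    return [
        [1 if (y == 1 or p > conf) else 0 for y, p in zip(label_row, prob_row)]
        for label_row, prob_row in zip(label, prob)
    ]


# Early rounds 1..5 with steadily rising IoU, then a shortfall at round 8.
l_k, b_k = fit_line([1, 2, 3, 4, 5], [0.10, 0.20, 0.30, 0.40, 0.50])
trigger = should_correct(l_k, b_k, t=8, iou_t=0.70)
fixed = correct_annotation([[1, 0], [0, 0]], [[0.2, 0.9], [0.8, 0.5]])
```

Here the fitted line predicts an IoU of about 0.8 at round 8, so an observed value of 0.70 exceeds the default margin of 0.03 and triggers correction. Only the background pixel with probability 0.9 is relabeled; the marked pixel with low probability keeps its label.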

3 Experiments

3.1 Datasets and Implementation Details

3.1.1 Datasets.

Two public medical image segmentation datasets are included:

  1. Two real-world multiple sclerosis datasets, focusing on the segmentation of white matter lesions (WML) in 3D magnetic resonance (MR) brain images, denoted as MS, including MSSEG-1 [1] and PubMRI [9]. In this task, we only use the FLAIR modality, in which the lesions are relatively clear.

  2. The widely-used COVID dataset [16], aiming at segmentation and quantification of lung lesions caused by SARS-CoV-2 infection from computed tomography (CT) images, denoted as LUNG.

Each dataset is divided into training and test sets by a ratio of 8:2, and the training set is then randomly split among four clients. For computational efficiency, all 3D samples are converted into 2D slices and resized to 256×256 pixels.

To verify the robustness of our method to different degrees of incompleteness, several settings are used for evaluation. Specifically, for MS, the annotation completeness rate of the $k$-th client is set as $20\%\times k-10\%\times m+40\%$, and we conduct four sets of experiments, i.e., $m=0,1,2,3$. For LUNG, three different settings are used, and the completeness rates are formulated as $10\%\times k-30\%\times m+70\%$, where $m=0,1,2$.
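The two rate formulas can be checked against the per-client rates listed in the header of Table 1. The short sketch below assumes that clients are indexed $k=0,1,2,3$, which is inferred from those listed rates rather than stated explicitly in the text; the function name and its parameters are hypothetical.

```python
def completeness_rates(base, step_k, step_m, m, num_clients=4):
    """Rates in percent: step_k * k - step_m * m + base, for k = 0..num_clients-1."""
    return [step_k * k - step_m * m + base for k in range(num_clients)]


# MS: 20% * k - 10% * m + 40%, with m = 0, 1, 2, 3.
ms = {m: completeness_rates(40, 20, 10, m) for m in range(4)}
# LUNG: 10% * k - 30% * m + 70%, with m = 0, 1, 2.
lung = {m: completeness_rates(70, 10, 30, m) for m in range(3)}

for m, rates in ms.items():
    print("MS   m=%d: %s" % (m, "/".join(str(r) for r in rates)))
for m, rates in lung.items():
    print("LUNG m=%d: %s" % (m, "/".join(str(r) for r in rates)))
```

With this indexing, the hardest MS setting ($m=3$) gives rates of 10%, 30%, 50%, and 70%, and the hardest LUNG setting ($m=2$) gives 10%, 20%, 30%, and 40%.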

3.1.2 Incomplete Annotation Generation.

When doctors or other professionals label multi-lesion data, they tend to label one lesion at the 3D volume level before annotating another. Therefore, to simulate real noise and generate the incomplete annotation $\tilde{v}_j$, lesions are randomly removed at the 3D level. This process allows us to mimic real-world conditions more accurately. Specifically, we first set the annotation completeness of each client as $a_k$, which is unknown during training. Then, we calculate the number of lesion connected components $c_j^{g}$ in each 3D sample $v_j$ and randomly choose $c_j^{n}$ lesion regions, where $c_j^{n}=c_j^{g}\cdot a_k$. Only the chosen lesion regions are kept as well-annotated, while the others are set as background (i.e., incomplete/missing annotation).
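The lesion-removal procedure can be sketched as below. The example assumes that each 3D volume has already been converted into a component map in which every lesion carries a distinct positive integer id and background is 0; how that map is obtained is outside this sketch. The function names, the use of Python's built-in `round` to turn $c_j^{g}\cdot a_k$ into an integer, and the fixed seed are illustrative assumptions.

```python
import random


def choose_kept_lesions(lesion_ids, a_k, rng):
    """Randomly keep c_n = round(c_g * a_k) of the c_g lesions in one 3D sample."""
    c_g = len(lesion_ids)
    c_n = round(c_g * a_k)  # assumption: the paper does not state the rounding rule
    return set(rng.sample(list(lesion_ids), c_n))


def make_incomplete_annotation(component_map, kept):
    """Binary annotation: kept lesions stay 1, removed lesions become background."""
    return [
        [[1 if v in kept else 0 for v in row] for row in slice_2d]
        for slice_2d in component_map
    ]


# Toy volume with 2 slices and 5 lesions (ids 1..5); lesion 1 spans both slices.
volume = [
    [[1, 0, 2], [0, 0, 0], [3, 0, 4]],
    [[1, 0, 0], [0, 5, 0], [0, 0, 0]],
]
rng = random.Random(0)
kept = choose_kept_lesions([1, 2, 3, 4, 5], a_k=0.4, rng=rng)
noisy = make_incomplete_annotation(volume, kept)
```

Because selection happens per lesion id rather than per slice, a lesion that spans several slices is kept or removed as a whole, which is the 3D-level behavior described above.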

3.1.3 Implementation Details.

In this work, U-Net [15] is adopted as the foundational model architecture for FL. The FL training process includes a total of 300 communication rounds, with each local training phase consisting of a single local epoch. During local training, the model is optimized via the Adam optimizer with momentum terms set to $(0.9, 0.99)$, a batch size of 4, and an initial learning rate of 1e-4. To accommodate the early learning strategy, the number of early learning rounds, $T$, is set as 10 for MS and 40 for LUNG.
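For reference, the hyperparameters stated in Sections 2.4 and 3.1 can be collected in a single configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not correspond to the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FedIAConfig:
    """Hyperparameters as reported in the text (field names are hypothetical)."""
    num_clients: int = 4            # training set split into four clients
    rounds: int = 300               # total communication rounds
    local_epochs: int = 1           # local epochs per round
    batch_size: int = 4
    learning_rate: float = 1e-4     # initial learning rate for Adam
    adam_betas: tuple = (0.9, 0.99)
    warmup_rounds: int = 10         # early learning rounds T
    lam: float = 0.03               # IoU deviation threshold (lambda) in Eq. (7)
    conf_threshold: float = 0.8     # confidence needed to relabel a pixel
    image_size: int = 256           # slices resized to 256 x 256


MS_CONFIG = FedIAConfig(warmup_rounds=10)
LUNG_CONFIG = FedIAConfig(warmup_rounds=40)
```

Only the number of early learning rounds differs between the two datasets in the reported setup.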

3.2 Comparison with State-of-the-art Methods

In our analysis, we benchmark FedIA against recent leading methods tailored to address label noise in both classification and segmentation tasks, including ELR (NeurIPS’20) [12] and ADELE (CVPR’22) [11], which leverage the early learning phenomenon to prevent model overfitting to noisy labels; RMD (TMI’23) [3], which mitigates annotation noise in medical imaging through mutual distillation; NR-Dice (TMI’20) [17], which introduces a noise-robust Dice loss to combat noisy labels; FedNoRo (IJCAI’23) [21], designed to manage class-imbalanced noisy data; and FedCorr (CVPR’22) [23], which employs the LID score to identify noisy clients. Additionally, we incorporate the universally recognized FL framework, FedAvg [13], as a baseline for comparison. Detailed implementations of these methods are available in the supplementary material.

Quantitative comparison results on MS and LUNG measured by Dice coefficient (%) are summarized in Table 1. Notably, our FedIA exhibits consistent performance even as annotation completeness diminishes, in contrast to some methods whose effectiveness wanes with decreased annotation completeness. This demonstrates that FedIA surpasses other sophisticated methods across various datasets and configurations, underscoring the robustness and efficacy of our strategy in tackling the challenge.

Table 1: Comparison results with state-of-the-art methods under the MS and LUNG settings. $c_0/c_1/c_2/c_3$ means the annotation completeness rates $a_k$ of clients are $c_k\times 10\%$, corresponding to the setting of $m$. The average results (%) from the last ten rounds are reported. The best results are marked in bold.

| Methods | Venue | MS, m=0 (4/6/8/10) | MS, m=1 (3/5/7/9) | MS, m=2 (2/4/6/8) | MS, m=3 (1/3/5/7) | LUNG, m=0 (7/8/9/10) | LUNG, m=1 (4/5/6/7) | LUNG, m=2 (1/2/3/4) |
|---|---|---|---|---|---|---|---|---|
| FedAvg | AISTATS’17 | 60.88 | 55.77 | 34.65 | 13.98 | 54.40 | 46.74 | 28.47 |
| ELR | NeurIPS’20 | 63.01 | 57.97 | 35.91 | 9.50 | 49.29 | 35.01 | 11.22 |
| NR-Dice | TMI’20 | 69.19 | 64.42 | 60.16 | 23.60 | 58.36 | 54.32 | 33.75 |
| ADELE | CVPR’22 | 61.34 | 58.63 | 25.01 | 0 | 54.33 | 44.14 | 17.89 |
| FedCorr | CVPR’22 | 62.03 | 57.12 | 30.35 | 0 | 55.24 | 50.56 | 23.78 |
| RMD | TMI’23 | 62.77 | 59.15 | 41.60 | 17.15 | 48.60 | 32.79 | 9.37 |
| FedNoRo | IJCAI’23 | 67.09 | 60.78 | 39.74 | 31.53 | 49.86 | 37.25 | 19.16 |
| FedIA | Ours | **74.73** | **74.03** | **69.22** | **56.53** | **59.42** | **55.34** | **44.72** |

Figure 3: Qualitative comparison on MS, with panels (a) GT, (b) FedIA, (c) FedNoRo, (d) RMD, (e) NR-Dice, and (f) others, where others represents FedAvg, ELR, ADELE, and FedCorr, which fail to segment any lesion. Red, blue, and green show the predicted true-positive, false-negative, and false-positive regions, respectively.

An exemplar qualitative comparison on MS under the annotation completeness setting of 10%, 30%, 50%, and 70% (i.e., $m=3$) is illustrated in Fig. 3. FedIA effectively recalls all lesions with fewer false positives, leading to the best segmentation performance. Comparatively, FedNoRo, RMD, and NR-Dice suffer from extensive false negatives, resulting in noisy segmentation. Worse still, FedAvg and the other comparison methods completely fail to segment any lesion, indicating the difficulty of addressing heterogeneous annotation completeness and the necessity and value of FedIA. More qualitative results are available in the supplementary material.

3.3 Ablation Study

3.3.1 Component-wise Study.

We conduct an ablation study by separately removing the Annotation Completeness-Aware Aggregation (ACAG) and the Client-wise Adaptive Correction (CAC) components from FedIA, as summarized in Table 2. We observe that FedAvg can benefit from both components, particularly under the lowest annotation completeness settings. This phenomenon demonstrates the effectiveness of our designs against annotation noise. The best performance is typically achieved when both components are incorporated.

3.3.2 Impact of the Early Learning Round $T$.

It is worth noting that the number of early learning rounds, denoted by $T$, is set differently for the two tasks, as LUNG presents a more complex learning challenge than MS and therefore requires a longer early learning period. This variation raises the question of how many training rounds are necessary for effective early training. To address this, we perform ablation studies on LUNG under various $T$ settings: 10, 20, 30, and 40, as summarized in Table 3. The results indicate that our method exhibits considerable robustness to changes in $T$. Notably, FedIA consistently outperforms the baseline FedAvg across all tested $T$ selections, demonstrating its robustness and superior performance irrespective of the early learning duration. More ablation studies are available in the supplementary material for reference.

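Table 2: Component-wise study.

| FedAvg | ACAG | CAC | MS, m=0 | MS, m=1 | MS, m=2 | MS, m=3 |
|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 60.88 | 55.77 | 34.65 | 13.98 |
| ✓ | | | 71.91 | 70.34 | 66.95 | 43.40 |
| ✓ | | | 74.23 | 73.67 | 67.43 | 37.74 |
| ✓ | ✓ | ✓ | 74.73 | 74.03 | 69.22 | 56.53 |
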
Table 3: Impact of the Round $T$.

| $T$ | LUNG, m=0 | LUNG, m=1 | LUNG, m=2 |
|---|---|---|---|
| 10 | 56.53 | 52.25 | 46.88 |
| 20 | 59.36 | 54.78 | 45.37 |
| 30 | 59.52 | 55.07 | 44.74 |
| 40 | 59.42 | 55.34 | 44.72 |

4 Conclusion

In this study, we tackle a significant yet overlooked challenge in federated medical image segmentation: how to enhance FL against heterogeneity in annotation completeness. We treat incomplete annotations as akin to noisy data and propose FedIA to mitigate their negative impact. FedIA first assesses the level of annotation completeness at the client level through a designed indicator. Then, it prioritizes clients with greater annotation completeness and undertakes corrective measures for those with incomplete annotations, aiming to ensure that the training process is mainly based on accurate knowledge. Experiments on two extensively utilized medical image segmentation datasets affirm the effectiveness of FedIA, showcasing its advantage over current leading approaches. We believe that the issue formulated and the proposed solution will pave the way for more practical FL applications in complex medical scenarios.

Acknowledgement. This work was supported in part by the National Natural Science Foundation of China under Grants 62202179 and 62271220, and in part by the Natural Science Foundation of Hubei Province of China under Grant 2022CFB585. The computation is supported by the HPC Platform of HUST.

References

  • [1] Commowick, O., Istace, A., Kain, M., Laurent, B., Leray, F., Simon, M., Pop, S.C., Girard, P., Ameli, R., Ferré, J.C., et al.: Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Scientific reports 8(1), 13650 (2018)
  • [2] Dou, Q., So, T.Y., Jiang, M., Liu, Q., Vardhanabhuti, V., Kaissis, G., Li, Z., Si, W., Lee, H.H., Yu, K., et al.: Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study. npj Digit. Medicine 4(1), 60 (2021)
  • [3] Fang, C., Wang, Q., Cheng, L., Gao, Z., Pan, C., Cao, Z., Zheng, Z., Zhang, D.: Reliable mutual distillation for medical image segmentation under imperfect annotations. IEEE Trans. Med. Imaging (2023)
  • [4] Huang, W., Ye, M., Du, B.: Learn from others and be yourself in heterogeneous federated learning. In: CVPR (2022)
  • [5] Huang, W., Ye, M., Shi, Z., Du, B.: Generalizable heterogeneous federated cross-correlation and instance similarity learning. TPAMI (2023)
  • [6] Huang, W., Ye, M., Shi, Z., Li, H., Du, B.: Rethinking federated learning with domain shift: A prototype view. In: CVPR (2023)
  • [7] Huang, W., Ye, M., Shi, Z., Wan, G., Li, H., Du, B., Yang, Q.: Federated learning for generalization, robustness, fairness: A survey and benchmark. arXiv (2023)
  • [8] Jiang, M., Roth, H.R., Li, W., Yang, D., Zhao, C., Nath, V., Xu, D., Dou, Q., Xu, Z.: Fair federated medical image segmentation via client contribution estimation. In: CVPR. pp. 16302–16311 (2023)
  • [9] Lesjak, Ž., Galimzianova, A., Koren, A., Lukin, M., Pernuš, F., Likar, B., Špiclin, Ž.: A novel public MR image dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus. Neuroinformatics 16, 51–63 (2018)
  • [10] Li, J., Socher, R., Hoi, S.C.: DivideMix: Learning with noisy labels as semi-supervised learning. In: ICLR (2020)
  • [11] Liu, S., Liu, K., Zhu, W., Shen, Y., Fernandez-Granda, C.: Adaptive early-learning correction for segmentation from noisy annotations. In: CVPR. pp. 2606–2616 (2022)
  • [12] Liu, S., Niles-Weed, J., Razavian, N., Fernandez-Granda, C.: Early-learning regularization prevents memorization of noisy labels. In: NeurIPS. vol. 33, pp. 20331–20342 (2020)
  • [13] McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: AISTATS. pp. 1273–1282 (2017)
  • [14] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571 (2016)
  • [15] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
  • [16] Roth, H.R., Xu, Z., Tor-Díez, C., Jacob, R.S., Zember, J., Molto, J., Li, W., Xu, S., Turkbey, B., Turkbey, E., et al.: Rapid artificial intelligence solutions in a pandemic—the covid-19-20 lung ct lesion segmentation challenge. MIA 82, 102605 (2022)
  • [17] Wang, G., Liu, X., Li, C., Xu, Z., Ruan, J., Zhu, H., Meng, T., Li, K., Huang, N., Zhang, S.: A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans. Med. Imaging 39(8), 2653–2663 (2020)
  • [18] Wang, J., Jin, Y., Wang, L.: Personalizing federated medical image segmentation via local calibration. In: ECCV. pp. 456–472 (2022)
  • [19] Wu, N., Kuang, Z., Yan, Z., Yu, L.: From optimization to generalization: Fair federated learning against quality shift via inter-client sharpness matching. In: IJCAI (2024)
  • [20] Wu, N., Sun, Z., Yan, Z., Yu, L.: FedA3I: Annotation quality-aware aggregation for federated medical image segmentation against heterogeneous annotation noise. In: AAAI (2024)
  • [21] Wu, N., Yu, L., Jiang, X., Cheng, K.T., Yan, Z.: FedNoRo: Towards noise-robust federated learning by addressing class imbalance and label noise heterogeneity. In: IJCAI (2023)
  • [22] Wu, N., Yu, L., Yang, X., Cheng, K.T., Yan, Z.: FedIIC: Towards robust federated learning for class-imbalanced medical image classification. In: MICCAI (2023)
  • [23] Xu, J., Chen, Z., Quek, T.Q., Chong, K.F.E.: FedCorr: Multi-stage federated learning for label noise correction. In: CVPR. pp. 10184–10193 (2022)