FairDiff: Fair Segmentation with Point-Image Diffusion

Wenyi Li*¹, Haoran Xu*¹, Guiyu Zhang*¹, Huan-ang Gao¹, Mingju Gao¹, Mengyu Wang², Hao Zhao¹

¹ Institute for AI Industry Research (AIR), Tsinghua University
² Harvard Ophthalmology AI Lab, Harvard University
Email: liwenyi19@mails.ucas.ac.cn, zhaohao@air.tsinghua.edu.cn
* Indicates Equal Contribution. Indicates Corresponding Author.

Abstract

Fairness is an important topic for medical image analysis, driven by the challenge of unbalanced training data among diverse target groups and the societal demand for equitable medical quality. In response to this issue, our research adopts a data-driven strategy: enhancing data balance by integrating synthetic images. However, in terms of generating synthetic images, previous works either lack paired labels or fail to precisely control the boundaries of synthetic images so that they align with those labels. To address this, we formulate the problem in a joint optimization manner, in which three networks are optimized towards the goal of empirical risk minimization and fairness maximization. On the implementation side, our solution features a Point-Image Diffusion architecture, which leverages 3D point clouds for improved control over mask boundaries through a point-mask-image synthesis pipeline. This method significantly outperforms existing techniques in synthesizing scanning laser ophthalmoscopy (SLO) fundus images. By combining synthetic data with real data during the training phase using a proposed Equal Scale approach, our model achieves superior fairness segmentation performance compared to the state-of-the-art fairness learning models. Code is available at https://github.com/wenyi-li/FairDiff.

Keywords: Fairness Learning · Image Synthesis · Fundus Image Segmentation · Diffusion Models

1 Introduction

The pursuit of fairness in medical image analysis is an important topic because training data for different groups are usually unbalanced, while society calls for equitable medical quality across diverse target groups. To address unfairness, there are two principal strategies: optimization-driven and data-driven approaches. The optimization-driven approaches [32, 16, 21, 10, 18] apply fair learning methods during the training of perception models [7, 27, 8], such as adjusting the loss weights for different sensitive attributes or incorporating additional regularization losses to minimize bias across groups. By contrast, a more principled solution, which we call the data-driven approach, addresses the root cause of the issue, namely the unbalanced data distribution, by augmenting datasets with additional images from underrepresented groups. However, acquiring medical images, such as scanning laser ophthalmoscopy (SLO) fundus images, from vulnerable and under-represented populations proves challenging [22, 25]. Therefore, we resort to image synthesis methods [20, 33, 6] to generate additional data, aiming to improve the fairness of results.

A recent study, FairSeg [28], proposes the first fairness dataset for medical segmentation, including 10,000 SLO fundus images from 10,000 patients with pixel-wise disc and cup segmentation mask annotations. The cup-disc area is important for diagnosing a range of eye conditions. However, the anatomical structures of the fundus vary across racial groups. For instance, Black individuals often have a larger cup-to-disc ratio than other racial groups, and Asian individuals are more prone to angle-closure glaucoma than White individuals. Therefore, the diversity of the cup-disc area presents challenges to the synthesis of images.

Figure 1: Comparison of Traditional Noise-Image Diffusion and Our Point-Image Diffusion Methods. We transform 2D mask data into a 3D point cloud format, leveraging the spatial coordinates to delineate boundaries more accurately.

Previous medical image synthesis works [14, 5, 3] mainly focused on the direct generation of medical images. Although these synthesized images often closely mimic the distribution of real images, they lack paired labels, and the process of annotating these images is time-consuming and labor-intensive. Several mask-to-image works, like Freemask [30] and OASIS [26], can generate images from masks but utilize the same set of masks for both real and synthetic samples, which leads to a lack of diversity. SEGGEN [31] proposed MaskSyn, which can generate masks through a 2D diffusion model [19]. However, accurately controlling these mask boundaries in two-dimensional space using GANs [9, 1, 13] and 2D diffusion models [12, 20, 4] is challenging because of the inherent constraints in capturing boundaries with pixel-level detail [23, 24].

To address these challenges, we explore the utilization of sampling points that convert pixel boundaries into spatial coordinates to enhance boundary control. Sampling points on a 2D mask results in a 2D point cloud. However, to distinguish different category boundaries, we transform the point cloud into a 3D format for better point feature learning, where the z-axis is used to mark categories, while the x and y coordinates retain their roles as the plane coordinates. The process is illustrated in Fig. 1.

In this paper, we formulate the fairness problem in a joint optimization manner, in which three networks are optimized towards the goal of empirical risk minimization and fairness maximization. On the implementation side, we introduce a novel Point-Image Diffusion architecture. In this framework, we first generate segmentation masks using point cloud diffusion and then synthesize images based on the control of these synthesized masks. After acquiring a substantial amount of synthetic data for various minority groups, we employ an equal-scale data-combining approach to ensure that the populations of each sensitive attribute group are balanced.

Contributions. In summary, our contributions are as follows: (1) We introduce a novel Point-Image Diffusion approach for medical image synthesis, which utilizes 3D point clouds to enhance mask boundary control. This method significantly outperforms existing techniques in SLO fundus image synthesis. (2) Downstream segmentation tasks verify the efficacy of the synthesized data, demonstrating improved segmentation performance. (3) By simply integrating synthetic and real data in the training phase, we achieve superior fairness segmentation performance compared to the state-of-the-art fairness learning models.

Figure 2: Overview of Our Fairness-aware Point-Image Diffusion Framework.

2 Methodology

2.1 Overview

Preliminaries. Fairness in a segmentation model can be defined as the model's ability to score evenly for images of different target groups. Consider a dataset $D$ made up of image pairs $(x, y)$, where $x$ is the input image and $y$ is the ground-truth segmentation mask; $\hat{y}$ denotes the predicted mask. To measure the fairness of segmentation, we introduce a metric $\text{Fairness}_{\theta}(D)$ of segmentation method $\theta$ on the dataset $D$. Given sensitive attributes $S = \{s_1, \ldots, s_i, \ldots, s_k\}$, each attribute divides the population into groups $G = \{g_1, \ldots, g_n\}$. The fairness metric for segmentation can be represented as:

$$\text{Fairness}_{\theta}(D) = -\sum_{i=1}^{k} \left( \text{Var}_{G}\left[ \text{M}_{\theta}(\hat{y}, y \mid s_i) \right] \right) \tag{1}$$

where $\text{M}_{\theta}(\hat{y}, y \mid s_i)$ is a performance metric such as mIoU or the Dice coefficient, and $\text{Var}_{G}$ is the variance of that metric across the groups under $s_i$. The metric is at most zero and reaches zero only when every group under every attribute scores the same.
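
As a rough illustration of Eq. (1), the sketch below sums the variance of per-group scores over the sensitive attributes and negates the result. The function name `fairness`, the nested-dictionary input, the example scores, and the use of the population variance are assumptions made for illustration; the text does not say which variance estimator is used.

```python
from statistics import pvariance


def fairness(scores_by_attribute):
    """Negative sum, over sensitive attributes, of the variance of per-group scores.

    scores_by_attribute maps an attribute name to a dict {group name: metric value},
    where the metric is e.g. Dice or mIoU measured on that group (hypothetical layout).
    """
    total = 0.0
    for group_scores in scores_by_attribute.values():
        # Population variance across the groups of one attribute (an assumption).
        total += pvariance(list(group_scores.values()))
    return -total


# Made-up scores, for illustration only.
scores = {
    "race": {"Asian": 0.80, "Black": 0.82, "White": 0.84},
    "gender": {"Male": 0.83, "Female": 0.83},
}
print(fairness(scores))
```

Identical per-group scores give a value of zero, and any spread between groups makes it negative, so a larger value means a more even model.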

Overview of Framework. To achieve fairness, our research primarily explores data-driven methodologies, focusing on the use of synthetic data. As illustrated in Fig. 2, we introduce a novel Point-Image architecture designed specifically for high-quality SLO fundus image synthesis. The first step is to generate segmentation masks from Gaussian noise $\mathcal{N}$ using a 3D diffusion model with parameters $\phi$; the second is to generate images $x$ from the masks using a 2D diffusion model with parameters $\psi$. After combining the real data with the synthesized data, we train a segmentation model parameterized by $\theta$ on the result. The overall optimization can be defined as:

$$\min_{\theta} \max_{\phi, \psi} \text{Fairness}(D, \theta, \phi, \psi) \tag{2}$$

2.2 Point-Mask Generation

To produce SLO fundus images with diverse cup-disc shapes and to acquire precise label maps for the synthetic images, we first augment segmentation masks using the labels from real-world data.

Transformation to 3D Point Cloud. Given a 2D mask image $I$ of size $W \times H$, where $W$ and $H$ are the width and height of the image, the function $f: I \rightarrow P$ maps $I$ to a 3D point cloud $P$ for training. $f$ is defined as follows:

$$f(I) = \{\, p_i = (x_i, y_i, z_i) \mid (x_i, y_i) \in \text{Boundary}(I),\ z_i = g((x_i, y_i), I) \,\} \tag{3}$$

where $(x_i, y_i)$ are the coordinates of a pixel in $I$, $\text{Boundary}(I)$ is the set of pixels situated at the segmentation boundaries, and $g((x_i, y_i), I)$ is a function that assigns the $z_i$ value based on the pixel's position. $g$ is defined as follows:

$$g((x_i, y_i), I) = \begin{cases} z_0 & \text{if $(x_i, y_i)$ is on the boundary of the Cup} \\ -z_0 & \text{if $(x_i, y_i)$ is on the boundary of the Disc} \end{cases} \tag{4}$$
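
The mapping in Eqs. (3) and (4) can be sketched in NumPy as follows. Several details here are assumptions for illustration and are not specified in the text: the label encoding (0 = background, 1 = disc rim, 2 = cup), the 4-neighbour definition of a boundary pixel, the value `z0 = 1.0`, and uniform random sampling of a fixed number of points. The function names are hypothetical.

```python
import numpy as np


def boundary(region):
    """Pixels of a boolean region with at least one 4-neighbour outside the region."""
    padded = np.pad(region, 1, constant_values=False)
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]      # neighbours above and below
        & padded[1:-1, :-2] & padded[1:-1, 2:]    # neighbours left and right
    )
    return region & ~interior


def mask_to_point_cloud(mask, z0=1.0, cup_label=2, disc_labels=(1, 2)):
    """Lift cup and disc boundary pixels to (x, y, z) points with z = +z0 / -z0."""
    cup = mask == cup_label
    disc = np.isin(mask, disc_labels)  # assumed: the disc region contains the cup
    parts = []
    for region, z in ((cup, z0), (disc, -z0)):
        ys, xs = np.nonzero(boundary(region))
        parts.append(np.stack([xs, ys, np.full(xs.shape, z)], axis=1))
    return np.concatenate(parts, axis=0).astype(np.float32)


def sample_points(points, n=512, seed=0):
    """Draw n points uniformly at random (with replacement only if too few exist)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]


# Toy 7x7 mask: a 5x5 disc with a 3x3 cup inside it.
mask = np.zeros((7, 7), dtype=np.uint8)
mask[1:6, 1:6] = 1
mask[2:5, 2:5] = 2
cloud = mask_to_point_cloud(mask)
print(cloud.shape, sample_points(cloud).shape)
```

Because the two boundary classes sit on separate planes, a point's category can be read back from the sign of its z coordinate.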

Point Cloud Diffusion for Generation. After converting the existing 2D labels into 3D point clouds, we employ a point cloud diffusion model based on [15] to learn the distribution of these point clouds. The primary training goal of this model is to simulate the reverse of a random diffusion process, learning to move from a normal distribution to the distribution of real point clouds. During the training phase, we introduce varying degrees of random noise into the point clouds and ensure that the denoising model predicts noise that closely matches the actual noise added.
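
The training objective described above can be sketched with a generic DDPM-style noising step and noise-prediction loss. This is not the exact model of [15]; the linear noise schedule, the number of steps, and all function names are illustrative assumptions, and the denoising network itself is left out.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps and return (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between the predicted and the actually added noise."""
    return float(np.mean((eps_pred - eps) ** 2))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(512, 3))   # a stand-in normalized point cloud
t = int(rng.integers(0, len(alpha_bar)))     # random noise level for this sample
xt, eps = add_noise(x0, t, alpha_bar, rng)
# A trained denoiser would predict eps from (xt, t); a zero prediction is used
# here only to show how the loss is evaluated.
print(noise_prediction_loss(np.zeros_like(eps), eps))
```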

For each group $g_i$ of the sensitive attribute $S$, we train a point cloud diffusion model $\phi_i$. Since $\phi_i$ can effectively capture the characteristics of the cup-disc contour for different groups, we can selectively augment samples for different groups, particularly for minority populations. Through this approach, we prepare the label sets for the subsequent procedure.

2.3 Mask-Image Generation

Given the generated mask $m$ as condition $c$, the next step is to synthesize an image $x$. The mask $m$ is integrated by introducing the extra condition $c$ into a neural network block, via an architecture known as ControlNet [33]. This method freezes the original Stable Diffusion block's parameters $\Phi$, replicates the block into a trainable copy with parameters $\Phi_c$, and connects the two blocks with two zero-initialized $1 \times 1$ convolutional layers. Specifically, the mask $m$ is encoded into tokens $c_f = E(c_i)$, which are then fed into ControlNet. The output of ControlNet $y_c$ is given by:

$$y_c = F(x; \Phi) + Z\left( F\left( x + Z(c; \Phi_{z1}); \Phi_c \right); \Phi_{z2} \right) \tag{5}$$

where $y_c$ is the output from the ControlNet block, $Z(\cdot;\cdot)$ represents the zero convolutional layers, and $\Phi_{z1}$ and $\Phi_{z2}$ are the parameters of the two zero convolutional layers. At the beginning of training, $y_c$ equals $y = F(x; \Phi)$, the output of the original frozen block, because the weights and biases of the zero convolutional layers are initialized to zero; this ensures that no harmful noise is introduced into the network's hidden states. As training progresses, the zero convolutional layers gradually adapt the output based on the input condition $c_f$, thereby achieving control over the original feature map $x$.
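
The zero-initialization property of Eq. (5) can be checked numerically with a toy example. The sketch below replaces the Stable Diffusion block $F$ with a small stand-in (a channel-mixing map followed by tanh); the shapes, the stand-in block, and all names are assumptions for illustration only.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def block(x, weight):
    """Toy stand-in for a network block F(x; params)."""
    return np.tanh(np.einsum("oc,chw->ohw", weight, x))


def controlnet_block(x, c, phi, phi_c, z1, z2):
    """y_c = F(x; phi) + Z(F(x + Z(c; z1); phi_c); z2), following Eq. (5)."""
    frozen = block(x, phi)
    control = conv1x1(block(x + conv1x1(c, *z1), phi_c), *z2)
    return frozen + control


rng = np.random.default_rng(0)
channels = 4
x = rng.standard_normal((channels, 8, 8))      # feature map
c = rng.standard_normal((channels, 8, 8))      # encoded condition
phi = rng.standard_normal((channels, channels))
phi_c = phi.copy()                             # trainable copy starts from the frozen weights
zero = (np.zeros((channels, channels)), np.zeros(channels))

y = block(x, phi)
y_c = controlnet_block(x, c, phi, phi_c, zero, zero)
print(np.allclose(y_c, y))
```

With both zero convolutions at their initial values the control branch contributes nothing, so the combined block reproduces the frozen block until training moves those weights away from zero.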

2.4 Equal-Scale Data Combination

The method of combining real and synthetic data is straightforward. We term it Equal-Scale Data Combination: it balances the sample sizes across all sensitive groups. Let $D_r = \{x_{r,1}, x_{r,2}, \ldots, x_{r,N_r}\}$ and $D_s = \{x_{s,1}, x_{s,2}, \ldots, x_{s,N_s}\}$ be the sets of samples drawn from the real data distribution $P_r$ and the synthetic data distribution $P_s$, respectively, where $N_r$ and $N_s$ denote the number of samples in the real and synthetic datasets. The equal-scale combination process involves augmenting the dataset with synthetic samples for underrepresented groups or possibly subsampling overrepresented groups.
For each sensitive group $g$ with $N_{g,r}$ real samples, if $N_{g,r} < N_{target}$, generate $(N_{target} - N_{g,r})$ synthetic samples from $P_s$ specific to group $g$, resulting in the set $D_{g,s}^{*}$. If $N_{g,r} > N_{target}$, randomly sample $N_{target}$ of the group's real samples to form $D_{g,r}^{*}$; otherwise $D_{g,r}^{*}$ contains all real samples of the group. $N_{target}$ is the target sample size, which could be based on the size of the largest group or on a specified threshold for fairness. The combined dataset $D$ for training can be represented as:

$$D = \bigcup_{g} \left( D_{g,r}^{*} \cup D_{g,s}^{*} \right) \tag{6}$$
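
The per-group rule above can be sketched as a small function. The data layout (dictionaries mapping a group name to a list of samples), the function name, and the choice to raise an error when too few synthetic samples are available are assumptions for illustration.

```python
import random


def equal_scale_combine(real_by_group, synth_by_group, n_target, seed=0):
    """Return {group: samples} in which every group holds exactly n_target samples."""
    rng = random.Random(seed)
    combined = {}
    for group, real in real_by_group.items():
        if len(real) > n_target:
            # Over-represented group: randomly subsample the real data.
            kept = rng.sample(real, n_target)
        else:
            kept = list(real)
        n_missing = n_target - len(kept)
        pool = synth_by_group.get(group, [])
        if n_missing > len(pool):
            raise ValueError(f"not enough synthetic samples for group {group!r}")
        # Under-represented group: top up with synthetic samples of the same group.
        combined[group] = kept + list(pool[:n_missing])
    return combined


# Hypothetical sample identifiers, for illustration only.
real = {"A": ["a1", "a2", "a3", "a4", "a5"], "B": ["b1", "b2"]}
synth = {"B": ["s1", "s2", "s3", "s4"]}
balanced = equal_scale_combine(real, synth, n_target=3)
training_set = [sample for samples in balanced.values() for sample in samples]
print({group: len(samples) for group, samples in balanced.items()})
```

Flattening the per-group lists gives the union in Eq. (6).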

3 Experiments and Results

3.1 Setup

Datasets. We use Harvard-FairSeg [28] as the real SLO fundus image dataset. It includes six critical attributes for comprehensive studies on fairness, which are age, gender, race, ethnicity, language preference, and marital status. The fairness and segmentation results of all models, whether using synthetic data or not, are tested on the test split of 2,000 real SLO fundus images.

Segmentation Models. To verify the impact of our synthetic data on segmentation and fairness, we selected two segmentation models: a smaller model, TransUNet [2], and a larger model, SAMed [34] (the experiments on the latter are provided in the supplementary).

Training Details. For the training of image synthesis, we utilize 512 points to sample the boundaries of each original mask. These point clouds are then normalized. Training is performed on NVIDIA 3090 GPUs with a batch size of 48 and a learning rate of 1e-4 across 2,000,000 steps. For the training of the segmentation model, following the experimental setup of FairSeg [28], we employed a combination of cross-entropy and Dice losses as the training loss and used the AdamW optimizer. To enable effective comparisons, TransUNet was trained with a base learning rate of 0.01 and a momentum of 0.9 over 300 epochs, while SAMed was trained with a base learning rate of 0.005 and a momentum of 0.9 for 100 epochs. The batch size for both was set to 48. We fix the number of training samples at 8,000, whether using only real data or a mix of real and synthetic data.

3.2 Synthesis Quality Results

Metrics. To evaluate the generation quality, we employ the Fréchet inception distance (FID) [11], minimum matching distance (MMD) and the coverage score (COV). The detailed definitions of these metrics can be found in the supplementary materials.

Results. We compare our Point-Image generation method with several state-of-the-art methods, including Stable Diffusion 1.5 [20], pix2pixHD [29], OASIS [26], SPADE [17] and ControlNet [33]. As shown in Tab. 1, our method outperforms existing techniques in SLO fundus image synthesis on all three metrics. Notably, our approach achieves the lowest FID score (60.51), indicating that our generated images bear a closer resemblance to the real images than those of other methods. Furthermore, the MMD results suggest that our method also more accurately replicates the distribution of the original image dataset.

Ablation over two-stage diffusion. Compared with ControlNet [33] (one-stage label-to-image synthesis), our two-stage pipeline, in which we first sample labels and then synthesize images, is effective at generating diverse images, as reflected by the highest COV (coverage) score among the methods evaluated (10.83 versus 9.45 for ControlNet). The enhancement in image quality and diversity underscores the efficacy of our image synthesis technique. Fig. 3 visualizes the results of image synthesis.

Figure 3: Visualization of Different Image Synthesis Results.
Table 1: Comparison of Synthesis Quality.

| Method | FID↓ | MMD↓ | COV↑ |
|---|---|---|---|
| SD1.5 [20] | 167.39 | 33.21 | 3.13 |
| pix2pixHD [29] | 157.02 | 22.73 | 4.63 |
| OASIS [26] | 89.92 | 28.57 | 3.07 |
| SPADE [17] | 77.26 | 23.82 | 5.75 |
| ControlNet [33] (w/o Point-Mask) | 67.29 | 23.60 | 9.45 |
| Ours (w/ Point-Mask) | 60.51 | 20.06 | 10.83 |

3.3 Fairness Segmentation Results

Metrics. Following prior work [28], we measure the fairness segmentation results using Equity-Scaled Segmentation Performance (ESSP), which is defined as

$$\text{ESSP} = \frac{\mathcal{L}(\{(z', y)\})}{1 + \text{Stdev}} \tag{7}$$

where $\mathcal{L}$ is the Dice or IoU metric computed between the predictions $z'$ and the ground truth $y$, and Stdev is the standard deviation of that metric across the groups of the sensitive attribute. The resulting ES-Dice and ES-IoU metrics consider both segmentation performance and fairness across all groups, whereas the conventional Overall Dice and IoU metrics only assess segmentation performance.
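
A minimal sketch of Eq. (7) follows. It assumes that Stdev is the sample standard deviation of the per-group scores; under that assumption, the TransUNet Cup values reported for the gender attribute (overall Dice 0.8481, male 0.8458, female 0.8513) give approximately 0.8448, which agrees with the reported ES-Dice. The function name is hypothetical.

```python
from statistics import stdev


def essp(overall_score, group_scores):
    """Equity-scaled score: overall metric divided by (1 + std of per-group metrics)."""
    # statistics.stdev is the sample standard deviation (an assumption about Eq. 7).
    return overall_score / (1.0 + stdev(group_scores))


# TransUNet, Cup, gender attribute: overall Dice and per-group Dice from the paper.
es_dice = essp(0.8481, [0.8458, 0.8513])
print(round(es_dice, 4))
```

When all groups score the same, the denominator is 1 and ESSP equals the overall metric; the wider the gap between groups, the more the score is discounted.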

Table 2: TransUNet segmentation performance of Optic Cup and Rim (Sensitive attribute = Race)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Asian Dice↑ | Black Dice↑ | White Dice↑ | Asian IoU↑ | Black IoU↑ | White IoU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cup | TransUNet | 0.8372 | 0.8481 | 0.7409 | 0.7532 | 0.8270 | 0.8489 | 0.8503 | 0.7277 | 0.7576 | 0.7551 |
| Cup | TransUNet+ADV | 0.8325 | 0.8410 | 0.7345 | 0.7432 | 0.8246 | 0.8417 | 0.8426 | 0.7260 | 0.7482 | 0.7440 |
| Cup | TransUNet+GroupDRO | 0.8313 | 0.8442 | 0.7359 | 0.7479 | 0.8197 | 0.8469 | 0.8464 | 0.7232 | 0.7529 | 0.7495 |
| Cup | TransUNet+FairSeg | 0.8350 | 0.8464 | 0.7374 | 0.7497 | 0.8248 | 0.8484 | 0.8484 | 0.7247 | 0.7550 | 0.7513 |
| Cup | Ours (Equal Scale) | 0.8397 | 0.8480 | 0.7441 | 0.7529 | 0.8320 | 0.8483 | 0.8497 | 0.7352 | 0.7572 | 0.7540 |
| Rim | TransUNet | 0.7604 | 0.7927 | 0.6393 | 0.6706 | 0.7457 | 0.7307 | 0.8106 | 0.6160 | 0.5991 | 0.6913 |
| Rim | TransUNet+ADV | 0.7579 | 0.7906 | 0.6371 | 0.6682 | 0.7413 | 0.7286 | 0.8087 | 0.6116 | 0.5982 | 0.6888 |
| Rim | TransUNet+GroupDRO | 0.7564 | 0.7896 | 0.6351 | 0.6674 | 0.7470 | 0.7229 | 0.8080 | 0.6183 | 0.5899 | 0.6887 |
| Rim | TransUNet+FairSeg | 0.7628 | 0.7950 | 0.6410 | 0.6725 | 0.7479 | 0.7325 | 0.8130 | 0.6185 | 0.6020 | 0.6935 |
| Rim | Ours (Equal Scale) | 0.7697 | 0.7999 | 0.6494 | 0.6797 | 0.7565 | 0.7427 | 0.8165 | 0.6279 | 0.6114 | 0.6994 |

Table 3: TransUNet segmentation performance of Optic Cup and Rim (Sensitive attribute = Gender)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Male Dice↑ | Female Dice↑ | Male IoU↑ | Female IoU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Cup | TransUNet | 0.8448 | 0.8481 | 0.7502 | 0.7532 | 0.8458 | 0.8513 | 0.7508 | 0.7564 |
| Cup | TransUNet+ADV | 0.8343 | 0.8345 | 0.7351 | 0.7356 | 0.8344 | 0.8348 | 0.7361 | 0.7350 |
| Cup | TransUNet+GroupDRO | 0.8426 | 0.8478 | 0.7473 | 0.7522 | 0.8441 | 0.8528 | 0.7483 | 0.7575 |
| Cup | TransUNet+FairSeg | 0.8477 | 0.8489 | 0.7502 | 0.7530 | 0.8494 | 0.8514 | 0.7505 | 0.7556 |
| Cup | Ours (Equal Scale) | 0.8461 | 0.8505 | 0.7522 | 0.7564 | 0.8474 | 0.8548 | 0.7531 | 0.7610 |
| Rim | TransUNet | 0.7895 | 0.7927 | 0.6673 | 0.6706 | 0.7951 | 0.7894 | 0.6736 | 0.6665 |
| Rim | TransUNet+ADV | 0.7783 | 0.7852 | 0.6553 | 0.6630 | 0.7905 | 0.7779 | 0.6699 | 0.6534 |
| Rim | TransUNet+GroupDRO | 0.7901 | 0.7917 | 0.6681 | 0.6699 | 0.7930 | 0.7900 | 0.6716 | 0.6677 |
| Rim | TransUNet+FairSeg | 0.7893 | 0.7898 | 0.6698 | 0.6698 | 0.7924 | 0.7932 | 0.6678 | 0.6653 |
| Rim | Ours (Equal Scale) | 0.7945 | 0.7981 | 0.6745 | 0.6780 | 0.8007 | 0.7944 | 0.6811 | 0.6737 |

Table 4: TransUNet segmentation performance of Optic Cup and Rim (Sensitive attribute = Language)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | English Dice↑ | Spanish Dice↑ | Others Dice↑ | English IoU↑ | Spanish IoU↑ | Others IoU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cup | TransUNet | 0.8255 | 0.8481 | 0.7273 | 0.7532 | 0.8469 | 0.8972 | 0.8531 | 0.7516 | 0.8166 | 0.7592 |
| Cup | TransUNet+ADV | 0.8071 | 0.8312 | 0.7056 | 0.7323 | 0.8296 | 0.8833 | 0.8338 | 0.7301 | 0.7990 | 0.7376 |
| Cup | TransUNet+GroupDRO | 0.8231 | 0.8416 | 0.7231 | 0.7442 | 0.8398 | 0.8844 | 0.8571 | 0.7421 | 0.7993 | 0.7605 |
| Cup | TransUNet+FairSeg | 0.8277 | 0.8481 | 0.7289 | 0.7523 | 0.8467 | 0.8934 | 0.8562 | 0.7504 | 0.8109 | 0.7619 |
| Cup | Ours (Equal Scale) | 0.8358 | 0.8497 | 0.7353 | 0.7542 | 0.8479 | 0.8809 | 0.8686 | 0.7519 | 0.8033 | 0.7755 |
| Rim | TransUNet | 0.7721 | 0.7927 | 0.6525 | 0.6706 | 0.7940 | 0.8165 | 0.7633 | 0.6721 | 0.6950 | 0.6398 |
| Rim | TransUNet+ADV | 0.7690 | 0.7884 | 0.6501 | 0.6666 | 0.7903 | 0.7964 | 0.7501 | 0.6687 | 0.6717 | 0.6265 |
| Rim | TransUNet+GroupDRO | 0.7691 | 0.7857 | 0.6468 | 0.6613 | 0.7867 | 0.8057 | 0.7628 | 0.6625 | 0.6800 | 0.6355 |
| Rim | TransUNet+FairSeg | 0.7725 | 0.7898 | 0.6524 | 0.6668 | 0.7909 | 0.8106 | 0.7661 | 0.6680 | 0.6865 | 0.6424 |
| Rim | Ours (Equal Scale) | 0.7783 | 0.7959 | 0.6578 | 0.6743 | 0.7970 | 0.8161 | 0.7711 | 0.6755 | 0.6969 | 0.6469 |

Table 5: TransUNet segmentation performance of Optic Cup and Rim (Sensitive attribute = Ethnicity)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Hispanic Dice↑ | Non-Hispanic Dice↑ | Hispanic IoU↑ | Non-Hispanic IoU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Cup | TransUNet | 0.8339 | 0.8481 | 0.7366 | 0.7532 | 0.8463 | 0.8704 | 0.7508 | 0.7826 |
| Cup | TransUNet+ADV | 0.8171 | 0.8320 | 0.7149 | 0.7315 | 0.8304 | 0.8561 | 0.7294 | 0.7622 |
| Cup | TransUNet+GroupDRO | 0.8376 | 0.8482 | 0.7406 | 0.7526 | 0.8468 | 0.8648 | 0.7507 | 0.7735 |
| Cup | TransUNet+FairSeg | 0.8388 | 0.8483 | 0.7412 | 0.7542 | 0.8501 | 0.8661 | 0.7515 | 0.7764 |
| Cup | Ours (Equal Scale) | 0.8439 | 0.8462 | 0.7402 | 0.7500 | 0.8485 | 0.8448 | 0.7664 | 0.7477 |
| Rim | TransUNet | 0.7848 | 0.7927 | 0.6650 | 0.6706 | 0.7914 | 0.8057 | 0.6695 | 0.6815 |
| Rim | TransUNet+ADV | 0.7793 | 0.7841 | 0.6570 | 0.6602 | 0.7829 | 0.7915 | 0.6590 | 0.6658 |
| Rim | TransUNet+GroupDRO | 0.7924 | 0.7943 | 0.6694 | 0.6733 | 0.7936 | 0.7901 | 0.6728 | 0.6646 |
| Rim | TransUNet+FairSeg | 0.7884 | 0.7939 | 0.6710 | 0.6754 | 0.7943 | 0.8040 | 0.6697 | 0.6789 |
| Rim | Ours (Equal Scale) | 0.7886 | 0.7902 | 0.6666 | 0.6670 | 0.7917 | 0.7888 | 0.6649 | 0.6657 |

Results. We compare our Equal Scale Data Combination method against several state-of-the-art fairness-learning approaches, including ADV [16], GroupDRO [21], and FairSeg [28], across four sensitive attributes. The detailed results for TransUNet are presented in Tables 2 to 5. Due to space limitations, the results of SAMed are included in the supplementary materials. From the perspective of racial fairness, Tab. 2 highlights the effectiveness of our Equal Scale method, which achieves the highest ES-Dice in both the Cup and Rim areas across the racial groups (Asian, Black, and White), with scores of 0.8397 and 0.7697, respectively. The ES-IoU metric also supports our method, indicating that it achieves both accurate and equitable segmentation. Tab. 3, Tab. 4 and Tab. 5 also demonstrate its capability to enhance fairness metrics (ES-Dice and ES-IoU) and segmentation performance (Dice and IoU).

4 Conclusion

In this study, we analyze fairness within the context of medical image segmentation and address the challenge of data imbalance through the use of synthetic data. We present a novel Point-Image Diffusion method tailored for synthesizing SLO fundus images, which significantly outperforms existing techniques in this domain. By incorporating both synthetic and real data during the training phase utilizing the Equal Scale method, we achieve a comprehensive improvement in both accuracy and fairness across various sensitive attributes.

References

  • [1] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International conference on machine learning. pp. 214–223. PMLR (2017)
  • [2] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
  • [3] Dalmaz, O., Yurt, M., Çukur, T.: Resvit: Residual vision transformers for multimodal medical image synthesis. IEEE Transactions on Medical Imaging 41(10), 2598–2614 (2022)
  • [4] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  • [5] Friedrich, P., Wolleb, J., Bieder, F., Durrer, A., Cattin, P.C.: Wdm: 3d wavelet diffusion models for high-resolution medical image synthesis. arXiv preprint arXiv:2402.19043 (2024)
  • [6] Gao, H.a., Gao, M., Li, J., Li, W., Zhi, R., Tang, H., Zhao, H.: Scp-diff: Photo-realistic semantic image synthesis with spatial-categorical joint prior. arXiv preprint arXiv:2403.09638 (2024)
  • [7] Gao, H.a., Tian, B., Li, P., Chen, X., Zhao, H., Zhou, G., Chen, Y., Zha, H.: From semi-supervised to omni-supervised room layout estimation using point clouds. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 2803–2810. IEEE (2023)
  • [8] Gao, H.a., Tian, B., Li, P., Zhao, H., Zhou, G.: Dqs3d: Densely-matched quantization-aware semi-supervised 3d detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21905–21915 (2023)
  • [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural information processing systems 27 (2014)
  • [10] Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Advances in neural information processing systems 29 (2016)
  • [11] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
  • [12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  • [13] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
  • [14] Khader, F., Mueller-Franzes, G., Arasteh, S.T., Han, T., Haarburger, C., Schulze-Hagen, M., Schad, P., Engelhardt, S., Baessler, B., Foersch, S., et al.: Medical diffusion–denoising diffusion probabilistic models for 3d medical image generation. arXiv preprint arXiv:2211.03364 (2022)
  • [15] Luo, S., Hu, W.: Diffusion probabilistic models for 3d point cloud generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2837–2845 (2021)
  • [16] Madras, D., Creager, E., Pitassi, T., Zemel, R.: Learning adversarially fair and transferable representations. In: International Conference on Machine Learning. pp. 3384–3393. PMLR (2018)
  • [17] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2337–2346 (2019)
  • [18] Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., Weinberger, K.Q.: On fairness and calibration. Advances in neural information processing systems 30 (2017)
  • [19] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  • [20] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [21] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731 (2019)
  • [22] Sharma, A., Palaniappan, L.: Improving diversity in medical research. Nature Reviews Disease Primers 7(1), 74 (2021)
  • [23] Shen, J., Lu, S., Qu, R., Zhao, H., Zhang, L., Chang, A., Zhang, Y., Fu, W., Zhang, Z.: A boundary-guided transformer for measuring distance from rectal tumor to anal verge on magnetic resonance images. Patterns 4(4) (2023)
  • [24] Shen, J., Lu, S., Qu, R., Zhao, H., Zhang, Y., Chang, A., Zhang, L., Fu, W., Zhang, Z.: Measuring distance from lowest boundary of rectal tumor to anal verge on ct images using pyramid attention pooling transformer. Computers in Biology and Medicine 155, 106675 (2023)
  • [25] Shepherd, V.: An under-represented and underserved population in trials: methodological, structural, and systemic barriers to the inclusion of adults lacking capacity to consent. Trials 21(1), 1–8 (2020)
  • [26] Sushko, V., Schönfeld, E., Zhang, D., Gall, J., Schiele, B., Khoreva, A.: You only need adversarial supervision for semantic image synthesis. arXiv preprint arXiv:2012.04781 (2020)
  • [27] Tian, B., Liu, M., Gao, H.a., Li, P., Zhao, H., Zhou, G.: Unsupervised road anomaly detection with language anchors. In: 2023 IEEE international conference on robotics and automation (ICRA). pp. 7778–7785. IEEE (2023)
  • [28] Tian, Y., Shi, M., Luo, Y., Kouhana, A., Elze, T., Wang, M.: Fairseg: A large-scale medical image segmentation dataset for fairness learning using segment anything model with fair error-bound scaling. In: International Conference on Learning Representations (ICLR) (2024)
  • [29] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8798–8807 (2018)
  • [30] Yang, L., Xu, X., Kang, B., Shi, Y., Zhao, H.: Freemask: Synthetic images with dense annotations make stronger segmentation models. Advances in Neural Information Processing Systems 36 (2024)
  • [31] Ye, H., Kuen, J., Liu, Q., Lin, Z., Price, B., Xu, D.: Seggen: Supercharging segmentation models with text2mask and mask2img synthesis. arXiv preprint arXiv:2311.03355 (2023)
  • [32] Zemel, R., Wu, Y., Swersky, K., Pitassi, T., Dwork, C.: Learning fair representations. In: International conference on machine learning. pp. 325–333. PMLR (2013)
  • [33] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
  • [34] Zhang, Y., Shen, Z., Jiao, R.: Segment anything model for medical image segmentation: Current applications and future directions. Computers in Biology and Medicine p. 108238 (2024)

Appendix 0.A Point Cloud Diffusion for Generation

The forward diffusion process can be modeled as a Markov chain.

$$q(x^{(1:T)}_{i} \mid x^{(0)}_{i}) = \prod_{t=1}^{T} q(x^{(t)}_{i} \mid x^{(t-1)}_{i}) \tag{8}$$

Training teaches the model the flow from the original shape distribution to a noise distribution, and fits the noise predictor $\theta$ at each step. The generation of point clouds can then be treated as the reverse of the diffusion process.

$$p_{\theta}(x^{(0:T)} \mid z) = p(x^{(T)}) \prod_{t=1}^{T} p_{\theta}(x^{(t-1)} \mid x^{(t)}, z) \tag{9}$$

$$p_{\theta}(x^{(t-1)} \mid x^{(t)}, z) = \mathcal{N}(x^{(t-1)} \mid \mu_{\theta}(x^{(t)}, t, z), \beta_{t} I) \tag{10}$$

where $\mu_{\theta}$ is the estimated mean implemented by a neural network parameterized by $\theta$, and $z$ is the latent encoding the target shape of the point cloud. The starting distribution $p(x^{(T)})$ is set to a standard normal distribution $\mathcal{N}(0, I)$.
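The reverse process in Eqs. (9) and (10) can be sketched as a simple ancestral sampling loop. The code below is an illustrative sketch only, not the implementation used in this work: `toy_mean_fn` is a hypothetical stand-in for the trained network $\mu_{\theta}$, the linear $\beta_t$ schedule and its endpoints are assumptions, and $z$ is treated as a plain 3D vector purely so the example runs without a trained encoder.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear variance schedule beta_1..beta_T (an assumption)."""
    return np.linspace(beta_start, beta_end, num_steps)


def toy_mean_fn(x_t, t, z):
    """Stand-in for mu_theta(x^(t), t, z); a real model would be a trained network.

    It pulls every point slightly toward z, which is treated here as a single
    3D target location for illustration only.
    """
    return x_t + 0.1 * (z - x_t)


def sample_point_cloud(mean_fn, z, num_points, betas, rng):
    """Ancestral sampling following Eqs. (9)-(10)."""
    # x^(T) ~ N(0, I): every point starts as standard Gaussian noise.
    x = rng.standard_normal((num_points, 3))
    # Walk the chain backwards: t = T, T-1, ..., 1.
    for t in range(len(betas), 0, -1):
        mean = mean_fn(x, t, z)
        noise = rng.standard_normal(x.shape)
        # x^(t-1) ~ N(mu_theta(x^(t), t, z), beta_t * I)
        x = mean + np.sqrt(betas[t - 1]) * noise
    return x


rng = np.random.default_rng(0)
betas = linear_beta_schedule(num_steps=100)
z = np.array([1.0, -2.0, 0.5])
cloud = sample_point_cloud(toy_mean_fn, z, num_points=256, betas=betas, rng=rng)
print(cloud.shape)
```

With the toy mean function the sampled points simply cluster around `z`; swapping in a trained $\mu_{\theta}$ and a shape latent would be needed to produce actual optic cup and rim point clouds.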

Appendix 0.B Synthesis Quality Metrics

The MMD-score measures the fidelity of generated samples: it is the mean of the minimum matching distances between generated samples and real samples, and is used to evaluate the quality of the generative model. We define the distance $D$ between image $I_1$ and image $I_2$ as

$$D(I_1, I_2) = \frac{1 - \cos(\theta)}{2} \tag{11}$$

where $\cos(\theta)$ represents the cosine similarity between the two images. The COV-score denotes the proportion of real samples that are matched to at least one generated image. For a generated set $S_g$ and a reference real set $S_r$, the COV-score is

$$\text{COV}(S_g, S_r) = \frac{|\{\arg\min_{I_2 \in S_r} D(I_1, I_2) \mid I_1 \in S_g\}|}{|S_r|} \tag{12}$$
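As an illustration of Eqs. (11) and (12), the sketch below computes both scores with NumPy. Several details are assumptions rather than part of the definitions above: images are flattened to raw pixel vectors before the cosine similarity is taken, the MMD-score is averaged over real images (each matched to its closest generated image), and the function names are hypothetical. Images with all-zero pixels are not handled.

```python
import numpy as np


def pairwise_distance(gen, real):
    """D(I1, I2) = (1 - cos(theta)) / 2 for every generated/real pair (Eq. 11).

    gen has shape (G, ...), real has shape (R, ...). Images are flattened to
    vectors first (an assumption; encoder features could be used instead).
    """
    g = gen.reshape(len(gen), -1).astype(np.float64)
    r = real.reshape(len(real), -1).astype(np.float64)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    cos = np.clip(g @ r.T, -1.0, 1.0)
    return (1.0 - cos) / 2.0  # shape (G, R)


def mmd_score(gen, real):
    """Mean over real images of the distance to the closest generated image
    (the averaging direction is an assumption)."""
    return pairwise_distance(gen, real).min(axis=0).mean()


def cov_score(gen, real):
    """Eq. (12): fraction of real images that are the nearest neighbour of
    at least one generated image."""
    nearest_real = pairwise_distance(gen, real).argmin(axis=1)
    return len(np.unique(nearest_real)) / len(real)


# Toy data: four mutually orthogonal 2x2 "real" images, and three generated
# images that copy real image 0 twice and real image 2 once.
real = np.eye(4).reshape(4, 2, 2)
gen = real[[0, 0, 2]]
print(cov_score(gen, real))  # 0.5: only 2 of the 4 real images are covered
print(mmd_score(gen, real))  # 0.25: two real images at distance 0, two at 0.5
```

The toy example shows why the two scores are reported together: duplicated generated samples keep the matching distance low for the images they copy but leave half of the real set uncovered.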

Appendix 0.C More Experimental Results

Table 6: SAMed segmentation performance of Optic Cup and Rim (Sensitive attribute = Race)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Asian Dice↑ | Black Dice↑ | White Dice↑ | Asian IoU↑ | Black IoU↑ | White IoU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cup | SAMed | 0.8600 | 0.8671 | 0.7729 | 0.7813 | 0.8568 | 0.8730 | 0.8670 | 0.7688 | 0.7905 | 0.7808 |
| Cup | SAMed+ADV | 0.8640 | 0.8698 | 0.7769 | 0.7840 | 0.8590 | 0.8705 | 0.8708 | 0.7709 | 0.7882 | 0.7846 |
| Cup | SAMed+GroupDRO | 0.8634 | 0.8695 | 0.7767 | 0.7838 | 0.8583 | 0.8704 | 0.8706 | 0.7711 | 0.7886 | 0.7842 |
| Cup | SAMed+FairSeg | 0.8617 | 0.8671 | 0.7741 | 0.7808 | 0.8587 | 0.8708 | 0.8672 | 0.7708 | 0.7882 | 0.7804 |
| Cup | Ours | 0.8619 | 0.8660 | 0.7737 | 0.7796 | 0.8606 | 0.8702 | 0.8657 | 0.7744 | 0.7892 | 0.7782 |
| Rim | SAMed | 0.8000 | 0.8291 | 0.6919 | 0.7217 | 0.7890 | 0.7758 | 0.8444 | 0.6743 | 0.6587 | 0.7399 |
| Rim | SAMed+ADV | 0.7935 | 0.8235 | 0.6835 | 0.7138 | 0.7801 | 0.7691 | 0.8395 | 0.6635 | 0.6498 | 0.7325 |
| Rim | SAMed+GroupDRO | 0.8011 | 0.8302 | 0.6930 | 0.7230 | 0.7952 | 0.7748 | 0.8454 | 0.6822 | 0.6568 | 0.7410 |
| Rim | SAMed+FairSeg | 0.8036 | 0.8323 | 0.6963 | 0.7260 | 0.7952 | 0.7789 | 0.8473 | 0.6825 | 0.6620 | 0.7439 |
| Rim | Ours | 0.8041 | 0.8311 | 0.6966 | 0.7242 | 0.7968 | 0.7808 | 0.8452 | 0.6840 | 0.6646 | 0.7409 |

Table 7: SAMed segmentation performance of Optic Cup and Rim (Sensitive attribute = Gender)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Male Dice↑ | Female Dice↑ | Male IoU↑ | Female IoU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Cup | SAMed | 0.8637 | 0.8671 | 0.7773 | 0.7813 | 0.8647 | 0.8703 | 0.7783 | 0.7855 |
| Cup | SAMed+ADV | 0.8658 | 0.8667 | 0.7787 | 0.7803 | 0.8661 | 0.8675 | 0.7791 | 0.7820 |
| Cup | SAMed+GroupDRO | 0.8670 | 0.8671 | 0.7803 | 0.7808 | 0.8672 | 0.8670 | 0.7804 | 0.7814 |
| Cup | SAMed+FairSeg | 0.8678 | 0.8702 | 0.7807 | 0.7823 | 0.8718 | 0.8756 | 0.7851 | 0.7879 |
| Cup | Ours | 0.8676 | 0.8698 | 0.7809 | 0.7844 | 0.8683 | 0.8718 | 0.7817 | 0.7881 |
| Rim | SAMed | 0.8251 | 0.8291 | 0.7175 | 0.7217 | 0.8319 | 0.8252 | 0.7252 | 0.7169 |
| Rim | SAMed+ADV | 0.8263 | 0.8309 | 0.7188 | 0.7236 | 0.8342 | 0.8263 | 0.7276 | 0.7181 |
| Rim | SAMed+GroupDRO | 0.8274 | 0.8320 | 0.7205 | 0.7253 | 0.8353 | 0.8274 | 0.7292 | 0.7198 |
| Rim | SAMed+FairSeg | 0.8289 | 0.8318 | 0.7227 | 0.7253 | 0.8338 | 0.8289 | 0.7274 | 0.7223 |
| Rim | Ours | 0.8221 | 0.8265 | 0.7132 | 0.7177 | 0.8297 | 0.8221 | 0.7214 | 0.7125 |

Table 8: SAMed segmentation performance of Optic Cup and Rim (Sensitive attribute = Language)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | English Dice↑ | Spanish Dice↑ | Others Dice↑ | English IoU↑ | Spanish IoU↑ | Others IoU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cup | SAMed | 0.8490 | 0.8671 | 0.7603 | 0.7813 | 0.8652 | 0.9077 | 0.8838 | 0.7791 | 0.8338 | 0.8001 |
| Cup | SAMed+ADV | 0.8485 | 0.8686 | 0.7586 | 0.7830 | 0.8668 | 0.9131 | 0.8820 | 0.7808 | 0.8432 | 0.7982 |
| Cup | SAMed+GroupDRO | 0.8530 | 0.8702 | 0.7640 | 0.7847 | 0.8684 | 0.9085 | 0.8849 | 0.7825 | 0.8360 | 0.8019 |
| Cup | SAMed+FairSeg | 0.8527 | 0.8684 | 0.7646 | 0.7826 | 0.8670 | 0.9034 | 0.8794 | 0.7810 | 0.8268 | 0.7937 |
| Cup | Ours | 0.8518 | 0.8676 | 0.7624 | 0.7810 | 0.8659 | 0.9029 | 0.8815 | 0.7789 | 0.8271 | 0.7968 |
| Rim | SAMed | 0.8070 | 0.8291 | 0.7006 | 0.7217 | 0.8305 | 0.8534 | 0.7989 | 0.7234 | 0.7468 | 0.6871 |
| Rim | SAMed+ADV | 0.8087 | 0.8295 | 0.7019 | 0.7217 | 0.8307 | 0.8528 | 0.8015 | 0.7231 | 0.7463 | 0.6900 |
| Rim | SAMed+GroupDRO | 0.8136 | 0.8311 | 0.7075 | 0.7239 | 0.8322 | 0.8493 | 0.8065 | 0.7253 | 0.7411 | 0.6954 |
| Rim | SAMed+FairSeg | 0.8100 | 0.8313 | 0.7038 | 0.7244 | 0.8328 | 0.8511 | 0.7992 | 0.7263 | 0.7436 | 0.6865 |
| Rim | Ours | 0.8036 | 0.8245 | 0.6944 | 0.7145 | 0.8258 | 0.8472 | 0.7955 | 0.7160 | 0.7377 | 0.6805 |

Table 9: SAMed segmentation performance of Optic Cup and Rim (Sensitive attribute = Ethnicity)

| Region | Method | Overall ES-Dice↑ | Overall Dice↑ | Overall ES-IoU↑ | Overall IoU↑ | Hispanic Dice↑ | Non-Hispanic Dice↑ | Hispanic IoU↑ | Non-Hispanic IoU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Cup | SAMed | 0.8519 | 0.8671 | 0.7645 | 0.7813 | 0.8653 | 0.8904 | 0.7790 | 0.8100 |
| Cup | SAMed+ADV | 0.8544 | 0.8678 | 0.7657 | 0.7814 | 0.8661 | 0.8883 | 0.7791 | 0.8080 |
| Cup | SAMed+GroupDRO | 0.8594 | 0.8698 | 0.7718 | 0.7840 | 0.8682 | 0.8855 | 0.7819 | 0.8044 |
| Cup | SAMed+FairSeg | 0.8611 | 0.8685 | 0.7753 | 0.7845 | 0.8704 | 0.8824 | 0.7904 | 0.8070 |
| Cup | Ours | 0.8625 | 0.8664 | 0.7730 | 0.7793 | 0.8714 | 0.8650 | 0.7889 | 0.7775 |
| Rim | SAMed | 0.8221 | 0.8291 | 0.7164 | 0.7217 | 0.8277 | 0.8397 | 0.7203 | 0.7307 |
| Rim | SAMed+ADV | 0.8260 | 0.8323 | 0.7206 | 0.7257 | 0.8308 | 0.8416 | 0.7241 | 0.7342 |
| Rim | SAMed+GroupDRO | 0.8237 | 0.8299 | 0.7178 | 0.7224 | 0.8284 | 0.8390 | 0.7208 | 0.7298 |
| Rim | SAMed+FairSeg | 0.8296 | 0.8331 | 0.7215 | 0.7242 | 0.8349 | 0.8408 | 0.7278 | 0.7329 |
| Rim | Ours | 0.8186 | 0.8234 | 0.7112 | 0.7136 | 0.8306 | 0.8222 | 0.7171 | 0.7124 |