License: CC BY-NC-SA 4.0
arXiv:2403.15389v1 [cs.CV] 22 Mar 2024

DiffusionMTL: Learning Multi-Task Denoising Diffusion Model
from Partially Annotated Data

Hanrong Ye and Dan Xu
Department of Computer Science and Engineering, HKUST
Clear Water Bay, Kowloon, Hong Kong
{hyeae, danxu}@cse.ust.hk
Abstract

Recently, there has been increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. The missing task labels in training lead to low-quality and noisy predictions, as can be observed from state-of-the-art methods. To tackle this issue, we reformulate partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework coined DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps, and outperform the state-of-the-art methods on three challenging multi-task benchmarks, under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/. The paper is accepted by CVPR 2024.

1 Introduction

Figure 1: Motivating illustration of the proposed DiffusionMTL for multi-task partially supervised dense prediction. The model denoises the manually decayed multi-task prediction or feature maps (denoted as $\{\mathbf{X}^{0}_{S},...,\mathbf{X}^{T}_{S}\}$, where $T$ and $S$ are the numbers of tasks and steps, respectively) step by step, and obtains the denoised outputs $\{\mathbf{X}^{0}_{0},...,\mathbf{X}^{T}_{0}\}$. The denoising process is guided by the designed multi-task condition feature $\mathbf{F}_{cond}$.

Multi-task learning for dense scene understanding [1, 2, 3, 4] is an important research topic that has recently gained a lot of attention from computer vision researchers. It aims at jointly learning multiple scene-related dense prediction tasks, including semantic segmentation, surface normal estimation, depth estimation, etc. Multi-task learning has two advantages over traditional single-task learning. On the one hand, multi-task models are naturally more efficient than single-task models with similar structures because different tasks can share some network modules. On the other hand, different tasks are able to help each other and improve overall performance by sharing information through cross-task consistency [5]. However, annotating a real-world multi-task learning dataset at the pixel level is a daunting task. As an alternative, collecting data annotated for different tasks and using them to train a multi-task model is a much more feasible approach. This motivates recent work [6] that defines an important new problem known as “Multi-Task Partially Supervised Learning (MTPSL)”, where each training sample contains labels for a subset of the tasks, rather than all tasks. Because each training sample lacks labels for some of the tasks, the partially supervised multi-task learning problem is more challenging than the fully supervised one. To handle this problem, previous state-of-the-art models [6] focus on improving label efficiency by enforcing cross-task consistency. They train an additional network to construct a joint feature space for each task pair, which helps improve the multi-task optimization process and demonstrates promising multi-task performance under MTPSL. Despite their success in improving model performance, the sparsity of training labels in MTPSL still inevitably leads to noisy prediction maps, which can be observed from previous state-of-the-art models, as shown in Figure 7. Therefore, there is a need for a new methodology that effectively denoises the noisy multi-task dense predictions to improve the multi-task prediction quality.

To address the above-mentioned problem, we propose a novel multi-task denoising diffusion framework that can effectively remove noise from the dense predictions and rectify multi-task prediction maps. We formulate the multi-task dense prediction problem as a joint pixel-level denoising and generation process, and propose a new multi-task model coined as “DiffusionMTL”. DiffusionMTL learns to denoise noisy multi-task predictions with the help of diffusion models [7], which are particularly effective in recovering data distribution from noisy input. It jointly performs the diffusion and denoising processes to discover potential noisy distributions of the multi-task prediction maps, and learns to rectify the prediction maps. We further present two distinct diffusion mechanisms: Prediction Diffusion and Feature Diffusion. Prediction Diffusion learns to remove noise from the multi-task prediction maps, while Feature Diffusion learns to refine the multi-task feature maps. Unlike typical diffusion models used for image synthesis [8], our denoising network is designed to achieve two objectives simultaneously. Firstly, it must reverse the Markovian noise diffusion process, i.e., remove the manually added noise from the input maps. Secondly, it is encouraged to generate higher-quality multi-task predictions from the noisy input, thereby improving overall multi-task prediction performance. Furthermore, to exploit multi-task consistency in the denoising process, we design a Multi-Task Conditioning mechanism for our DiffusionMTL model. This mechanism utilizes the prediction maps generated by the decoders of all the tasks to effectively facilitate the denoising process of a target task. Meanwhile, the outputs of the unlabeled tasks also receive supervision signals from the ground-truth labels of other tasks, allowing our DiffusionMTL to not only enhance the denoising performance of the labeled tasks but also facilitate the learning of unlabeled tasks.

To evaluate the effectiveness of our approach for multi-task partially supervised learning, we have conducted extensive experiments on three challenging partially-annotated multi-task datasets, namely PASCAL, NYUD, and Cityscapes. Both quantitative and qualitative results demonstrate the effectiveness of the proposed DiffusionMTL model, and show that DiffusionMTL outperforms the current state-of-the-art method by a large margin, using the same backbone and fewer model parameters.

In summary, the contribution of this paper is threefold:

  • We propose the first multi-task denoising diffusion framework for the partially labeled multi-task dense prediction problem. The innovative framework reformulates multi-task dense prediction as a joint pixel-level diffusion and denoising process, which empowers us to generate rectified higher-quality multi-task predictions.

  • We develop a novel Multi-Task Denoising Diffusion Network specifically designed to address the issue of noise in initial prediction maps. An effective Multi-Task Conditioning mechanism is designed in our diffusion model to enhance the denoising performance. We further devise two effective diffusion mechanisms, namely Prediction Diffusion and Feature Diffusion, for refining task signals in prediction and feature spaces separately.

  • Extensive experiments have been conducted on three prevalent partial-labeling multi-task benchmarks under two different settings, which clearly validate the effectiveness of our proposal. Our method demonstrates significant performance improvements compared to the previous state-of-the-art methods.

Figure 2: Illustration of the proposed DiffusionMTL (Prediction Diffusion) framework for the MTPSL setting. DiffusionMTL first uses an initial backbone model for producing starter prediction maps for all tasks. To denoise the initial prediction maps and generate rectified maps, we propose a Multi-Task Denoising Diffusion Network (MTDNet). MTDNet involves a diffusion process and a denoising process. During training, the initial prediction map of the labeled target task $\mathcal{T}$ is gradually degraded by applying noise, resulting in the noisy prediction map $\mathbf{P}_{S}^{\mathcal{T}}$. Then, we utilize a Multi-Task Conditioned Denoiser (referred to as the “Denoiser”) to denoise $\mathbf{P}_{S}^{\mathcal{T}}$ iteratively over $S$ steps, resulting in a clean prediction map $\mathbf{P}_{0}^{\mathcal{T}}$ that is supervised by the ground-truth label. For better learning of unlabeled tasks, we propose a Multi-Task Conditioning mechanism in the denoising process to stimulate information sharing across different tasks. During inference, the diffusion and denoising processes are applied to all tasks to produce denoised multi-task prediction maps.
Figure 3: Illustration of the proposed DiffusionMTL (Feature Diffusion), which conducts noise decay and denoising on initial feature maps $\mathbf{F}_{init}^{\mathcal{T}}$. The denoised feature maps $\mathbf{F}_{0}^{\mathcal{T}}$ are projected to the final prediction map $\mathbf{P}_{0}^{\mathcal{T}}$ with a task head after the denoising.

2 Related Work

Multi-Task Dense Scene Understanding with Partially Annotated Data. Multi-task learning (MTL) for dense scene understanding has been widely studied in recent years [5, 1, 3, 9, 10, 11, 12, 13, 14, 15, 16]. By learning several tasks together, MTL enhances the computational efficiency of both training and inference compared to single-task models while achieving better performance [17, 18, 4]. To improve the performance of multi-task learning, some researchers have focused on improving the optimization process of MTL by designing loss functions [6, 19, 20, 2, 21] and manipulating gradients [22, 23, 24, 25, 26], while other researchers work on designing powerful multi-task model architectures [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. It is worth noting that the aforementioned methods are mainly designed for fully-supervised multi-task learning, where the labels of all tasks are assumed to be given for each training image. However, in real-world scenarios, it is not always feasible to obtain labels for all tasks, and we may have data with labels for only some of the tasks. To address this issue, a new problem named Multi-Task Partially Supervised Learning (MTPSL) has been defined by [6]. In MTPSL, the training samples are only partially annotated for the tasks, which poses new challenges to multi-task learning due to the sparsity of labels in the training data. To meet the challenge, XTC [6] has been proposed to better leverage partial annotations by improving label efficiency. It maps the label spaces of different tasks into one joint feature space and utilizes cross-task consistency to learn tasks without labels for each training sample. Although this approach has shown promising results, it still inevitably suffers from noisy predictions because the model is under-trained with a limited number of ground-truth labels. To directly tackle the noisy prediction problem, our proposal takes a distinct approach by designing a novel multi-task denoising framework to improve the quality of multi-task prediction maps.

Figure 4: Pipeline of a single step $s$ in the denoising process of DiffusionMTL (Prediction Diffusion). Multi-Task Conditioning: The initial prediction maps for all tasks are projected to task-specific features and then stacked. The stacked features are then processed with a $3\times 3$ convolution to reduce the channel dimension, resulting in a Multi-Task Condition Feature $\mathbf{F}_{cond}$ which is shared across all tasks. Multi-Task Conditioned Denoiser: The denoiser consists of several cross-attention transformer blocks, which learn to denoise input conditioned on $\mathbf{F}_{cond}$. For its input, we perform a $3\times 3$ convolution on the noisy prediction map $\mathbf{P}_{s}^{\mathcal{T}}$ and combine the output with the step embedding, obtaining a task embedding $\mathbf{E}_{s}^{\mathcal{T}}$. The denoiser takes $\mathbf{F}_{cond}$ as query input and $\mathbf{E}_{s}$ as key and value inputs. We use a task-specific head to obtain the denoised prediction map $\mathbf{P}_{s-1}^{\mathcal{T}}$, which is the input of the next denoising step $s-1$.

Diffusion Models. Diffusion models [38, 7] are a class of generative models that have been widely used for image synthesis tasks, and have achieved state-of-the-art performance on several benchmarks [7, 39, 8, 40, 41]. However, adapting diffusion models for multi-task dense prediction is not straightforward. Although some attempts have been made to apply diffusion models to single-task deterministic problems including image classification [42], segmentation [43, 44, 45, 46, 47] and detection [48], they are not suitable for multi-task dense scene understanding. In this paper, we propose a novel multi-task diffusion model that can denoise noisy prediction maps for multiple tasks and obtain finer results under the multi-task partially-supervised setting. Our approach represents an exploratory advancement in diffusion models and has the potential to inspire the design of future diffusion models for deterministic tasks.

3 The Proposed DiffusionMTL Approach

In this section, we will introduce the details of our proposed multi-task denoising diffusion framework, DiffusionMTL, as illustrated in Fig. 2. DiffusionMTL has two steps: (i) First, an initial backbone model generates preliminary prediction maps for multiple dense scene understanding tasks. (ii) Second, a proposed Multi-Task Denoising Diffusion Network (MTDNet) takes in the noisy initial multi-task prediction maps and produces refined prediction results. These two parts are trained together in an end-to-end manner with partially annotated data.

3.1 Initial Backbone Model

We adopt a classic encoder-decoder structure for multi-task dense prediction [4, 17, 32]. The initial backbone model utilizes a task-shared encoder $f_{enc}$, which accepts an input image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$ (where $H$ and $W$ represent height and width, respectively) and projects it to obtain a multi-channel backbone feature map $\mathbf{F}_{backbone}\in\mathbb{R}^{H^{\prime}\times W^{\prime}\times C}$. The backbone feature map has a height of $H^{\prime}$, a width of $W^{\prime}$, and $C$ channels. It is shared by all the tasks. Then, to generate task-specific feature maps for $T$ tasks, we adopt a series of task-specific decoders $\{f_{dec}^{1},f_{dec}^{2},...,f_{dec}^{T}\}$ with identical network structures and different network parameters. The generated initial task feature maps from the decoders are denoted as $\{\mathbf{F}_{init}^{1},\mathbf{F}_{init}^{2},...,\mathbf{F}_{init}^{T}\}$. For the $t$-th task, we compute the task-specific initial feature map $\mathbf{F}_{init}^{t}$ as:

$$\mathbf{F}_{init}^{t}=f_{dec}^{t}(f_{enc}(\mathbf{I})). \tag{1}$$

To compute an initial dense prediction map $\mathbf{P}_{init}^{t}$ for the $t$-th task, we apply a task-specific $1\times 1$ convolution $f_{pred}^{t}$ on the corresponding task feature map $\mathbf{F}_{init}^{t}$:

$$\mathbf{P}_{init}^{t}=f_{pred}^{t}(\mathbf{F}_{init}^{t}). \tag{2}$$

In this way, we obtain the $T$ initial prediction maps of all $T$ tasks. The initial prediction maps are noisy, as we can observe from Fig. 2. We aim to rectify the noisy multi-task prediction maps with the following MTDNet.
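As a rough illustration of Eqs. (1) and (2), the sketch below wires a shared encoder to $T$ task-specific decoders and $1\times 1$ prediction heads using NumPy. Everything concrete here is an assumption made for illustration: the function names (`conv1x1`, `make_backbone`, `initial_predictions`), the random untrained weights, the strided downsampling that stands in for $f_{enc}$, the single-layer decoders, and the channel counts. It only shows the data flow and tensor shapes, not the actual backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map over channels.

    x: (H, W, C_in), weight: (C_in, C_out) -> (H, W, C_out)
    """
    return x @ weight

def make_backbone(num_tasks, c_backbone, c_task, out_channels):
    """Random stand-in weights (hypothetical and untrained)."""
    return {
        "enc": rng.normal(size=(3, c_backbone)) * 0.1,
        "dec": [rng.normal(size=(c_backbone, c_task)) * 0.1
                for _ in range(num_tasks)],
        "pred": [rng.normal(size=(c_task, c)) * 0.1 for c in out_channels],
    }

def initial_predictions(image, params, stride=4):
    # Toy f_enc: subsample pixels, project to C channels, apply ReLU.
    f_backbone = np.maximum(conv1x1(image[::stride, ::stride], params["enc"]), 0.0)
    f_init, p_init = [], []
    for w_dec, w_pred in zip(params["dec"], params["pred"]):
        # Eq. (1): task-specific decoder on the shared backbone feature.
        f_t = np.maximum(conv1x1(f_backbone, w_dec), 0.0)
        f_init.append(f_t)
        # Eq. (2): task-specific 1x1 convolution gives the initial prediction.
        p_init.append(conv1x1(f_t, w_pred))
    return f_backbone, f_init, p_init

# Three illustrative tasks with 13, 1 and 3 output channels.
image = rng.random((32, 48, 3))
params = make_backbone(num_tasks=3, c_backbone=16, c_task=8, out_channels=[13, 1, 3])
f_backbone, f_init, p_init = initial_predictions(image, params)
print(f_backbone.shape, [p.shape for p in p_init])
```

The decoders share one structure but hold separate weights, mirroring the "identical network structures and different network parameters" description; only the encoder output is shared across tasks.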

3.2 Multi-Task Denoising Diffusion Network

In this paper, we put forward a novel diffusion model, named Multi-Task Denoising Diffusion Network (MTDNet), for denoising the aforementioned noisy prediction maps. To achieve this goal, we design two orthogonal diffusion mechanisms in our unified MTDNet, focusing on different signal domains: (i) Prediction Diffusion and (ii) Feature Diffusion. These mechanisms differ in terms of the signal space in which the diffusion model is applied. Feature Diffusion refines the task-specific features within a high-dimensional latent space, while Prediction Diffusion directly improves the initial task predictions in the output space. Feature Diffusion facilitates a comprehensive improvement of high-level visual information within an expanded latent space, while Prediction Diffusion demonstrates effective denoising capabilities along with better computational efficiency. More details about their differences will be provided when describing the different components of MTDNet. As shown in Fig. 2 for Prediction Diffusion and Fig. 3 for Feature Diffusion, we follow the DDPM paradigm [7], which is separated into two processes: a diffusion process and a denoising process. During the diffusion process, we incrementally degrade the information in the initial prediction maps by applying noise with a Markov chain. In the denoising process, we introduce a novel denoising network that is trained to generate clean prediction maps from the degraded ones in an iterative manner. Without loss of generality, we elaborate on the one-label setting of multi-task partially supervised learning, where we assume that the current training sample has a label for only one task (task $\mathcal{T}$). We refer to this labeled task as the “target task”.
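To make the one-label setting concrete, the following minimal sketch assigns a single labeled target task to each training sample and evaluates a supervised loss on that task only. The uniform random assignment, the L1 loss, and the helper names `assign_one_label` and `partially_supervised_loss` are assumptions chosen for illustration; the actual label split and task-specific losses may differ.

```python
import numpy as np

def assign_one_label(num_samples, num_tasks, seed=0):
    """Pick one labeled (target) task index per sample, uniformly at random.

    Uniform sampling is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_tasks, size=num_samples)

def partially_supervised_loss(predictions, labels, target_task):
    """Mean absolute error on the target task only.

    predictions, labels: lists with one array per task; labels[t] is None
    for unlabeled tasks. L1 is a placeholder for task-specific losses.
    """
    label = labels[target_task]
    if label is None:
        raise ValueError("the target task must have a label")
    return float(np.mean(np.abs(predictions[target_task] - label)))

# One sample with two tasks, where only task 1 is annotated.
predictions = [np.ones((4, 4, 1)), np.zeros((4, 4, 3))]
labels = [None, np.full((4, 4, 3), 0.5)]
loss = partially_supervised_loss(predictions, labels, target_task=1)
```

In this sketch the unlabeled task contributes no direct loss term; any supervision it receives has to come indirectly, for example through components shared with the labeled task.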

3.2.1 Diffusion Process

In the diffusion process (or “forward process”) [7], we construct a fixed Markov chain with a total length of $S$ steps. For the target task $\mathcal{T}$, Gaussian noise is gradually applied to the initial map $\mathbf{X}_{init}^{\mathcal{T}}$ in a step-by-step manner. Here $\mathbf{X}_{init}^{\mathcal{T}}$ is the initial prediction map $\mathbf{P}_{init}^{\mathcal{T}}$ (in Prediction Diffusion) or the initial task feature map $\mathbf{F}_{init}^{\mathcal{T}}$ (in Feature Diffusion). Denoting the decayed map of target task $\mathcal{T}$ at step $s$ as $\mathbf{X}_{s}^{\mathcal{T}}$, the diffusion process $q$ can be formulated as:

$$q(\mathbf{X}_{s}^{\mathcal{T}}|\mathbf{X}_{init}^{\mathcal{T}})=\mathcal{N}(\mathbf{X}_{s}^{\mathcal{T}}|\sqrt{\bar{\alpha}_{s}}\mathbf{X}_{init}^{\mathcal{T}},(1-\bar{\alpha}_{s})\mathbf{I}), \tag{3}$$

where $\{\bar{\alpha}_{s},s\in 1,2,...,S\}$ are hyperparameters. In practice, we can directly compute the final decayed map. Through mathematical derivation, the decayed map of target task $\mathcal{T}$ at the final step $S$ can be formulated as $\mathbf{X}_{S}^{\mathcal{T}}=\sqrt{\bar{\alpha}_{S}}\mathbf{X}_{init}^{\mathcal{T}}+\sqrt{1-\bar{\alpha}_{S}}\epsilon$, where $\epsilon\sim\mathcal{N}(0,\mathbf{I})$. For more theoretical details, please refer to [7].
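The closed-form sampling in Eq. (3) can be sketched in a few lines of NumPy. The linear $\beta$ schedule and its endpoint values below are assumptions borrowed from common DDPM setups rather than settings stated here, and `linear_alpha_bar` and `diffuse` are hypothetical helper names.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_s for s = 1..S under a linear beta schedule.

    The schedule and its endpoints are illustrative assumptions.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse(x_init, alpha_bar, step, rng):
    """Sample X_s from q(X_s | X_init) in closed form; step is 1-indexed."""
    a = alpha_bar[step - 1]
    eps = rng.standard_normal(x_init.shape)  # epsilon ~ N(0, I)
    x_s = np.sqrt(a) * x_init + np.sqrt(1.0 - a) * eps
    return x_s, eps

# Decay a toy initial map directly to the final step S.
S = 10
alpha_bar = linear_alpha_bar(S)
rng = np.random.default_rng(0)
x_init = np.ones((4, 4, 2))
x_S, eps = diffuse(x_init, alpha_bar, step=S, rng=rng)
```

Because $\bar{\alpha}_{s}$ decreases with $s$, the weight on the initial map shrinks and the weight on the noise grows as the chain advances, which is why the final decayed map can be drawn in one shot instead of iterating over all $S$ steps.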

Pseudocode 1 DiffusionMTL under one-label setting
function DiffusionMTL(version, mode, $\mathbf{I}$, $\mathbf{L}$, $\mathcal{T}$, $S$, $T$)
    Input: version $\in$ {‘Prediction Diffusion’, ‘Feature Diffusion’}, mode $\in$ {‘train’, ‘infer’}, input image $\mathbf{I}$, label map $\mathbf{L}$, target task $\mathcal{T}$, diffusion steps $S$, number of tasks $T$
    Output: Training loss or denoised prediction map $\mathbf{P}^{\mathcal{T}}_{0}$
    $\mathbf{F}_{backbone}\leftarrow f_{enc}(\mathbf{I})$
    for $t\leftarrow 1,2,\dots,T$ do
        $\mathbf{F}_{init}^{t}\leftarrow f_{dec}^{t}(\mathbf{F}_{backbone})$  ▷ Compute initial task features
        $\mathbf{P}_{init}^{t}\leftarrow f_{pred}^{t}(\mathbf{F}_{init}^{t})$  ▷ Compute initial prediction maps
    end for
    if version $=$ ‘Prediction Diffusion’ then
        $\mathbf{X}_{init}^{\mathcal{T}}\leftarrow\mathbf{P}_{init}^{\mathcal{T}}$
        $\mathbf{F}_{cond}\leftarrow f_{cond}(\mathbf{P}_{init}^{1},\mathbf{P}_{init}^{2},\dots,\mathbf{P}_{init}^{T})$
    else if version $=$ ‘Feature Diffusion’ then
        $\mathbf{X}_{init}^{\mathcal{T}}\leftarrow\mathbf{F}_{init}^{\mathcal{T}}$
        $\mathbf{F}_{cond}\leftarrow f_{cond}(\mathbf{F}_{init}^{1},\mathbf{F}_{init}^{2},\dots,\mathbf{F}_{init}^{T})$
    end if
    $\epsilon\sim\mathcal{N}(0,\mathbf{I})$  ▷ Sample noise
    $\mathbf{X}_{S}^{\mathcal{T}}\leftarrow\sqrt{\bar{\alpha}_{S}}\mathbf{X}_{init}^{\mathcal{T}}+\sqrt{1-\bar{\alpha}_{S}}\epsilon$  ▷ Diffusion process
    for $s\leftarrow S,S-1,\dots,1$ do
        $\mathbf{X}^{\mathcal{T}}_{s-1}\leftarrow\mathrm{Denoiser}(\mathbf{X}^{\mathcal{T}}_{s},s,\mathbf{F}_{cond})$  ▷ Denoising step
    end for
    if version $=$ ‘Prediction Diffusion’ then
        $\mathbf{P}_{0}^{\mathcal{T}}\leftarrow\mathbf{X}_{0}^{\mathcal{T}}$
23:    else if version === ‘Feature Diffusion’ then
24:         𝐏0𝒯fhead𝒯(𝐗0𝒯)superscriptsubscript𝐏0𝒯superscriptsubscript𝑓𝑒𝑎𝑑𝒯superscriptsubscript𝐗0𝒯{\mathbf{P}}_{0}^{\mathcal{T}}\leftarrow f_{head}^{\mathcal{T}}({\mathbf{X}}_{% 0}^{\mathcal{T}})bold_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT ← italic_f start_POSTSUBSCRIPT italic_h italic_e italic_a italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT ( bold_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT ) \triangleright Final task head
25:    end if
26:    if mode === ‘train’ then
27:         return compute_loss(𝐏0𝒯,𝐋)compute_losssubscriptsuperscript𝐏𝒯0𝐋\text{compute\_loss}({\mathbf{P}}^{\mathcal{T}}_{0},{\mathbf{L}})compute_loss ( bold_P start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_L ) \triangleright Compute loss with available label of target task
28:    else if mode === ‘infer’ then
29:         return 𝐏0𝒯subscriptsuperscript𝐏𝒯0{\mathbf{P}}^{\mathcal{T}}_{0}bold_P start_POSTSUPERSCRIPT caligraphic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT \triangleright Output denoised prediction map
30:    end if
31:end function
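
The diffusion and denoising part of the pseudocode above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the linear beta schedule and its endpoints, the function names `alpha_bar`, `diffuse`, and `run_mtdnet`, and the toy denoiser in the usage lines are all hypothetical. The real Denoiser is the cross-attention network described in Sec. 3.2.2.

```python
import numpy as np


def alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s); the linear schedule is an assumption."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def diffuse(x_init, abar_S, rng):
    """Diffusion process: X_S = sqrt(abar_S) * X_init + sqrt(1 - abar_S) * eps."""
    eps = rng.standard_normal(x_init.shape)  # eps ~ N(0, I)
    return np.sqrt(abar_S) * x_init + np.sqrt(1.0 - abar_S) * eps


def run_mtdnet(x_init, f_cond, denoiser, num_steps, rng, head=None):
    """Diffuse the initial map once, then denoise for s = S, S-1, ..., 1.

    head=None mimics Prediction Diffusion (P_0 = X_0); passing a callable
    mimics Feature Diffusion (P_0 = f_head(X_0)).
    """
    abar = alpha_bar(num_steps)
    x = diffuse(x_init, abar[-1], rng)
    for s in range(num_steps, 0, -1):
        x = denoiser(x, s, f_cond)
    return x if head is None else head(x)


# Usage with a placeholder denoiser (hypothetical, for shape checking only).
rng = np.random.default_rng(0)
x_init = np.ones((3, 4, 4))
f_cond = np.ones((3, 4, 4))
p0 = run_mtdnet(x_init, f_cond, lambda x, s, c: 0.5 * (x + c), num_steps=2, rng=rng)
print(p0.shape)
```

Passing the denoiser and the optional head as callables keeps the two variants (Prediction Diffusion and Feature Diffusion) on a single code path, mirroring the `version` branches of the pseudocode.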

3.2.2 Denoising Process

As the core component of our MTDNet, the denoising process relies on a Multi-Task Conditioned Denoiser, referred to as “Denoiser”, which denoises the noisy multi-task prediction maps or feature maps. Specifically, given the decayed noisy map of the target task $\mathbf{X}_{S}^{\mathcal{T}}$ from the diffusion process, Denoiser generates $\mathbf{X}_{S-1}^{\mathcal{T}}$, $\mathbf{X}_{S-2}^{\mathcal{T}}$, …, $\mathbf{X}_{0}^{\mathcal{T}}$ iteratively. In the following, we first introduce a novel Multi-Task Conditioning strategy, and then describe how Denoiser computes the denoised map in each denoising step. Fig. 4 illustrates the computation pipeline for a single denoising step of Prediction Diffusion.

Multi-Task Conditioning To help denoise the prediction or feature maps of the labeled tasks, and to enable learning of the unlabeled tasks under a partially annotated setting, we propose a Multi-Task Conditioning strategy in the denoising process. It first obtains a “multi-task condition feature” (denoted as $\mathbf{F}_{cond}$) from the initial maps of all the tasks. $\mathbf{F}_{cond}$ captures the joint multi-task information, which is later used to condition the denoising network. To obtain $\mathbf{F}_{cond}$, we first project the initial map of each task to feature space via a $3\times 3$ convolution, which yields task-specific features for all $T$ tasks, and then combine these features along the channel dimension. The resulting tensor is passed through a $3\times 3$ convolutional layer that reduces the channel dimension to $C$, and its spatial dimensions are flattened, resulting in the multi-task condition feature $\mathbf{F}_{cond}$. We refer to this computational process as $f_{cond}$.
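
To make the shapes in $f_{cond}$ concrete, the following NumPy sketch walks through the three stages described above: per-task projection, channel-wise combination, and fusion plus flattening. It is a hedged illustration only. The naive `conv3x3` helper, the random weights, the use of concatenation as the way of combining features, and the `(H*W, C)` token layout are assumptions made for this example and are not specified in the text.

```python
import numpy as np


def conv3x3(x, weight):
    """Naive 3x3 convolution (cross-correlation) with zero padding of 1.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3); returns (C_out, H, W).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]  # (C_in, H, W)
            out += np.einsum("oi,ihw->ohw", weight[:, :, dy, dx], window)
    return out


def f_cond(init_maps, proj_weights, fuse_weight):
    """Project each task map to C channels, concatenate, fuse to C, flatten."""
    feats = [conv3x3(m, w) for m, w in zip(init_maps, proj_weights)]
    stacked = np.concatenate(feats, axis=0)  # (T * C, H, W), assumed concatenation
    fused = conv3x3(stacked, fuse_weight)    # (C, H, W)
    c, h, w = fused.shape
    return fused.reshape(c, h * w).T         # (H * W, C) token layout (assumed)


# Usage: two tasks (7-class segmentation and 1-channel depth), C = 8.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
maps = [rng.standard_normal((7, H, W)), rng.standard_normal((1, H, W))]
proj = [rng.standard_normal((C, 7, 3, 3)), rng.standard_normal((C, 1, 3, 3))]
fuse = rng.standard_normal((C, 2 * C, 3, 3))
print(f_cond(maps, proj, fuse).shape)
```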

Multi-Task Conditioned Denoiser We illustrate the structure of the denoiser in Fig. 4. The denoiser is a series of cross-attention transformer blocks that takes the noisy map $\mathbf{X}_{s}^{\mathcal{T}}$, the denoising step $s$, and the multi-task condition feature $\mathbf{F}_{cond}$ as input, and generates the denoised map $\mathbf{X}_{s-1}^{\mathcal{T}}$, which is then used as input for the next step. Specifically, we start by projecting the noisy map of the target task to a $C$-channel task embedding via a $3\times 3$ convolution and flattening its spatial dimensions. We then embed the denoising step $s$ using a typical sinusoidal embedding module [8]; we call this process “Step Embedding”. We add the step embedding to the task embedding. The resulting task embedding $\mathbf{E}^{\mathcal{T}}_{s}$ serves as the key and value tensors ($\mathbf{K}$, $\mathbf{V}$) in the subsequent transformer blocks, and $\mathbf{F}_{cond}$ is supplied to the transformer blocks as the query $\mathbf{Q}$:

$$\mathbf{Q} \leftarrow \mathbf{F}_{cond}, \quad \mathbf{K} \leftarrow \mathbf{E}^{\mathcal{T}}_{s}, \quad \mathbf{V} \leftarrow \mathbf{E}^{\mathcal{T}}_{s}. \tag{4}$$

The transformer blocks receive $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ as input. Each block comprises layer normalization, cross-attention, and feed-forward networks, as shown in Fig. 4. For more details on the transformer architecture, please consult [49]. Here, the cross-attention transformer blocks absorb information from the task embedding $\mathbf{E}^{\mathcal{T}}_{s}$, guided by the multi-task condition feature $\mathbf{F}_{cond}$.
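
The role assignment in Eq. (4) can be illustrated with a minimal single-head cross-attention in NumPy, together with one common form of sinusoidal step embedding. This is a sketch under assumptions rather than the architecture used in the paper: learned query/key/value projections, multiple heads, normalization, residual connections, and feed-forward layers are all omitted, and the sin-then-cos ordering and the frequency base of 10000 are conventional choices assumed here. The function names are hypothetical.

```python
import numpy as np


def step_embedding(s, dim):
    """Sinusoidal embedding of the integer denoising step s (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = s * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(f_cond, task_emb):
    """Scaled dot-product attention with Q = F_cond and K = V = E_s (Eq. 4)."""
    q, k, v = f_cond, task_emb, task_emb
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N_q, N_k)
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ v, weights


# Usage: 16 spatial tokens with C = 8 channels at denoising step s = 2.
rng = np.random.default_rng(0)
f_cond = rng.standard_normal((16, 8))
task_emb = rng.standard_normal((16, 8)) + step_embedding(2, 8)  # add step embedding
out, attn = cross_attention(f_cond, task_emb)
print(out.shape, attn.shape)
```

Because the condition feature acts as the query, each output token is a weighted average of task-embedding tokens, with the weights determined by the joint multi-task information.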

The output procedure of a denoising step differs between Prediction Diffusion and Feature Diffusion. In Prediction Diffusion, the output of the transformer blocks is reshaped to a spatial map and projected to the prediction map $\mathbf{P}_{s-1}^{\mathcal{T}}$ using a task-specific head that consists of several convolutional layers with ReLU. In Feature Diffusion, the output of the transformer blocks is projected to a feature map $\mathbf{F}_{s-1}^{\mathcal{T}}$, which serves as the output of this step (i.e. $\mathbf{X}_{s-1}^{\mathcal{T}}$). More implementation details can be found in Sec. 4.1. We can formulate each step in the denoising process as:

$$\mathbf{X}^{\mathcal{T}}_{s-1} = \mathrm{Denoiser}(\mathbf{X}^{\mathcal{T}}_{s}, s, \mathbf{F}_{cond}), \quad s \in \{S, S-1, \dots, 1\}. \tag{5}$$

For the final output after $S$ denoising steps, Prediction Diffusion directly generates the denoised prediction map of the target task $\mathbf{P}_{0}^{\mathcal{T}}$, which is used to compute the task-specific loss supervised by the available ground-truth label. In Feature Diffusion, we need a final task-specific head $f_{head}^{\mathcal{T}}$ to project the denoised feature map $\mathbf{F}_{0}^{\mathcal{T}}$ to the final prediction map $\mathbf{P}_{0}^{\mathcal{T}}$. We present the detailed training and inference pipelines of DiffusionMTL in Pseudocode 1.

3.3 Model Optimization

The whole DiffusionMTL model can be trained end-to-end under the MTPSL setting. Specifically, for each training sample, we apply task-specific losses to both the initial prediction maps and the final denoised prediction maps of the tasks with labels. For the unlabeled tasks, there are no ground-truth supervision signals, but their task-specific decoders can still be trained implicitly via the proposed Multi-Task Conditioning. More details about the losses are given in the supplemental materials.
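
The supervision rule described above (losses on initial and denoised predictions, for labeled tasks only) can be sketched as follows. This is a hypothetical illustration: the function name `partial_label_loss`, the dictionary-based interface, the use of `None` to mark a missing label, and the equal weighting of all loss terms are assumptions, since the actual loss definitions are given in the supplemental materials.

```python
def partial_label_loss(init_preds, final_preds, labels, loss_fns):
    """Sum task losses over tasks whose label is available for this sample.

    labels maps task name -> ground truth, or None when the task is unlabeled.
    Each labeled task contributes a loss on its initial prediction and on its
    final denoised prediction (equal weights are an assumption).
    """
    total = 0.0
    supervised = []
    for task, label in labels.items():
        if label is None:
            continue  # no direct supervision for unlabeled tasks
        loss_fn = loss_fns[task]
        total += loss_fn(init_preds[task], label)
        total += loss_fn(final_preds[task], label)
        supervised.append(task)
    return total, supervised


# Usage with scalar stand-ins for prediction maps and an L1 loss.
l1 = lambda pred, target: abs(pred - target)
labels = {"semseg": 1.0, "depth": None}
init_preds = {"semseg": 0.5, "depth": 3.0}
final_preds = {"semseg": 0.75, "depth": 2.0}
total, supervised = partial_label_loss(
    init_preds, final_preds, labels, {"semseg": l1, "depth": l1}
)
print(total, supervised)
```

In a full model, the unlabeled task's decoder would still receive gradients, because its initial map enters $\mathbf{F}_{cond}$ and therefore influences the denoised prediction of the labeled task.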

4 Experiments

4.1 Experimental Setup

Datasets and Tasks Following the pioneering work in multi-task partially supervised learning [6], we adopt three prevalent multi-task datasets with dense annotations, i.e. PASCAL [50], NYUD [51], and Cityscapes [52]. PASCAL is a comprehensive dataset providing images of both indoor and outdoor scenes. There are 4,998 training images and 5,105 testing images, with labels for semantic segmentation, human parsing, and object boundary detection. Additionally, [1] generates pseudo labels for surface normal estimation and saliency detection. NYUD (or NYUD-v2) provides images of indoor scenes as well as dense annotations for 13-class semantic segmentation and depth estimation. The images are resized to 288×384. The surface normals can be generated from depth. The training set contains 795 images, while the testing set contains 654 images. Cityscapes captures street scenes of different cities with fine pixel-level annotations. Following [6, 10], we use 7-class semantic segmentation and monocular depth estimation tasks in the experiments. The images are resized to 128×256. There are 2,975 training images and 500 validation images.

Task Metrics We adopt the same metrics for different tasks as previous work [6]. We use the mean Intersection over Union (mIoU) to evaluate the performance of the semantic segmentation (Semseg) and human parsing (Parsing) tasks, while the absolute error (absErr) is used for evaluating the monocular depth estimation task (Depth). For the surface normal estimation task (Normal), we use the mean error of angles (mErr) as the evaluation metric, while for the saliency detection task (Saliency), we use the maximal F-measure (maxF). The object boundary detection task (Boundary) is evaluated using the optimal-dataset-scale F-measure (odsF). To quantify the overall multi-task performance relative to the single-task baseline, we calculate the mean relative difference across all tasks, denoted as Multi-task Performance (MTL Perf $\Delta_m$) [1].
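
As a worked example of the mean relative difference, the sketch below computes $\Delta_m$ for the MTL Baseline under the PASCAL one-label setting, using the values reported in Table 3. The sign convention (negating the relative difference for metrics where lower is better, such as mErr and absErr) is the one commonly used in the multi-task literature and is assumed here, since the text only cites [1] for the definition. The function name `mtl_perf` is hypothetical.

```python
def mtl_perf(multi_task, single_task, lower_is_better):
    """Mean relative difference of multi-task metrics w.r.t. single-task ones.

    For metrics where lower is better, the relative difference is negated so
    that a positive value always means an improvement over the baseline.
    """
    terms = []
    for task, baseline in single_task.items():
        rel = (multi_task[task] - baseline) / baseline
        terms.append(-rel if task in lower_is_better else rel)
    return sum(terms) / len(terms)


# PASCAL, one-label setting, values copied from Table 3.
single_task = {"semseg": 50.34, "parsing": 59.05, "saliency": 77.43,
               "normal": 16.59, "boundary": 64.40}
mtl_baseline = {"semseg": 49.71, "parsing": 56.00, "saliency": 74.50,
                "normal": 16.85, "boundary": 62.80}
delta_m = mtl_perf(mtl_baseline, single_task, lower_is_better={"normal"})
print(f"{100 * delta_m:+.2f}%")  # about -2.85%, matching Table 3
```

Applying the same function to the Cityscapes rows of Table 1 reproduces the reported values only approximately, because the absErr entries in that table are rounded to four decimal places.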

MTPSL Evaluation Settings There are two evaluation settings for multi-task partially supervised learning [6]: (i) the one-label setting, where each training image has the ground-truth label of only one task, and (ii) the random-label setting, where the number of labeled tasks for each image is random. We use exactly the same image-task label mappings as [6] for a strictly fair comparison.
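
The two settings can be pictured as per-image label masks. The sketch below is purely illustrative and does not reproduce the fixed image-task mappings of [6] that are used in the experiments. The function names are hypothetical, and the assumption that the random-label setting keeps between one and $T-1$ labels is inferred from the remark in Table 1 that both settings coincide on the two-task Cityscapes benchmark.

```python
import random


def one_label_mask(tasks, rng):
    """Keep the label of exactly one randomly chosen task."""
    keep = rng.choice(tasks)
    return {task: task == keep for task in tasks}


def random_label_mask(tasks, rng):
    """Keep the labels of a random number of tasks, between 1 and T - 1 (assumed)."""
    num_kept = rng.randint(1, len(tasks) - 1)
    keep = set(rng.sample(tasks, num_kept))
    return {task: task in keep for task in tasks}


# Usage: masks for one PASCAL training image under each setting.
rng = random.Random(0)
pascal_tasks = ["semseg", "parsing", "saliency", "normal", "boundary"]
print(one_label_mask(pascal_tasks, rng))
print(random_label_mask(pascal_tasks, rng))
```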

Figure 5: Visualization of the prediction maps at different stages on Cityscapes. Our DiffusionMTL effectively denoises the noisy prediction maps of both tasks.

| # labels | Method | Semseg mIoU ↑ | Depth absErr ↓ | MTL Perf $\Delta_m$ ↑ |
|---|---|---|---|---|
| one / random | Single-Task | 75.82 | 0.0125 | - |
| one / random | MTL Baseline | 73.19 | 0.0168 | -18.81% |
| one / random | SS [6] | 71.67 | 0.0178 | - |
| one / random | XTC [6] | 74.90 | 0.0161 | - |
| one / random | XTC* [6] | 73.36 | 0.0158 | -14.74% |
| one / random | DiffusionMTL (Prediction) | 74.90 | 0.0131 | -2.79% |
| one / random | DiffusionMTL (Feature) | 75.67 | 0.0130 | -2.13% |

Table 1: Comparison with SOTA methods on Cityscapes. The proposed DiffusionMTL demonstrates superior performance on both tasks. The one-label setting is equivalent to the random-label setting on Cityscapes. “*” denotes re-implemented results.

Figure 6: Study of training DiffusionMTL with varying numbers of diffusion steps on Cityscapes. Adding diffusion steps increases the performance and FLOPs.

| Backbone | Method | #Params | FLOPs | Semseg mIoU ↑ | Parsing mIoU ↑ | Saliency maxF ↑ | Normal mErr ↓ | Boundary odsF ↑ | MTL Perf $\Delta_m$ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| R18 | DiffusionMTL (Feature) | 133M | 676G | 57.78 | 58.98 | 77.82 | 16.11 | 64.50 | +3.65% |
| R18 | DiffusionMTL (Prediction) | 133M | 628G | 59.43 | 56.79 | 77.57 | 16.20 | 64.00 | +3.23% |
| R18 | - w/o Diffusion | 133M | 628G | 57.77 | 56.39 | 77.44 | 17.34 | 60.60 | +0.11% |
| R18 | - w/o Multi-Task Cond | 125M | 558G | 55.95 | 55.90 | 77.22 | 17.87 | 61.50 | -1.33% |
| R50 | DiffusionMTL (Feature) | 159M | 742G | 58.78 | 61.91 | 77.07 | 16.49 | 66.20 | 0.77% |
| R50 | DiffusionMTL (Prediction) | 159M | 694G | 60.92 | 59.94 | 77.58 | 17.31 | 63.80 | -0.66% |
| R50 | - w/o Diffusion | 159M | 694G | 58.10 | 58.69 | 76.64 | 17.50 | 62.80 | -2.86% |
| R50 | - w/o Multi-Task Cond | 150M | 625G | 57.29 | 59.37 | 76.90 | 17.74 | 63.70 | -2.91% |

Table 2: Ablation study on PASCAL. “w/o Diffusion” indicates replacing the diffusion model with an iterative refinement model using an identical network structure. “w/o Multi-Task Cond” means removing Multi-Task Conditioning.

| # labels | Method | #Params | FLOPs | PASCAL Semseg mIoU ↑ | PASCAL Parsing mIoU ↑ | PASCAL Saliency maxF ↑ | PASCAL Normal mErr ↓ | PASCAL Boundary odsF ↑ | PASCAL MTL Perf $\Delta_m$ ↑ | NYUD Semseg mIoU ↑ | NYUD Depth absErr ↓ | NYUD Normal mErr ↓ | NYUD MTL Perf $\Delta_m$ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| one | Single-Task Baseline | 219M | 817G | 50.34 | 59.05 | 77.43 | 16.59 | 64.40 | - | 45.28 | 0.4802 | 25.93 | - |
| one | MTL Baseline | 157M | 608G | 49.71 | 56.00 | 74.50 | 16.85 | 62.80 | -2.85% | 43.92 | 0.5138 | 26.44 | -3.99% |
| one | SS [6] | - | - | 45.00 | 54.00 | 61.70 | 16.90 | 62.40 | - | 27.52 | 0.6499 | 33.58 | - |
| one | XTC [6] | - | - | 49.50 | 55.80 | 61.70 | 17.00 | 65.10 | - | 30.36 | 0.6088 | 32.08 | - |
| one | XTC* [6] | 173M | 608G | 55.08 | 56.72 | 77.06 | 16.93 | 63.70 | +0.37% | 43.97 | 0.5140 | 26.30 | -3.79% |
| one | DiffusionMTL (Prediction) | 133M | 628G | 59.43 | 56.79 | 77.57 | 16.20 | 64.00 | +3.23% | 44.97 | 0.5137 | 26.17 | -2.86% |
| one | DiffusionMTL (Feature) | 133M | 676G | 57.78 | 58.98 | 77.82 | 16.11 | 64.50 | +3.65% | 44.47 | 0.5059 | 25.84 | -2.27% |
| random | Single-Task | 219M | 817G | 51.51 | 57.90 | 80.30 | 15.24 | 67.80 | - | 48.25 | 0.4792 | 24.65 | - |
| random | MTL Baseline | 157M | 608G | 62.23 | 55.88 | 78.67 | 15.47 | 66.70 | +2.44% | 45.93 | 0.4839 | 25.53 | -3.12% |
| random | SS [6] | - | - | 59.00 | 55.80 | 64.00 | 15.90 | 66.90 | - | 29.50 | 0.6224 | 33.31 | - |
| random | XTC [6] | - | - | 59.00 | 55.60 | 64.00 | 15.90 | 67.80 | - | 34.26 | 0.5787 | 31.06 | - |
| random | XTC* [6] | 173M | 608G | 62.44 | 55.81 | 78.56 | 15.45 | 66.80 | +2.52% | 46.03 | 0.4811 | 25.97 | -3.44% |
| random | DiffusionMTL (Prediction) | 133M | 628G | 63.68 | 55.84 | 79.87 | 15.38 | 66.80 | +3.44% | 47.44 | 0.4803 | 25.26 | -1.45% |
| random | DiffusionMTL (Feature) | 133M | 676G | 62.55 | 56.84 | 80.44 | 14.85 | 67.10 | +4.27% | 46.82 | 0.4743 | 24.75 | -0.77% |

Table 3: Quantitative multi-task performance comparison with state-of-the-art (SOTA) methods on PASCAL and NYUD. All models use a ResNet-18 as the backbone. ‘one’ means each training image has only one labeled task, while ‘random’ means each training image has a random number of labeled tasks. The proposed DiffusionMTL, including both Prediction Diffusion and Feature Diffusion, achieves significantly better performance while using fewer model parameters. “*” denotes our re-implemented results.

Implementation Details For most models in the experiments, we use ResNet-18 as the backbone with ImageNet pre-trained weights provided by PyTorch. We concatenate the feature maps of the four stages of ResNet-18 along the channel dimension and process them with a $3\times 3$ convolution to reduce the number of channels to 512. For our DiffusionMTL, the initial multi-task backbone has a task-specific decoder for each task, using 2 residual convolutional blocks [53] followed by a $1\times 1$ convolution as the prediction head. Each residual convolutional block contains two $3\times 3$ convolutions with BN and ReLU. The denoising network uses 4 task-shared cross-attention transformer blocks, which are followed by a task-specific head of four $3\times 3$ CONV-ReLU layers and a $1\times 1$ convolution for task prediction in Prediction Diffusion. We use 2 steps in the diffusion process. More implementation details can be found in the supplemental materials.

4.2 Main Experiments

Declaration of Comparison Models We consider several models for comparison to verify the effectiveness of our proposed DiffusionMTL framework: (i) “MTL Baseline” is the baseline model. It shares the same backbone as DiffusionMTL and uses a strong task-specific decoder for each task. Each decoder consists of 6 residual convolutional blocks, followed by a $1\times 1$ convolution as the prediction head. (ii) “SS” and “XTC” [6] are pioneering state-of-the-art methods, as introduced in related work. For a fair comparison, XTC is re-implemented on our MTL Baseline based on the official code. (iii) “Single-Task” is a single-task learning version of the MTL Baseline. It contains a set of separate models, each trained to learn only one task.

Figure 7: Qualitative comparison between the state-of-the-art XTC [6] and our method on PASCAL under the one-label setting. XTC suffers from the issue of noisy predictions. In contrast, our DiffusionMTL model learns to denoise the noisy prediction maps, resulting in significantly better multi-task prediction maps.

Figure 8: Performance of the initial multi-task predictions (blue) and the denoised predictions (yellow) of DiffusionMTL (Prediction) on PASCAL, NYUD, and Cityscapes under the one-label setting. Our denoising network improves the prediction quality of all 10 tasks.

Figure 9: Visualization of the prediction maps at different stages on PASCAL. Our DiffusionMTL can denoise and rectify the noisy multi-task prediction maps.

Quantitative Comparison with SOTAs We compare the proposed DiffusionMTL with the competitors introduced above on three widely-used benchmarks, i.e. Cityscapes, PASCAL, and NYUD. Table 1 reports the results on Cityscapes, and Table 3 reports the results on PASCAL and NYUD. As can be observed from the tables, DiffusionMTL improves over the competing MTL Baseline and XTC [6] on all three benchmarks. Specifically, under the challenging one-label setting on PASCAL, DiffusionMTL (Prediction) outperforms the MTL Baseline by +9.72 mIoU on Semseg and +3.07 maxF on Saliency, while the multi-task performance $\Delta_m$ improves by +6.08%. Compared with the re-implemented state-of-the-art method XTC*, it achieves an improvement of +2.86% in $\Delta_m$. We observe consistent performance gains under the random-label setting. Similarly, under the one-label setting of NYUD, our Feature Diffusion improves $\Delta_m$ by +1.52% compared with XTC*. On Cityscapes, where the one-label setting is equivalent to the random-label setting, DiffusionMTL (Prediction) improves $\Delta_m$ by +11.95% compared with the previous best XTC*. Regarding model size, as shown in Table 3, our model uses only 133M parameters, fewer than the 173M used by XTC* [6]. These results demonstrate the effectiveness of the proposed DiffusionMTL method, which outperforms the competing methods on all benchmarks while using fewer model parameters.

Qualitative Comparison with SOTAs To examine the quality of generated multi-task predictions by DiffusionMTL, we visualize the predictions by Prediction Diffusion trained under the one-label setting of PASCAL, and compare them with the outputs of SOTA model [6] as well as ground-truth labels in Fig. 7. The images are randomly chosen from the testing set of PASCAL. We observe that the generated multi-task predictions of the previous best method are noisy, which confirms our motivation to design a multi-task denoising framework for the multi-task partially supervised learning problem. With the proposed DiffusionMTL, the prediction quality is significantly improved.

4.3 Ablation Study

We conduct comprehensive ablation experiments to evaluate the effectiveness of different components of DiffusionMTL for multi-task partially supervised learning and show the results on PASCAL one-label setting in Table 2.

Effectiveness of Multi-Task Denoising Diffusion Network To further confirm the effectiveness of the proposed multi-task denoising diffusion network (MTDNet), we replace it with an iterative refinement network using an identical network structure (i.e. cross-attention transformer blocks in Prediction Diffusion). This variant is denoted as “w/o Diffusion” in Table 2. MTDNet improves the multi-task performance $\Delta_m$ by +3.12% (ResNet-18) and +2.20% (ResNet-50) at the same parameter count and FLOPs, which validates the effectiveness of the proposed multi-task denoising method. Moreover, we plot the performance metrics of the initial predictions and the final predictions in Fig. 8, which shows the improvement brought by MTDNet on all 10 tasks of the three benchmarks.

Effectiveness of Multi-Task Conditioning To evaluate the efficacy of the multi-task conditioning strategy, we conduct ablation experiments by removing it from DiffusionMTL (Prediction) and replacing cross-attention blocks with self-attention blocks. This variant is indicated as “w/o Multi-Task Cond”. Multi-task conditioning leads to significant improvement on all tasks, underscoring the unique importance of multi-task information sharing in the partially-labeled multi-task learning problem.

Comparison of Feature Diffusion and Prediction Diffusion As observed in Table 2, both multi-task diffusion mechanisms achieve strong multi-task performance with different backbones. Prediction Diffusion is more computationally efficient due to the lower dimensions of the intermediate maps, whereas Feature Diffusion achieves higher performance on most tasks by capturing more visual information in the features.

Qualitative Analysis of Denoising Effect In the denoising process, the model is designed to denoise the noisy multi-task prediction maps that are degraded by the diffusion process. To evaluate the denoising performance of our Prediction Diffusion, we provide visualizations of the multi-task prediction maps at different phases in Fig. 5 and Fig. 9. Our model generates clean and accurate multi-task prediction maps from noisy inputs, which firmly indicates the effectiveness of our multi-task denoising framework.

Influence of Diffusion Steps We plot the performance metrics against diffusion steps on Cityscapes in Fig. 6. Our observations reveal that utilizing two steps yields a notable improvement in performance compared to using just one step. Moreover, increasing the number of steps further enhances performance, albeit at a higher computational cost. Therefore, we set the default number of steps to two.

5 Conclusion

Our study aims to address the issue of noisy predictions in multi-task learning from partially annotated data. We propose a unified multi-task denoising diffusion framework that refines multi-task signals in the feature and prediction spaces separately. Additionally, we introduce an effective Multi-Task Conditioning strategy to enhance denoising performance and facilitate learning of unlabelled tasks through cross-task information sharing. Extensive experiments on three prevalent datasets validate our approach, which outperforms previous methods by a significant margin.

Acknowledgement This research is partially supported by the Early Career Scheme of the Research Grants Council (RGC) of the Hong Kong SAR under grant No. 26202321.

References

  • [1] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In CVPR, 2019.
  • [2] Amir R Zamir, Alexander Sax, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, and Leonidas J Guibas. Robust learning through cross-task consistency. In CVPR, 2020.
  • [3] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
  • [4] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In CVPR, 2018.
  • [5] Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. TPAMI, 44(7):3614–3633, 2021.
  • [6] Wei-Hong Li, Xialei Liu, and Hakan Bilen. Learning multiple dense prediction tasks from partially annotated data. In CVPR, 2022.
  • [7] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
  • [9] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
  • [10] Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In CVPR, 2019.
  • [11] Yu Zhang and Qiang Yang. A survey on multi-task learning. TKDE, 2021.
  • [12] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
  • [13] Menelaos Kanakis, Thomas E Huang, David Brüggemann, Fisher Yu, and Luc Van Gool. Composite learning for robust and effective dense predictions. In WACV, 2023.
  • [14] Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, and Zhangyang Wang. M3 vit: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. In NeurIPS, 2022.
  • [15] Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik Learned-Miller, and Chuang Gan. Mod-squad: Designing mixture of experts as modular multi-task learners. arXiv preprint arXiv:2212.08066, 2022.
  • [16] Lukas Hoyer, Dengxin Dai, Yuhua Chen, Adrian Koring, Suman Saha, and Luc Van Gool. Three ways to improve semantic segmentation with self-supervised depth estimation. In CVPR, 2021.
  • [17] Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Mti-net: Multi-scale task interaction networks for multi-task learning. In ECCV, 2020.
  • [18] David Bruggemann, Menelaos Kanakis, Anton Obukhov, Stamatios Georgoulis, and Luc Van Gool. Exploring relational context for multi-task dense prediction. In ICCV, 2021.
  • [19] Siwei Yang, Hanrong Ye, and Dan Xu. Contrastive multi-task dense prediction. In AAAI, 2023.
  • [20] Shikun Liu, Stephen James, Andrew J Davison, and Edward Johns. Auto-lambda: Disentangling dynamic task relationships. TMLR, 2022.
  • [21] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
  • [22] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In NeurIPS, 2021.
  • [23] Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. In NeurIPS, 2020.
  • [24] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.
  • [25] Zirui Wang, Yulia Tsvetkov, Orhan Firat, and Yuan Cao. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. In ICLR, 2020.
  • [26] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In NeurIPS, 2020.
  • [27] Hanrong Ye and Dan Xu. Taskprompter: Spatial-channel multi-task prompting for dense scene understanding. In ICLR, 2023.
  • [28] Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. Multimae: Multi-modal multi-task masked autoencoders. In ECCV, 2022.
  • [29] Yangyang Xu, Xiangtai Li, Haobo Yuan, Yibo Yang, and Lefei Zhang. Multi-task learning with multi-query transformer for dense prediction. TCSVT, 2023.
  • [30] Hanrong Ye and Dan Xu. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In ICCV, 2023.
  • [31] Xiaogang Xu, Hengshuang Zhao, Vibhav Vineet, Ser-Nam Lim, and Antonio Torralba. Mtformer: Multi-task learning via transformer and cross-task reasoning. In ECCV, 2022.
  • [32] Hanrong Ye and Dan Xu. Inverted pyramid multi-task transformer for dense scene understanding. In ECCV, 2022.
  • [33] Hanrong Ye and Dan Xu. Invpt++: Inverted pyramid multi-task transformer for visual scene understanding. arXiv preprint arXiv:2306.04842, 2023.
  • [34] Wei-Hong Li, Xialei Liu, and Hakan Bilen. Universal representations: A unified look at multiple task and domain learning. arXiv preprint arXiv:2204.02744, 2022.
  • [35] Yuan Gao, Jiayi Ma, Mingbo Zhao, Wei Liu, and Alan L Yuille. Nddr-cnn: Layerwise feature fusing in multi-task cnns by neural discriminative dimensionality reduction. In CVPR, 2019.
  • [36] Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, and Wei Liu. Mtl-nas: Task-agnostic neural architecture search towards general-purpose multi-task learning. In CVPR, 2020.
  • [37] Lijun Zhang, Xiao Liu, and Hui Guan. Automtl: A programming framework for automating efficient multi-task learning. In NeurIPS, 2021.
  • [38] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  • [39] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In NeurIPS, 2021.
  • [40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • [41] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  • [42] Xizewen Han, Huangjie Zheng, and Mingyuan Zhou. Card: Classification and regression diffusion models. arXiv preprint arXiv:2206.07275, 2022.
  • [43] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. In ICLR, 2022.
  • [44] Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.
  • [45] Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, and David J Fleet. A generalist framework for panoptic segmentation of images and videos. arXiv preprint arXiv:2210.06366, 2022.
  • [46] Zhangxuan Gu, Haoxing Chen, Zhuoer Xu, Jun Lan, Changhua Meng, and Weiqiang Wang. Diffusioninst: Diffusion model for instance segmentation. arXiv preprint arXiv:2212.02773, 2022.
  • [47] Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, and Dan Xu. Seggen: Supercharging segmentation models with text2mask and mask2img synthesis. arXiv preprint arXiv:2311.03355, 2023.
  • [48] Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Diffusiondet: Diffusion model for object detection. arXiv preprint arXiv:2211.09788, 2022.
  • [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [50] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 111:98–136, 2010.
  • [51] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  • [52] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [53] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Appendix A Supplemental Implementation Details

A.1 Additional Details about DiffusionMTL

Multi-Task Denoising Diffusion Network. This section provides additional details about the implementation of DiffusionMTL on different datasets. For our experiments, we set the default number of diffusion steps to 2 and use a linear variance scheduler ranging from $10^{-3}$ to $10^{-2}$. All self-attention blocks in the Denoiser use a single head.
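As a rough illustration of the schedule described above, the following sketch builds a two-step linear variance schedule between $10^{-3}$ and $10^{-2}$ and applies one closed-form noising step. The DDPM-style parameterization and the names `linear_beta_schedule`, `cumulative_alphas`, and `q_sample` are assumptions made for illustration; the text above only fixes the number of steps and the variance range.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-3, beta_end=1e-2):
    """Linearly spaced variances from beta_start to beta_end (inclusive)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), often written alpha-bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Assumed DDPM forward step: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(2)          # two diffusion steps, as in the text
alpha_bars = cumulative_alphas(betas)
# Toy "prediction map" flattened to three values (hypothetical input).
noisy = q_sample([0.2, -1.0, 0.5], t=1, alpha_bars=alpha_bars, rng=random.Random(0))
```

With only two steps and small variances, the cumulative signal coefficient stays close to 1, so the noised map remains near the initial prediction.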

Loss functions. For semantic segmentation, human parsing, saliency detection, and boundary detection, we use cross-entropy loss. For depth and surface normal estimation, we opt for L1 loss. The multi-task loss balance weights are the same as those used in [6].
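The loss setup above can be sketched as follows for a single pixel, assuming that tasks without a label for the current sample are simply skipped. The task names, the per-pixel simplification, and the unit weights are hypothetical; the actual balance weights are those of [6] and are not listed here.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one pixel: log-sum-exp of the logits minus the target logit."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def l1_loss(pred, target):
    """Mean absolute error over the channels of one pixel."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Classification-style tasks use cross-entropy; depth and normals use L1.
LOSS_FNS = {
    "semseg": cross_entropy,
    "parsing": cross_entropy,
    "saliency": cross_entropy,
    "boundary": cross_entropy,
    "depth": l1_loss,
    "normal": l1_loss,
}


def partial_multitask_loss(preds, labels, weights):
    """Weighted sum of task losses over the tasks that have a label."""
    total = 0.0
    for task, pred in preds.items():
        label = labels.get(task)
        if label is None:  # task not annotated for this sample
            continue
        total += weights[task] * LOSS_FNS[task](pred, label)
    return total


# Hypothetical sample labeled only for semantic segmentation.
preds = {"semseg": [0.0, 0.0], "depth": [1.0]}
labels = {"semseg": 0, "depth": None}
weights = {"semseg": 1.0, "depth": 1.0}  # placeholder weights, not those of [6]
loss = partial_multitask_loss(preds, labels, weights)
```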

A.2 Implementation Details on Different Datasets

For all three partial-labeling benchmarks (PASCAL, NYUD, and Cityscapes), we use exactly the same image-task label mappings as those used in [6].

PASCAL On PASCAL-Context [50] (abbreviated as “PASCAL”), in the one-label setting, 1000, 999, 1000, 1000, and 999 images are labeled for semantic segmentation, human parsing, surface normal estimation, saliency detection, and boundary detection, respectively. In the random-label setting, 450, 2553, 2480, 2445, and 2557 images are labeled for the same tasks, respectively. We pad the images to a resolution of $512\times 512$. We use the Adam optimizer and a polynomial learning rate scheduler with a base learning rate of $2\times 10^{-5}$. All models are trained for 100 epochs with a batch size of 6. We adopt the same data augmentations as in [17], which include random scaling, cropping, random horizontal flipping, and color jittering.
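For readers unfamiliar with polynomial learning rate decay, a minimal sketch is given below. The exponent `power=0.9` is a common choice and an assumption here, as is the step count in the usage line; the text only specifies the scheduler type and the base learning rate.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    frac = min(step, total_steps) / total_steps
    return base_lr * (1.0 - frac) ** power


# Hypothetical run of 1000 optimizer steps with the base learning rate from the text.
schedule = [poly_lr(2e-5, s, 1000) for s in range(0, 1001, 250)]
```

The learning rate starts at the base value and decays monotonically to zero at the final step.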

NYUD [51] In the one-label setting, 265 images are labeled for semantic segmentation, 265 images are labeled for monocular depth estimation, and 265 images are labeled for surface normal estimation. In the random-label setting, 392, 408, and 385 images are respectively labeled for these tasks. The images are resized to a resolution of $288\times 384$. We use the Adam optimizer and a polynomial learning rate scheduler with a base learning rate of $2\times 10^{-5}$. All models are trained for 200 epochs with a batch size of 4. We adopt the same data augmentations as in [6], which include random cropping and random horizontal flipping.

Cityscapes [52] As we only evaluate two tasks on the Cityscapes dataset, the one-label setting is equivalent to the random-label setting. The training split contains 1,487 labeled images for semantic segmentation and 1,488 labeled images for monocular depth estimation. We adopt a learning rate of $10^{-4}$. All models are trained for 200 epochs with a batch size of 8. The images are resized to a resolution of $128\times 256$. We adopt the data augmentations in [6], which include random cropping and random horizontal flipping.
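Since each image carries exactly one task label in the one-label setting, the per-task counts above should add up to the number of training images (4998 for PASCAL, 795 for NYUD, and 2975 for Cityscapes). The short check below also estimates the average number of labels per image in the random-label setting, under the assumption that both settings use the same training images; the dictionary and function names are illustrative.

```python
ONE_LABEL_COUNTS = {
    "PASCAL": {"semseg": 1000, "parsing": 999, "normal": 1000,
               "saliency": 1000, "boundary": 999},
    "NYUD": {"semseg": 265, "depth": 265, "normal": 265},
    "Cityscapes": {"semseg": 1487, "depth": 1488},
}

RANDOM_LABEL_COUNTS = {
    "PASCAL": {"semseg": 450, "parsing": 2553, "normal": 2480,
               "saliency": 2445, "boundary": 2557},
    "NYUD": {"semseg": 392, "depth": 408, "normal": 385},
}


def num_training_images(dataset):
    """One label per image, so the one-label counts sum to the image count."""
    return sum(ONE_LABEL_COUNTS[dataset].values())


def mean_labels_per_image(dataset):
    """Average number of task labels per image in the random-label setting."""
    return sum(RANDOM_LABEL_COUNTS[dataset].values()) / num_training_images(dataset)


sizes = {name: num_training_images(name) for name in ONE_LABEL_COUNTS}
avg_labels = {name: mean_labels_per_image(name) for name in RANDOM_LABEL_COUNTS}
```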

Appendix B Additional Quantitative Study

B.1 Comparison with SOTA Refinement Methods

We conduct extensive experiments to compare our proposal with previous SOTA MTL refinement methods, including MTI-Net [17] and InvPT [32], based on the ResNet-18 baseline under the one-label setting on the PASCAL dataset. The results, presented in Table 4, demonstrate the superior performance of DiffusionMTL: one of its two variants achieves the best result on every task, and both variants obtain a higher overall multi-task performance ($\Delta_m$) than all compared methods.

| Method | #Params | FLOPs | Train GPU Mem | Semseg mIoU ↑ | Parsing mIoU ↑ | Saliency maxF ↑ | Normal mErr ↓ | Boundary odsF ↑ | MTL Perf $\Delta_m$ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| MTL Baseline | 157M | 608G | 6163M | 49.71 | 56.00 | 74.50 | 16.85 | 62.80 | -2.85% |
| XTC [6] | 173M | 608G | 6409M | 55.08 | 56.72 | 77.06 | 16.93 | 63.70 | +0.37% |
| MTINet [17] | 281M | 589G | 11533M | 54.32 | 57.73 | 77.12 | 16.41 | 64.20 | +1.21% |
| InvPT [32] | 141M | 1182G | 9993M | 56.96 | 57.05 | 77.19 | 16.80 | 63.20 | +1.27% |
| DiffusionMTL (Prediction) | 133M | 628G | 5703M | 59.43 | 56.79 | 77.57 | 16.20 | 64.00 | +3.23% |
| DiffusionMTL (Feature) | 133M | 676G | 5811M | 57.78 | 58.98 | 77.82 | 16.11 | 64.50 | +3.65% |

Table 4: One-label setting on PASCAL with ResNet-18 backbone.

B.2 Comparison under Fully-Annotated Setting

Our method can also be applied to fully-annotated benchmarks. We conduct experiments on the fully-annotated PASCAL dataset using ResNet-18 and show the results in Table 5. Our method achieves a higher overall multi-task performance ($\Delta_m$) than the MTL baseline and the state-of-the-art (SOTA) methods XTC [6] and InvPT [32].

| Method | #Params | FLOPs | Semseg mIoU ↑ | Parsing mIoU ↑ | Saliency maxF ↑ | Normal mErr ↓ | Boundary odsF ↑ | MTL Perf $\Delta_m$ ↑ |
|---|---|---|---|---|---|---|---|---|
| STL Baseline | 219M | 817G | 52.56 | 62.21 | 82.75 | 14.12 | 68.90 | - |
| MTL Baseline | 157M | 608G | 62.91 | 57.37 | 81.82 | 14.49 | 66.40 | +0.90% |
| XTC [6] | 173M | 608G | 63.29 | 57.93 | 82.09 | 14.48 | 66.50 | +1.34% |
| InvPT [32] | 141M | 1182G | 64.38 | 59.49 | 83.52 | 14.75 | 66.80 | +2.31% |
| DiffusionMTL (Prediction) | 133M | 628G | 64.31 | 58.68 | 83.07 | 14.44 | 67.10 | +2.44% |
| DiffusionMTL (Feature) | 133M | 676G | 64.62 | 60.14 | 83.99 | 14.17 | 67.80 | +3.84% |

Table 5: Fully-annotated setting on PASCAL with ResNet-18.
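The $\Delta_m$ column is not defined in this appendix. Assuming the definition commonly used in multi-task dense prediction, namely the average per-task relative change with respect to the single-task (STL) baseline, with the sign flipped for metrics where lower is better, the column of Table 5 can be recomputed from the per-task numbers. The sketch below does this; the function and variable names are illustrative. Under this assumption, the recomputed values agree with the reported ones to within about 0.01 percentage points.

```python
# Per-task results copied from Table 5 (fully-annotated PASCAL, ResNet-18).
STL = {"semseg": 52.56, "parsing": 62.21, "saliency": 82.75,
       "normal": 14.12, "boundary": 68.90}

RESULTS = {
    "MTL Baseline": {"semseg": 62.91, "parsing": 57.37, "saliency": 81.82,
                     "normal": 14.49, "boundary": 66.40},
    "XTC": {"semseg": 63.29, "parsing": 57.93, "saliency": 82.09,
            "normal": 14.48, "boundary": 66.50},
    "InvPT": {"semseg": 64.38, "parsing": 59.49, "saliency": 83.52,
              "normal": 14.75, "boundary": 66.80},
    "DiffusionMTL (Prediction)": {"semseg": 64.31, "parsing": 58.68, "saliency": 83.07,
                                  "normal": 14.44, "boundary": 67.10},
    "DiffusionMTL (Feature)": {"semseg": 64.62, "parsing": 60.14, "saliency": 83.99,
                               "normal": 14.17, "boundary": 67.80},
}

LOWER_IS_BETTER = {"normal"}  # mean angular error: smaller is better


def delta_m(model, baseline):
    """Average relative per-task change over the baseline, in percent."""
    total = 0.0
    for task, base in baseline.items():
        sign = -1.0 if task in LOWER_IS_BETTER else 1.0
        total += sign * (model[task] - base) / base
    return 100.0 * total / len(baseline)


recomputed = {name: delta_m(scores, STL) for name, scores in RESULTS.items()}
```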

B.3 Computation and Memory Cost Comparison

We have already compared the parameters and FLOPs with the MTL baseline and XTC in Table 3 of our main paper. Table 4 of this document additionally reports the training GPU memory. Both DiffusionMTL variants use the fewest parameters (133M) and the least training GPU memory (5703M and 5811M) among the compared methods, at a computational cost comparable to the MTL baseline (628G and 676G versus 608G FLOPs), while achieving significantly better performance.

Appendix C Additional Qualitative Study

C.1 Denoising Effectiveness of DiffusionMTL

To assess the denoising performance of our model, we visually examine the noisy multi-task prediction maps generated through the diffusion process, as well as the denoised outputs produced by Prediction Diffusion based on ResNet-18 on the Cityscapes dataset under the one-label training setting. The results are shown in Fig. 10 and Fig. 11. DiffusionMTL successfully denoises the noisy prediction maps, yielding significantly improved multi-task predictions that align better with the ground-truth labels. These results further validate our motivation for designing a robust multi-task denoising diffusion framework that addresses the challenges inherent in the multi-task partially supervised learning problem.

C.2 Comparison with SOTA

In order to further demonstrate the performance advantage of DiffusionMTL, we present a set of randomly selected samples generated by our model and the previous state-of-the-art model (i.e., XTC [6]) on Cityscapes in Fig. 12 and Fig. 13. We further compare the results on PASCAL in Fig. 14 and Fig. 15. These models are trained under the same one-label multi-task partially supervised learning setting. The superiority of prediction maps generated by DiffusionMTL in terms of accuracy is evident on both datasets. This compelling evidence serves to further validate the effectiveness of our proposed denoising diffusion model.

Figure 10: Qualitative comparison of the initial multi-task predictions, decayed predictions, our denoised results, and ground-truth labels on Cityscapes under one-label setting. Our DiffusionMTL is able to rectify noisy input and generate clean prediction maps. The model used in this comparison is trained on the Cityscapes dataset under the one-label MTPSL setting.
Figure 11: Qualitative comparison of the initial multi-task predictions, decayed predictions, our denoised results, and ground-truth labels on Cityscapes under one-label setting. Our DiffusionMTL is able to rectify noisy input and generate clean prediction maps. The model used in this comparison is trained on the Cityscapes dataset under the one-label MTPSL setting.
Figure 12: Qualitative comparison between our method and the state-of-the-art method (i.e., XTC [6]) for depth estimation and semantic segmentation tasks on the Cityscapes dataset, using the same ResNet-18 backbone. Our DiffusionMTL approach outperforms the previous state-of-the-art method in producing superior prediction maps. Notably, each training sample is labeled for only one task.
Figure 13: Qualitative comparison between our method and the state-of-the-art method (i.e. XTC [6]) for depth estimation and semantic segmentation tasks on the Cityscapes dataset, using the same ResNet-18 backbone. Our DiffusionMTL approach outperforms the previous state-of-the-art method in producing superior prediction maps. Notably, each training sample is labeled for only one task.
Figure 14: Qualitative comparison between our method and the state-of-the-art method (i.e., XTC [6]) on the PASCAL dataset, using the same ResNet-18 backbone. Our DiffusionMTL approach outperforms the previous state-of-the-art method in producing superior prediction maps. Notably, each training sample is labeled for only one task.
Figure 15: Qualitative comparison between our method and the state-of-the-art method (i.e., XTC [6]) on the PASCAL dataset, using the same ResNet-18 backbone. Our DiffusionMTL approach outperforms the previous state-of-the-art method in producing superior prediction maps. Notably, each training sample is labeled for only one task.