Version 1: Received: 8 April 2024 / Approved: 9 April 2024 / Online: 9 April 2024 (11:43:57 CEST)
How to cite:
Pan, L.; Yang, Y.; Wang, Z.; Zhang, R. Text-to-Image Segmentation with Open-Vocabulary and Multitasking. Preprints 2024, 2024040631. https://doi.org/10.20944/preprints202404.0631.v1
APA Style
Pan, L., Yang, Y., Wang, Z., & Zhang, R. (2024). Text-to-Image Segmentation with Open-Vocabulary and Multitasking. Preprints. https://doi.org/10.20944/preprints202404.0631.v1
Chicago/Turabian Style
Pan, L., Y. Yang, Zhengkui Wang, and Rui Zhang. 2024. "Text-to-Image Segmentation with Open-Vocabulary and Multitasking." Preprints. https://doi.org/10.20944/preprints202404.0631.v1
Abstract
Open-vocabulary learning has recently gained prominence as a means of enabling image segmentation for arbitrary categories described in text. This advancement extends the applicability of segmentation systems to a broader range of general-purpose scenarios. However, current methods often revolve around specialized architectures and parameters tailored to specific segmentation tasks, resulting in a fragmented landscape of segmentation models. In response to these challenges, we introduce OVAMTSeg, a versatile framework for Open-Vocabulary and Multitask image Segmentation. OVAMTSeg harnesses adaptive prompt learning to enable the model to capture category-sensitive concepts, improving its robustness across diverse tasks and scenarios. Text prompts capture the semantic and contextual features of the text, while cross-attention and cross-modal interactions fuse image and text features; a transformer-based decoder then produces dense predictions. Extensive experiments demonstrate the effectiveness of OVAMTSeg, showing state-of-the-art performance and superior generalization across three segmentation tasks: 47.5 mIoU in referring expression segmentation; 51.6 mIoU on Pascal-VOC with four unseen classes and 46.6 mIoU on Pascal-Context in zero-shot segmentation; and 65.9 mIoU on Pascal-5i and 35.7 mIoU on COCO-20i in one-shot segmentation.
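The fusion-and-decode pipeline sketched in the abstract (image tokens attending to text-prompt tokens, followed by a transformer decoder for dense prediction) can be illustrated in a few lines of PyTorch. This is a minimal sketch, not the OVAMTSeg implementation: the module names, dimensions, and the single-logit mask head are hypothetical placeholders standing in for the paper's actual design.

    # Illustrative sketch only (hypothetical modules, not the authors' code):
    # image patch tokens attend to text prompt tokens via cross-attention,
    # and a small transformer decoder maps fused tokens to coarse mask logits.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, img_tokens, txt_tokens):
            # img_tokens: (B, N_img, dim); txt_tokens: (B, N_txt, dim)
            fused, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
            return self.norm(img_tokens + fused)  # residual connection

    class DenseDecoder(nn.Module):
        def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.decoder = nn.TransformerEncoder(layer, num_layers=layers)
            self.head = nn.Linear(dim, 1)  # one foreground logit per patch token

        def forward(self, tokens, h, w):
            x = self.decoder(tokens)               # (B, h*w, dim)
            logits = self.head(x).transpose(1, 2)  # (B, 1, h*w)
            return logits.reshape(-1, 1, h, w)     # coarse mask logits

    # Toy usage: random features stand in for the image/text encoder outputs.
    B, h, w, dim = 2, 14, 14, 256
    img_tokens = torch.randn(B, h * w, dim)  # e.g., ViT patch features
    txt_tokens = torch.randn(B, 16, dim)     # e.g., text-prompt embeddings
    fused = CrossModalFusion(dim)(img_tokens, txt_tokens)
    masks = DenseDecoder(dim)(fused, h, w)   # -> (2, 1, 14, 14)

In practice the coarse logits would be upsampled to full resolution, and the mask head would be conditioned on the text embedding rather than a fixed linear layer; the sketch only shows where the cross-modal interaction and dense decoding sit relative to each other.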
Keywords
image segmentation; open vocabulary; multitask; multi-modal interaction
Subject
Computer Science and Mathematics, Computer Vision and Graphics
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.