LLM-Enhanced Composed Image Retrieval: An Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model

Published: 18 January 2025

Abstract

Composed image retrieval (CoIR) takes a multi-modal query consisting of a reference image and a modification text describing the desired changes, allowing users to express their image retrieval intents flexibly and effectively. The key to CoIR lies in properly reasoning about the search intent behind the multi-modal query. Existing work either aligns the composite embedding of the multi-modal query with the target image embedding in the visual domain through late fusion, or converts all images into text descriptions and leverages large language models (LLMs) for textual semantic reasoning. However, such single-modality reasoning fails to capture users' ambiguous and uncertain intents comprehensively and interpretably, causing inconsistency between the retrieved results and the ground truth. Moreover, the expense of manually annotated datasets limits further performance improvements in CoIR.
To this end, this article proposes an LLM-enhanced Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model (IUDC), which combines the strengths of multi-modal late fusion and LLMs for CoIR. We first construct an LLM-based triplet augmentation strategy to generate additional synthetic training triplets. Building on this, the core of IUDC consists of two matching channels: the semantic matching channel performs intent reasoning over aspect-level attributes extracted by an LLM, while the visual matching channel performs fine-grained visual matching between the multi-modal fusion embedding and target images. To capture the intent uncertainty present in multi-modal queries, we introduce a Probability Distribution Encoder (PDE) that projects intents as probabilistic distributions in both matching channels. A mutual enhancement module is then designed to share knowledge between the visual and semantic representations for better representation learning. Finally, the matching scores of the two channels are summed to retrieve the target image. Extensive experiments on two real datasets demonstrate the effectiveness and superiority of our model. Notably, with the help of the proposed LLM-based triplet augmentation strategy, our model sets a new state-of-the-art record on both datasets.
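To make the dual-channel matching concrete, below is a minimal, illustrative PyTorch sketch of the scoring flow described in the abstract. It is not the authors' implementation: the module and function names (PDE, channel_score), the embedding dimensions, the cosine-similarity scoring, and the reparameterized Gaussian sampling are all assumptions inferred from the abstract (intents projected as probability distributions, with the final score formed by summing the semantic and visual channel scores).

```python
# Illustrative sketch only: hypothetical names and shapes, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDE(nn.Module):
    """Probability Distribution Encoder (assumed form): maps a point
    embedding to a Gaussian intent distribution and draws one sample."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)
        self.logvar_head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.mu_head(x), self.logvar_head(x)
        # Reparameterization trick keeps the sampling step differentiable.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def channel_score(query: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between sampled intents and candidate targets."""
    return F.normalize(query, dim=-1) @ F.normalize(targets, dim=-1).T

# Hypothetical pre-extracted features (e.g., from a vision-language encoder).
dim, n_candidates = 512, 1000
semantic_query = torch.randn(4, dim)    # LLM-derived aspect-level attributes
visual_query = torch.randn(4, dim)      # late fusion of reference image + text
semantic_targets = torch.randn(n_candidates, dim)
visual_targets = torch.randn(n_candidates, dim)

sem_intent = PDE(dim)(semantic_query)   # sampled semantic intent
vis_intent = PDE(dim)(visual_query)     # sampled visual intent

# Final retrieval score: sum of the two channels' matching scores.
scores = channel_score(sem_intent, semantic_targets) \
       + channel_score(vis_intent, visual_targets)
top5 = scores.topk(k=5, dim=-1).indices  # indices of top-5 candidate images
```

In the actual model, the two PDEs would be trained jointly with the mutual enhancement module; here they are independent and untrained, serving only to show how the two channels' scores combine into a single ranking.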



    Information

    Published In

    ACM Transactions on Information Systems, Volume 43, Issue 2
    March 2025
    898 pages
    EISSN: 1558-2868
    DOI: 10.1145/3703022

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 January 2025
    Online AM: 09 October 2024
    Accepted: 24 September 2024
    Revised: 22 June 2024
    Received: 03 February 2024
    Published in TOIS Volume 43, Issue 2


    Author Tags

    1. Image retrieval
    2. multi-modal retrieval
    3. intent uncertainty
    4. large language model

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Central Universities of China
