
Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers

Published: 18 February 2024

Abstract

Unsupervised cross-modal hashing (UCMH) has been widely explored to support large-scale cross-modal retrieval of unlabeled data. Despite promising progress, most existing approaches are built on convolutional neural network and multilayer perceptron architectures, which sacrifice hash-code quality because of their limited capacity for mining multi-modal semantics. To pursue better content understanding, we break this convention for UCMH and delve into a transformer-based paradigm. Unlike naïve adaptations that merely substitute the backbone and overlook the heterogeneous semantics transformers provide, we propose a multi-granularity learning framework called hugging to bridge the modality gap. Specifically, we first construct a fine-grained semantic space composed of a series of aggregated local embeddings that capture implicit attribute-level semantics. In the hash learning stage, we incorporate fine-grained alignment over these local embeddings to reinforce the global alignment of hash codes. Notably, this fine-grained alignment only facilitates robust cross-modal learning during training; it does not complicate global hash-code generation at test time, thus fully preserving the high efficiency of hash-based retrieval. To make the most of the fine-grained information, we further propose a differentiable optimized quantization algorithm and extend our framework to hugging+. This variant neatly integrates quantization learning into the fine-grained alignment during training, yielding quantization codes of the local embeddings as a free by-product at test time, which can boost retrieval performance through an efficient reranking stage. We instantiate simple baselines with contrastive learning objectives for hugging and hugging+, namely HuggingHash and HuggingHash+. Extensive experiments on four text-image and two text-video retrieval benchmarks show the competitive performance of HuggingHash and HuggingHash+ against state-of-the-art methods. More encouragingly, we also validate that hugging and hugging+ are flexible and effective across various baselines, suggesting their broad applicability in UCMH.
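To make the training recipe concrete, below is a minimal PyTorch sketch of a HuggingHash-style objective as the abstract describes it: a symmetric InfoNCE alignment between relaxed global hash codes of paired samples, combined with a fine-grained InfoNCE alignment over aggregated local embeddings. The function names, mean-pooling choice, and loss weighting are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings of shape (B, D).
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                            # (B, B) cross-modal similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def hugging_loss(img_global, txt_global, img_local, txt_local, lam=0.5):
    # img_global / txt_global: (B, K) pre-binarization hash logits.
    # img_local  / txt_local:  (B, M, D) aggregated local embeddings.
    # tanh relaxes the binary constraint during training; sign() is applied
    # only at test time, so retrieval stays purely hash-based.
    loss_global = info_nce(torch.tanh(img_global), torch.tanh(txt_global))
    # Mean pooling is one simple way to compare the fine-grained spaces;
    # the paper's exact aggregation may differ.
    loss_local = info_nce(img_local.mean(dim=1), txt_local.mean(dim=1))
    return loss_global + lam * loss_local

Note that the fine-grained branch contributes only a training signal: at test time the local embeddings are not needed to produce a global hash code, which is how the framework keeps hash-based retrieval efficient.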
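Likewise, the two-stage retrieval that hugging+ enables can be sketched as a coarse Hamming-space ranking over global hash codes, followed by reranking the short list with quantization codes of the local embeddings via an asymmetric distance lookup. The codebook shapes and the single-query interface below are assumptions for illustration, not the paper's exact quantizer.

import torch

def hamming_rank(query_bits, db_bits, topk=100):
    # query_bits: (K,), db_bits: (N, K), entries in {-1, +1}.
    # Hamming distance is affine in the inner product, so a matmul suffices.
    dist = (db_bits.size(1) - db_bits @ query_bits) / 2     # (N,)
    return dist.topk(topk, largest=False).indices           # (topk,) candidate ids

def adc_rerank(query_vec, db_codes, candidates, codebooks):
    # query_vec:  (D,) pooled local embedding of the query.
    # db_codes:   (N, S) integer codes, one sub-codeword index per sub-space.
    # codebooks:  (S, C, D // S) quantizer codebooks.
    S, C, d = codebooks.shape
    sub_queries = query_vec.view(S, d)                               # split query into sub-vectors
    tables = ((sub_queries[:, None, :] - codebooks) ** 2).sum(-1)    # (S, C) distance lookup table
    cand_codes = db_codes[candidates]                                # (topk, S)
    dist = tables.gather(1, cand_codes.t()).sum(0)                   # (topk,) asymmetric distances
    return candidates[dist.argsort()]                                # reranked candidate ids

In this design the Hamming stage keeps the usual cost profile of hashing, while the table-lookup reranking touches only the top-k short list, so the extra accuracy comes at a small and bounded overhead.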


Published In

International Journal of Computer Vision, Volume 132, Issue 8 (August 2024), 640 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 18 February 2024
Accepted: 16 January 2024
Received: 13 April 2023

Author Tags

  1. Unsupervised cross-modal hashing (UCMH)
  2. Transformers
  3. Image retrieval
  4. Video retrieval
  5. Vector of locally aggregated descriptors (VLAD)
  6. Optimized product quantization (OPQ)

Qualifiers

  • Research-article
