
Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers

Published: 18 February 2024

Abstract

Unsupervised cross-modal hashing (UCMH) has been widely explored to support large-scale cross-modal retrieval of unlabeled data. Despite promising progress, most existing approaches are built on convolutional neural network and multilayer perceptron architectures, which sacrifice hash-code quality because of their limited capacity for mining multi-modal semantics. To pursue better content understanding, we break this convention for UCMH and delve into a transformer-based paradigm. Unlike naïve adaptations that merely substitute the backbone and overlook the heterogeneous semantics transformers provide, we propose a multi-granularity learning framework called hugging to bridge the modality gap. Specifically, we first construct a fine-grained semantic space composed of a series of aggregated local embeddings that capture implicit attribute-level semantics. In the hash learning stage, we incorporate fine-grained alignment over these local embeddings to reinforce the global alignment of hash codes. Notably, this fine-grained alignment only facilitates robust cross-modal learning during training; it does not complicate global hash-code generation at test time, thus fully preserving the high efficiency of hash-based retrieval. To make the most of the fine-grained information, we further propose a differentiable optimized quantization algorithm and extend our framework to hugging+. This variant neatly integrates quantization learning into the fine-grained alignment during training, yielding quantization codes of the local embeddings as a free by-product at test time, which can boost retrieval performance through an efficient reranking stage. We instantiate simple baselines with contrastive learning objectives for hugging and hugging+, namely HuggingHash and HuggingHash+. Extensive experiments on four text-image and two text-video retrieval benchmarks show the competitive performance of HuggingHash and HuggingHash+ against state-of-the-art methods. More encouragingly, we also validate that hugging and hugging+ are flexible and effective across various baselines, suggesting their broad applicability in UCMH.
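To make the training recipe concrete, below is a minimal PyTorch sketch of a HuggingHash-style objective as the abstract describes it: a symmetric InfoNCE alignment between relaxed global hash codes of paired samples, combined with a fine-grained InfoNCE alignment over aggregated local embeddings. The function names, mean-pooling choice, and loss weighting are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings of shape (B, D).
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                            # (B, B) cross-modal similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def hugging_loss(img_global, txt_global, img_local, txt_local, lam=0.5):
    # img_global / txt_global: (B, K) pre-binarization hash logits.
    # img_local  / txt_local:  (B, M, D) aggregated local embeddings.
    # tanh relaxes the binary constraint during training; sign() is applied
    # only at test time, so retrieval stays purely hash-based.
    loss_global = info_nce(torch.tanh(img_global), torch.tanh(txt_global))
    # Mean pooling is one simple way to compare the fine-grained spaces;
    # the paper's exact aggregation may differ.
    loss_local = info_nce(img_local.mean(dim=1), txt_local.mean(dim=1))
    return loss_global + lam * loss_local

Note that the fine-grained branch contributes only a training signal: at test time the local embeddings are not needed to produce a global hash code, which is how the framework keeps hash-based retrieval efficient.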
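Likewise, the two-stage retrieval that hugging+ enables can be sketched as a coarse Hamming-space ranking over global hash codes, followed by reranking the short list with quantization codes of the local embeddings via an asymmetric distance lookup. The codebook shapes and the single-query interface below are assumptions for illustration, not the paper's exact quantizer.

import torch

def hamming_rank(query_bits, db_bits, topk=100):
    # query_bits: (K,), db_bits: (N, K), entries in {-1, +1}.
    # Hamming distance is affine in the inner product, so a matmul suffices.
    dist = (db_bits.size(1) - db_bits @ query_bits) / 2     # (N,)
    return dist.topk(topk, largest=False).indices           # (topk,) candidate ids

def adc_rerank(query_vec, db_codes, candidates, codebooks):
    # query_vec:  (D,) pooled local embedding of the query.
    # db_codes:   (N, S) integer codes, one sub-codeword index per sub-space.
    # codebooks:  (S, C, D // S) quantizer codebooks.
    S, C, d = codebooks.shape
    sub_queries = query_vec.view(S, d)                               # split query into sub-vectors
    tables = ((sub_queries[:, None, :] - codebooks) ** 2).sum(-1)    # (S, C) distance lookup table
    cand_codes = db_codes[candidates]                                # (topk, S)
    dist = tables.gather(1, cand_codes.t()).sum(0)                   # (topk,) asymmetric distances
    return candidates[dist.argsort()]                                # reranked candidate ids

In this design the Hamming stage keeps the usual cost profile of hashing, while the table-lookup reranking touches only the top-k short list, so the extra accuracy comes at a small and bounded overhead.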


Published In

International Journal of Computer Vision, Volume 132, Issue 8 (August 2024), 640 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 18 February 2024
Accepted: 16 January 2024
Received: 13 April 2023

Author Tags

  1. Unsupervised cross-modal hashing (UCMH)
  2. Transformers
  3. Image retrieval
  4. Video retrieval
  5. Vector of locally aggregated descriptors (VLAD)
  6. Optimized product quantization (OPQ)

Qualifiers

  • Research-article
