DOI: 10.1145/3503161.3548187

Differentiable Cross-modal Hashing via Multimodal Transformers

Published: 10 October 2022

Abstract

Cross-modal hashing aims to project cross-modal content into a common Hamming space for efficient search. Most existing work first encodes the samples with a deep network and then binarizes the encoded features into hash codes. However, the relative location information in an image may be lost when it is encoded by a convolutional network, which makes it challenging to model the relationship between different modalities. Moreover, optimizing a model with the discrete sign function popularly used in existing solutions is NP-hard. To address these issues, we propose a differentiable cross-modal hashing method that uses a multimodal transformer as the backbone to capture the location information in an image while encoding the visual content. In addition, a novel selecting mechanism generates the binary code, so that hashing can be formulated as a continuous and easily optimized problem. Extensive experiments on several cross-modal datasets show that the proposed method outperforms many existing solutions.
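The abstract leaves the selecting mechanism unspecified, so the following is a minimal, hypothetical sketch of the general idea rather than the paper's formulation: each bit of the hash code is produced by a differentiable softmax selection between the two candidate values -1 and +1 (with a straight-through hard choice at retrieval time), and codes are compared via the inner-product form of the Hamming distance. All names here (SelectiveBinarizer, hamming_distance, the temperature tau) are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of hashing-by-selection (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveBinarizer(nn.Module):
    """Produces each bit by *selecting* between the candidates {-1, +1}
    with a softmax, so the whole pipeline stays differentiable."""
    def __init__(self, feat_dim: int, code_len: int, tau: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_len * 2)  # two logits per bit
        self.tau = tau                                  # softmax temperature
        self.register_buffer("candidates", torch.tensor([-1.0, 1.0]))

    def forward(self, feats: torch.Tensor, hard: bool = False) -> torch.Tensor:
        logits = self.proj(feats).view(feats.size(0), -1, 2)  # (B, L, 2)
        probs = F.softmax(logits / self.tau, dim=-1)          # selection weights
        soft_code = (probs * self.candidates).sum(-1)         # (B, L), in (-1, 1)
        if hard:
            # Discrete code at retrieval time; straight-through estimator
            # forwards the hard value but backpropagates through soft_code.
            hard_code = self.candidates[probs.argmax(-1)]
            return hard_code + (soft_code - soft_code.detach())
        return soft_code

def hamming_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise Hamming distances between {-1, +1} code matrices:
    for +/-1 vectors of length L, <a, b> = L - 2 * H(a, b)."""
    return (a.size(1) - a @ b.t()) / 2

# Toy cross-modal usage with transformer-encoded features (stand-ins here).
binarizer = SelectiveBinarizer(feat_dim=512, code_len=64)
image_feats = torch.randn(4, 512)  # e.g. multimodal-transformer image features
text_feats = torch.randn(4, 512)   # e.g. multimodal-transformer text features
image_codes = binarizer(image_feats, hard=True)
text_codes = binarizer(text_feats, hard=True)
print(hamming_distance(image_codes, text_codes))  # (4, 4) cross-modal distances
```

Because the selection is a softmax rather than a sign function, training reduces to ordinary gradient descent on a continuous objective, consistent with the "continuous and easily optimized problem" the abstract describes, though the paper's actual mechanism may differ.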

Supplementary Material

MP4 File (mm22-fp1814.mp4)
This is the video for the paper "Differentiable Cross-Modal Hashing via Multimodal Transformers", summarizing the proposed method and experimental results.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. hashing
  3. transformer

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Cited By

  • (2024) Deep Neighborhood-aware Proxy Hashing with Uniform Distribution Constraint for Cross-modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1-23. DOI: 10.1145/3643639. Online publication date: 8-Mar-2024.
  • (2024) Two-Step Discrete Hashing for Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 8730-8741. DOI: 10.1109/TMM.2024.3381828. Online publication date: 2024.
  • (2024) Deep Ranking Distribution Preserving Hashing for Robust Multi-Label Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 7027-7042. DOI: 10.1109/TMM.2024.3358995. Online publication date: 26-Jan-2024.
  • (2024) Deep Neighborhood-Preserving Hashing With Quadratic Spherical Mutual Information for Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 6361-6374. DOI: 10.1109/TMM.2023.3349075. Online publication date: 2024.
  • (2024) Learning to Agree on Vision Attention for Visual Commonsense Reasoning. IEEE Transactions on Multimedia 26, 1065-1075. DOI: 10.1109/TMM.2023.3275874. Online publication date: 2024.
  • (2024) Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval. IEEE Access 12, 31756-31770. DOI: 10.1109/ACCESS.2024.3371173. Online publication date: 2024.
  • (2024) Deep Hashing Similarity Learning for Cross-Modal Retrieval. IEEE Access 12, 8609-8618. DOI: 10.1109/ACCESS.2024.3352434. Online publication date: 2024.
  • (2024) Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers. International Journal of Computer Vision 132(8), 2765-2797. DOI: 10.1007/s11263-024-02009-7. Online publication date: 18-Feb-2024.
  • (2024) Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval. The Visual Computer. DOI: 10.1007/s00371-024-03496-y. Online publication date: 10-Jun-2024.
  • (2023) Graph Convolutional Incomplete Multi-modal Hashing. Proceedings of the 31st ACM International Conference on Multimedia, 7029-7037. DOI: 10.1145/3581783.3612282. Online publication date: 26-Oct-2023.
