DOI: 10.1145/3503161.3548187

Differentiable Cross-modal Hashing via Multimodal Transformers

Published: 10 October 2022

Abstract

Cross-modal hashing aims to project cross-modal content into a common Hamming space for efficient search. Most existing work first encodes the samples with a deep network and then binarizes the encoded features into hash codes. However, the relative location information in an image may be lost when it is encoded by a convolutional network, which makes it challenging to model the relationship between different modalities. Moreover, optimizing a model with the discrete sign function popularly used in existing solutions is NP-hard. To address these issues, we propose a differentiable cross-modal hashing method that uses a multimodal transformer as the backbone to capture the location information in an image while encoding the visual content. In addition, a novel selecting mechanism generates the binary code, so that hashing can be formulated as a continuous and easily optimized problem. Extensive experiments on several cross-modal datasets show that the proposed method outperforms many existing solutions.
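The abstract leaves the selecting mechanism unspecified, so the following is a minimal, hypothetical sketch of the general idea rather than the paper's formulation: each bit of the hash code is produced by a differentiable softmax selection between the two candidate values -1 and +1 (with a straight-through hard choice at retrieval time), and codes are compared via the inner-product form of the Hamming distance. All names here (SelectiveBinarizer, hamming_distance, the temperature tau) are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of hashing-by-selection (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveBinarizer(nn.Module):
    """Produces each bit by *selecting* between the candidates {-1, +1}
    with a softmax, so the whole pipeline stays differentiable."""
    def __init__(self, feat_dim: int, code_len: int, tau: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_len * 2)  # two logits per bit
        self.tau = tau                                  # softmax temperature
        self.register_buffer("candidates", torch.tensor([-1.0, 1.0]))

    def forward(self, feats: torch.Tensor, hard: bool = False) -> torch.Tensor:
        logits = self.proj(feats).view(feats.size(0), -1, 2)  # (B, L, 2)
        probs = F.softmax(logits / self.tau, dim=-1)          # selection weights
        soft_code = (probs * self.candidates).sum(-1)         # (B, L), in (-1, 1)
        if hard:
            # Discrete code at retrieval time; straight-through estimator
            # forwards the hard value but backpropagates through soft_code.
            hard_code = self.candidates[probs.argmax(-1)]
            return hard_code + (soft_code - soft_code.detach())
        return soft_code

def hamming_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise Hamming distances between {-1, +1} code matrices:
    for +/-1 vectors of length L, <a, b> = L - 2 * H(a, b)."""
    return (a.size(1) - a @ b.t()) / 2

# Toy cross-modal usage with transformer-encoded features (stand-ins here).
binarizer = SelectiveBinarizer(feat_dim=512, code_len=64)
image_feats = torch.randn(4, 512)  # e.g. multimodal-transformer image features
text_feats = torch.randn(4, 512)   # e.g. multimodal-transformer text features
image_codes = binarizer(image_feats, hard=True)
text_codes = binarizer(text_feats, hard=True)
print(hamming_distance(image_codes, text_codes))  # (4, 4) cross-modal distances
```

Because the selection is a softmax rather than a sign function, training reduces to ordinary gradient descent on a continuous objective, consistent with the "continuous and easily optimized problem" the abstract describes, though the paper's actual mechanism may differ.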

Supplementary Material

MP4 File (mm22-fp1814.mp4)
This is the video for the paper "Differentiable Cross-Modal Hashing via Multimodal Transformers", summarizing the proposed method and experimental results.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. hashing
  3. transformer

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Cited By

  • (2024) Deep Neighborhood-aware Proxy Hashing with Uniform Distribution Constraint for Cross-modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1-23. DOI: 10.1145/3643639. Online publication date: 8-Mar-2024.
  • (2024) Two-Step Discrete Hashing for Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 8730-8741. DOI: 10.1109/TMM.2024.3381828. Online publication date: 2024.
  • (2024) Deep Ranking Distribution Preserving Hashing for Robust Multi-Label Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 7027-7042. DOI: 10.1109/TMM.2024.3358995. Online publication date: 26-Jan-2024.
  • (2024) Deep Neighborhood-Preserving Hashing With Quadratic Spherical Mutual Information for Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 6361-6374. DOI: 10.1109/TMM.2023.3349075. Online publication date: 2024.
  • (2024) Learning to Agree on Vision Attention for Visual Commonsense Reasoning. IEEE Transactions on Multimedia 26, 1065-1075. DOI: 10.1109/TMM.2023.3275874. Online publication date: 2024.
  • (2024) Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval. IEEE Access 12, 31756-31770. DOI: 10.1109/ACCESS.2024.3371173. Online publication date: 2024.
  • (2024) Deep Hashing Similarity Learning for Cross-Modal Retrieval. IEEE Access 12, 8609-8618. DOI: 10.1109/ACCESS.2024.3352434. Online publication date: 2024.
  • (2024) Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers. International Journal of Computer Vision 132(8), 2765-2797. DOI: 10.1007/s11263-024-02009-7. Online publication date: 18-Feb-2024.
  • (2024) Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval. The Visual Computer. DOI: 10.1007/s00371-024-03496-y. Online publication date: 10-Jun-2024.
  • (2023) Graph Convolutional Incomplete Multi-modal Hashing. Proceedings of the 31st ACM International Conference on Multimedia, 7029-7037. DOI: 10.1145/3581783.3612282. Online publication date: 26-Oct-2023.
