Abstract
Cross-modal retrieval essentially extracts the shared semantics of an object across two different modalities. However, the "modality gap" can significantly limit performance when individual modality samples are analyzed directly. In this paper, to overcome the heterogeneity of multi-modal data, we propose a novel mutual information-based disentanglement framework that captures the precise shared semantics in cross-modal scenes. First, we design a disentanglement framework that extracts the shared parts of the modalities, providing the basis for measuring semantics with mutual information. Second, we measure semantic associations from the perspective of distributions, which overcomes the perturbations introduced by the modality gap. Finally, we formalize our framework and theoretically prove that mutual information achieves remarkable performance under the disentanglement framework. Extensive experimental results on two large benchmarks demonstrate that our approach achieves significant gains in cross-modal retrieval tasks.
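The pipeline the abstract describes, disentangling shared representations and then scoring their association with a mutual information estimate, can be made concrete with a MINE-style neural estimator (Belghazi et al., 2018). The sketch below is an illustrative assumption, not the authors' released code: the network shapes, the 512-dimensional embeddings, and the names `StatisticsNetwork` and `mine_lower_bound` are all hypothetical.

```python
# A minimal sketch (NOT the authors' implementation) of estimating mutual
# information between paired image/text embeddings with a MINE-style
# statistics network. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn


class StatisticsNetwork(nn.Module):
    """Scores joint vs. marginal pairs of image/text embeddings."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([img, txt], dim=-1)).squeeze(-1)


def mine_lower_bound(T: StatisticsNetwork,
                     img: torch.Tensor,
                     txt: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan lower bound on I(img; txt).

    Joint samples are the aligned (img_i, txt_i) pairs; marginal samples
    are formed by shuffling the text embeddings within the batch.
    """
    joint = T(img, txt).mean()
    txt_shuffled = txt[torch.randperm(txt.size(0))]
    # log of the mean of exp(T) over marginal pairs, computed stably.
    marginal = torch.logsumexp(T(img, txt_shuffled), dim=0) \
        - torch.log(torch.tensor(float(txt.size(0))))
    return joint - marginal  # maximize this to tighten the MI estimate


# Usage sketch with placeholder shared-part features for one batch:
T = StatisticsNetwork(dim=512)
img_shared = torch.randn(32, 512)  # hypothetical shared image features
txt_shared = torch.randn(32, 512)  # hypothetical shared text features
mi_estimate = mine_lower_bound(T, img_shared, txt_shared)
loss = -mi_estimate  # ascend the MI lower bound during training
```

In such a scheme, maximizing the Donsker-Varadhan bound tightens the mutual information estimate between the shared parts of the two modalities; a training loop would minimize `-mi_estimate` alongside the retrieval objective.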
Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) (61972455), and the Joint Project of Bayescom. Xiaowang Zhang is supported by the program of Peiyang Young Scholars in Tianjin University (2019XRX-0032).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, H. et al. (2021). A Mutual Information-Based Disentanglement Framework for Cross-Modal Retrieval. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol 13111. Springer, Cham. https://doi.org/10.1007/978-3-030-92273-3_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92272-6
Online ISBN: 978-3-030-92273-3
eBook Packages: Computer Science, Computer Science (R0)