
Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2024 (ICANN 2024)

Abstract

Image classification models often perform unstably in real-world applications because the visual content of images varies with the viewpoint of the subject and with lighting conditions. To mitigate these challenges, existing studies commonly incorporate additional modal information matched to the visual data to regularize the model's learning process, enabling the extraction of high-quality visual features from complex image regions. In multimodal learning, cross-modal alignment is recognized as an effective strategy: it harmonizes different modalities by learning a domain-consistent latent feature space for visual and semantic features. However, this approach can be limited by the heterogeneity of multimodal information, such as differences in feature distribution and structure. To address this issue, we introduce the Multimodal Alignment and Reconstruction Network (MARNet), designed to enhance the model's robustness to visual noise. Importantly, MARNet includes a cross-modal diffusion reconstruction module that blends information across domains smoothly and stably. Experiments on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of the image features extracted by the model. MARNet is a plug-and-play framework that can be rapidly integrated into various image classification frameworks, boosting their performance.
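
To make the two mechanisms named in the abstract concrete, the sketch below illustrates them with standard formulations: a symmetric InfoNCE loss for aligning visual and semantic features in a shared latent space, and a DDPM-style noising/denoising training step (Ho et al., 2020) standing in for the cross-modal diffusion reconstruction module. This is a minimal sketch assuming PyTorch; the function names, the denoiser callable, and all hyperparameters are hypothetical illustrations, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def infonce_alignment_loss(visual, semantic, temperature=0.07):
    # Symmetric InfoNCE objective: paired visual/semantic embeddings
    # (row i of each batch) are pulled together; all other pairs are
    # pushed apart. A standard cross-modal alignment loss; MARNet's
    # exact formulation may differ.
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / temperature                   # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def diffusion_reconstruction_loss(x0_visual, cond_semantic, denoiser, T=1000):
    # One DDPM-style training step: corrupt the visual feature with
    # Gaussian noise at a random timestep, then ask a denoiser
    # conditioned on the semantic feature to predict that noise.
    # `denoiser` is a hypothetical network taking (x_t, t, condition).
    betas = torch.linspace(1e-4, 0.02, T, device=x0_visual.device)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0_visual.size(0),), device=x0_visual.device)
    a_bar = alpha_bars[t].unsqueeze(-1)                # (B, 1) noise schedule
    noise = torch.randn_like(x0_visual)
    x_t = a_bar.sqrt() * x0_visual + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(denoiser(x_t, t, cond_semantic), noise)

In training, the two losses would typically be combined with the classification objective, e.g. loss = cls_loss + w1 * alignment + w2 * reconstruction, with the weights tuned per dataset.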



Acknowledgments

This work is supported in part by the Overseas Innovation Team Project of the "20 Regulations for New Universities" funding program of Jinan (Grant no. 2021GXRC073).

Author information


Corresponding author

Correspondence to Lei Meng.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zheng, Y. et al. (2024). Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15021. Springer, Cham. https://doi.org/10.1007/978-3-031-72347-6_8


  • DOI: https://doi.org/10.1007/978-3-031-72347-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72346-9

  • Online ISBN: 978-3-031-72347-6

  • eBook Packages: Computer Science, Computer Science (R0)
