DOI: 10.1145/3664647.3681629

CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning

Published: 28 October 2024

Abstract

Zero-shot learning (ZSL) enables the recognition of novel classes by transferring semantic knowledge from seen to unseen categories. This knowledge, typically encapsulated in attribute descriptions, helps identify class-specific visual features and thereby supports visual-semantic alignment, improving ZSL performance. However, real-world challenges such as distribution imbalance and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, variability in visual presentation within a category can skew attribute-category associations. In response, we propose CREST, a bidirectional cross-modal ZSL approach. It first extracts representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure the underlying epistemic uncertainty, strengthening the model's resilience against hard negatives. CREST incorporates dual learning pathways, covering both visual-category and attribute-category alignment, to ensure robust correlation between latent and observable spaces. Finally, an uncertainty-informed cross-modal fusion technique refines visual-attribute inference. Extensive experiments on multiple datasets demonstrate the model's effectiveness and distinctive explainability. Our code and data are available at: https://github.com/JethroJames/CREST
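To make the two core ideas in the abstract concrete, the sketch below shows (a) how Evidential Deep Learning typically turns non-negative per-class evidence into a Dirichlet distribution whose strength yields an epistemic-uncertainty score, and (b) a simple uncertainty-weighted fusion of two branch scores. This is a generic illustration of the EDL formulation (Sensoy et al.) and a hypothetical fusion rule, not CREST's exact implementation; the function names and the weighting scheme are our own assumptions.

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Standard EDL quantities from non-negative evidence (shape (K,)).

    alpha = evidence + 1 parameterizes a Dirichlet; the total strength S
    splits probability mass into per-class belief and residual uncertainty.
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0           # Dirichlet parameters
    S = alpha.sum()                  # Dirichlet strength
    belief = evidence / S            # per-class belief mass
    u = evidence.shape[0] / S        # epistemic uncertainty, K / S in (0, 1]
    return belief, u                 # belief.sum() + u == 1 by construction

def uncertainty_weighted_fusion(scores_v, u_v, scores_a, u_a):
    """Hypothetical fusion: trust the branch with lower uncertainty more.

    scores_v / scores_a: class scores from the visual and attribute branches;
    u_v / u_a: their EDL uncertainties. A simple convex combination weighted
    by (1 - u) stands in for CREST's uncertainty-informed fusion.
    """
    w_v, w_a = 1.0 - u_v, 1.0 - u_a
    return (w_v * np.asarray(scores_v) + w_a * np.asarray(scores_a)) / (w_v + w_a)
```

For example, evidence [4, 0, 0] gives alpha = [5, 1, 1], strength S = 7, belief [4/7, 0, 0], and uncertainty 3/7; scaling the evidence up by 10x drives the uncertainty toward zero, which is the behavior that lets the model down-weight hard negatives.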

Cited By

  • (2025) UA-FER: Uncertainty-aware representation learning for facial expression recognition. Neurocomputing 621, 129261. DOI: 10.1016/j.neucom.2024.129261. Online publication date: Mar 2025.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. contrastive learning
    2. evidential deep learning
    3. multimodality
    4. zero-shot learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%).
    Overall acceptance rate: 2,145 of 8,556 submissions (25%).

    Article Metrics

    • Downloads (last 12 months): 123
    • Downloads (last 6 weeks): 39
    Reflects downloads up to 25 Feb 2025.
