DOI: 10.1145/3581783.3612652
Research article

DPNET: Dynamic Poly-attention Network for Trustworthy Multi-modal Classification

Published: 27 October 2023

Abstract

With advances in sensing technology, multi-modal data collected from different sources are increasingly available. Multi-modal classification aims to integrate complementary information across modalities to improve classification performance. However, existing multi-modal classification methods generally struggle to integrate global structural information and to provide trustworthy multi-modal fusion, which matters especially in safety-sensitive applications such as medical diagnosis. In this paper, we propose a novel Dynamic Poly-attention Network (DPNET) for trustworthy multi-modal classification. DPNET has four merits: (i) To capture intrinsic modality-specific structural information, we design a structure-aware feature aggregation module that learns a structure-preserving, globally compact feature representation for each modality. (ii) A transparent fusion strategy based on modality confidence estimation tracks information variation within each modality to enable dynamic fusion. (iii) To make multi-modal fusion more effective and efficient, we introduce a cross-modal low-rank fusion module that reduces the complexity of tensor-based fusion and emphasizes informative rank-wise features via a rank attention mechanism. (iv) A label confidence estimation module drives the network to produce more credible confidence scores, and an intra-class attention loss supervises network training. Extensive experiments on four real-world multi-modal biomedical datasets demonstrate that the proposed method achieves competitive performance compared with state-of-the-art approaches.
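To make merit (iii) concrete, below is a minimal illustrative sketch, not the authors' released code. It follows the general low-rank tensor fusion idea of Liu et al. (2018): each modality's feature vector is projected by a rank-R factor, the per-modality projections are combined by element-wise product, and the rank dimension is then collapsed. The attention over the rank dimension (a learned softmax replacing fixed rank weights) is one plausible reading of the paper's "rank attention mechanism"; the module name and all hyper-parameters here are assumptions for illustration.

import torch
import torch.nn as nn

class LowRankFusionWithRankAttention(nn.Module):
    """Illustrative sketch (not the authors' code): low-rank multimodal
    fusion in the style of Liu et al. (2018), with fixed rank weights
    replaced by learned attention over the rank dimension."""

    def __init__(self, input_dims, output_dim, rank=4):
        super().__init__()
        # One rank-R factor per modality; the extra input row models a bias.
        self.factors = nn.ParameterList(
            [nn.Parameter(0.05 * torch.randn(rank, d + 1, output_dim))
             for d in input_dims]
        )
        # Scores each rank-wise slice of the fused tensor (hypothetical design).
        self.rank_attn = nn.Linear(output_dim, 1)

    def forward(self, feats):
        # feats: list of (batch, d_m) tensors, one per modality.
        batch = feats[0].size(0)
        fused = None
        for z, w in zip(feats, self.factors):
            ones = torch.ones(batch, 1, device=z.device)
            z_aug = torch.cat([z, ones], dim=1)             # (batch, d_m + 1)
            proj = torch.einsum('bd,rdo->bro', z_aug, w)    # (batch, rank, out)
            # Element-wise product fuses modalities without materializing the
            # full outer-product tensor, keeping complexity linear in the rank.
            fused = proj if fused is None else fused * proj
        # Rank attention: weight each rank-wise slice, then sum over ranks.
        scores = torch.softmax(self.rank_attn(fused), dim=1)  # (batch, rank, 1)
        return (scores * fused).sum(dim=1)                    # (batch, out)

# Example: fuse three hypothetical biomedical modalities into a 64-d vector.
# fusion = LowRankFusionWithRankAttention([200, 200, 300], output_dim=64)
# out = fusion([torch.randn(8, 200), torch.randn(8, 200), torch.randn(8, 300)])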


Cited By

  • Building Trust in Decision with Conformalized Multi-view Deep Classification. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7278-7287. https://doi.org/10.1145/3664647.3681297
  • Heterogeneous Graph Guided Contrastive Learning for Spatially Resolved Transcriptomics Data. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 8287-8295. https://doi.org/10.1145/3664647.3680941
  • DAI-Net: Dual Adaptive Interaction Network for Coordinated Medication Recommendation. IEEE Journal of Biomedical and Health Informatics 28, 10 (2024), 6201-6211. https://doi.org/10.1109/JBHI.2024.3425833

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. attention
  2. confidence estimation
  3. cross-modal low-rank fusion
  4. dynamical fusion
  5. trustworthy multi-modal classification

Qualifiers

  • Research-article

Funding Sources

  • the National Science Foundation of China under Grant
  • the National Key R&D Program of China under Grant

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 210
  • Downloads (last 6 weeks): 25
Reflects downloads up to 14 Jan 2025.

