Graph Attention Transformer Network for Multi-label Image Classification

Authors:

Yong Rui

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 19, Issue 4

Article No.: 150, Pages 1 - 16

https://doi.org/10.1145/3578518

Published: 27 February 2023

Abstract

Multi-label classification aims to recognize multiple objects or attributes in an image. The key to this task is to characterize inter-label correlations or dependencies effectively, which has made graph neural networks a prevailing approach. However, current methods often model this correlation with an adjacency matrix built from label co-occurrence probabilities in the training set; this ties the correlation to the dataset and hurts the model’s generalization ability. This article proposes the Graph Attention Transformer Network, a general framework for multi-label image classification that mines rich and effective label correlations. First, we use the cosine similarity of pre-trained label word embeddings as the initial correlation matrix, which can represent richer semantic information than the co-occurrence matrix. Subsequently, we propose a graph attention transformer layer that transfers this adjacency matrix to adapt it to the current domain. Extensive experiments demonstrate that the proposed method achieves highly competitive performance on three datasets.
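
The contrast between the two initial correlation matrices can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not state the exact co-occurrence formula or the embedding model, so the conditional-probability form, the function names `cooccurrence_adjacency` and `cosine_adjacency`, and the toy data below are all assumptions made for illustration.

```python
import numpy as np


def cooccurrence_adjacency(labels: np.ndarray) -> np.ndarray:
    """Assumed form: entry [i, j] estimates P(label j | label i)
    from a binary (num_images, num_labels) training label matrix."""
    labels = labels.astype(np.float64)
    counts = labels.T @ labels            # counts[i, j] = images containing both i and j
    occurrences = np.diag(counts).copy()  # occurrences[i] = images containing label i
    occurrences[occurrences == 0] = 1.0   # avoid division by zero for unseen labels
    return counts / occurrences[:, None]


def cosine_adjacency(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between (num_labels, dim) label word embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Toy data, made up for illustration: 4 training images, 3 labels.
toy_labels = np.array([[1, 1, 0],
                       [1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 1]])

# Hypothetical 2-D "word embeddings"; a real setup would load pre-trained vectors.
toy_embeddings = np.array([[1.0, 0.0],
                           [1.0, 1.0],
                           [0.0, 2.0]])

A_cooc = cooccurrence_adjacency(toy_labels)
A_cos = cosine_adjacency(toy_embeddings)

print(np.round(A_cooc, 3))
print(np.round(A_cos, 3))
```

In this toy example, labels 1 and 2 never appear together in the training images, so their co-occurrence entry is zero, while their embeddings still yield a nonzero cosine similarity. This is the kind of dataset-independent signal the abstract attributes to the embedding-based matrix.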

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 19, Issue 4

July 2023

263 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3582888

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 February 2023

Online AM: 29 December 2022

Accepted: 18 December 2022

Revised: 31 October 2022

Received: 17 May 2022

Published in TOMM Volume 19, Issue 4

Qualifiers

Research-article
