research-article

Graph Attention Transformer Network for Multi-label Image Classification

Published: 27 February 2023

Abstract

Multi-label classification aims to recognize multiple objects or attributes in an image. The key to this task is effectively characterizing inter-label correlations or dependencies, which has made graph neural networks the prevailing approach. However, current methods often use the co-occurrence probability of labels in the training set as the adjacency matrix to model this correlation; such a matrix is tightly bound to the dataset, which limits the model’s generalization ability. This article proposes the Graph Attention Transformer Network, a general framework for multi-label image classification that mines rich and effective label correlations. First, we use the cosine similarity of pre-trained label word embeddings as the initial correlation matrix, which can represent richer semantic information than the co-occurrence matrix. Subsequently, we propose a graph attention transformer layer that transfers this adjacency matrix to adapt it to the current domain. Extensive experiments demonstrate that the proposed method achieves highly competitive performance on three datasets.
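
The contrast between the two correlation matrices described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-dimensional toy vectors standing in for pre-trained word embeddings, and the three-image toy training set are all assumptions made for illustration, and the co-occurrence estimate used here, P(label j | label i), is one common choice rather than a confirmed detail of the article.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between label embedding vectors."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # leave similarity at 0.0 for a zero vector
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def cooccurrence_matrix(label_sets, num_labels):
    """Estimate P(label j | label i) from per-image label sets."""
    count = [0] * num_labels
    pair = [[0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        for i in labels:
            count[i] += 1
            for j in labels:
                if i != j:
                    pair[i][j] += 1
    return [
        [pair[i][j] / count[i] if count[i] else 0.0 for j in range(num_labels)]
        for i in range(num_labels)
    ]


# Hypothetical labels and toy vectors (NOT real pre-trained embeddings).
labels = ["dog", "cat", "car"]
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# Hypothetical training annotations: each set holds the label indices of one image.
train_label_sets = [{0, 1}, {0}, {2}]

similarity = cosine_similarity_matrix(toy_embeddings)
cooccurrence = cooccurrence_matrix(train_label_sets, len(labels))

for name, row_sim, row_co in zip(labels, similarity, cooccurrence):
    print(name, [round(v, 2) for v in row_sim], [round(v, 2) for v in row_co])
```

In this toy data "cat" and "car" never appear in the same training image, so their co-occurrence entry is zero, whereas the embedding-based matrix still assigns them a non-zero similarity. This is the sense in which a correlation matrix built from embeddings is less tied to the training set; the domain adaptation performed by the graph attention transformer layer is not sketched here.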



    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 4
    July 2023
    263 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3582888
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 February 2023
    Online AM: 29 December 2022
    Accepted: 18 December 2022
    Revised: 31 October 2022
    Received: 17 May 2022
    Published in TOMM Volume 19, Issue 4


    Author Tags

    1. Graph neural network
    2. transformer
    3. attention mechanism
    4. multi-label classification


