Abstract
Multi-label image classification (MLIC) is a practical yet challenging task in computer vision. Compared with traditional single-label image classification, MLIC must model not only the dependencies between images and labels but also the spatial relationships within images and the internal dependencies among labels. In this paper, we propose the Dual-Stream Classification Network (DSCN) for multi-label image classification. One branch captures additional spatial information by segmenting the image: a feature reconstruction layer based on a self-attention mechanism recovers the boundary information lost after segmentation, and a transformer encoder captures the dependencies between the image and the labels. The other branch enhances label semantics with multimodal features by using templates to extend categories into prompts, improving the reliability of the features; the CLIP model then provides multimodal association features between images and prompts. The final labels are produced by a weighted fusion of the outputs of the two branches. We evaluated our model on three popular datasets, MS-COCO 2014, VOC 2007, and NUS-WIDE, where DSCN outperforms state-of-the-art methods, demonstrating the effectiveness of our approach.
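To give a concrete picture of the dual-stream design and the weighted fusion described above, the sketch below shows one possible PyTorch realization. It is an illustration under stated assumptions, not the authors' implementation: the module names (DualStreamClassifier, visual_head), the feature dimensions, the mean pooling, and the fixed fusion weight alpha are illustrative choices, and the CLIP image and prompt embeddings are assumed to be precomputed.

```python
# Minimal sketch of the dual-stream idea (illustrative only; module names,
# dimensions, and the fixed fusion weight are assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualStreamClassifier(nn.Module):
    def __init__(self, num_classes: int, dim: int = 512, alpha: float = 0.5):
        super().__init__()
        # Branch 1 (assumed): patch/region features -> self-attention
        # "reconstruction" layer -> transformer encoder -> per-class logits.
        self.reconstruct = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.visual_head = nn.Linear(dim, num_classes)
        # Weight for combining the two branches (could also be learned).
        self.alpha = alpha

    def forward(self, patch_feats, clip_image_feat, clip_prompt_feats):
        # patch_feats:       (B, N, dim)  features of image regions/patches
        # clip_image_feat:   (B, dim)     CLIP image embedding (precomputed)
        # clip_prompt_feats: (C, dim)     CLIP embeddings of per-class prompts
        # Branch 1: recover cross-region (boundary) context, then encode.
        recon, _ = self.reconstruct(patch_feats, patch_feats, patch_feats)
        encoded = self.encoder(recon)                      # (B, N, dim)
        visual_logits = self.visual_head(encoded.mean(1))  # (B, C)

        # Branch 2: cosine similarity between image and prompt embeddings.
        img = F.normalize(clip_image_feat, dim=-1)
        txt = F.normalize(clip_prompt_feats, dim=-1)
        prompt_logits = img @ txt.t()                      # (B, C)

        # Weighted fusion of the two streams gives the final label scores.
        return self.alpha * visual_logits + (1 - self.alpha) * prompt_logits


# Example usage with random tensors (shapes are assumptions):
# model = DualStreamClassifier(num_classes=80)
# logits = model(torch.randn(4, 49, 512), torch.randn(4, 512), torch.randn(80, 512))
```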




Data availability and access
The datasets analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
The authors would like to thank the anonymous reviewers for providing helpful comments.
Funding
This study received no external funding.
Author information
Contributions
Liming Hu: Writing - original draft, Conceptualization, Methodology, Software, Visualization. Mingxuan Chen: Conceptualization, Methodology, Writing - review and editing. Anjie Wang: Writing - review and editing, Validation. Zhijun Fang: Validation, Supervision, Project administration.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, L., Chen, M., Wang, A. et al. Dual-stream multi-label image classification model enhanced by feature reconstruction. Multimedia Systems 30, 281 (2024). https://doi.org/10.1007/s00530-024-01493-8
DOI: https://doi.org/10.1007/s00530-024-01493-8