Dual-stream multi-label image classification model enhanced by feature reconstruction

  • Regular Paper
  • Multimedia Systems

Abstract

Multi-label image classification (MLIC) is a highly practical and challenging task in computer vision. Unlike traditional single-label image classification, MLIC must model not only the dependencies between images and labels but also the spatial relationships within images and the dependencies among the labels themselves. In this paper, we propose the Dual-Stream Classification Network (DSCN) for multi-label image classification. In one branch, we capture more spatial information by segmenting the image. A feature reconstruction layer based on a self-attention mechanism recovers the boundary information lost after segmentation, while a transformer encoder captures the dependency between the image and the labels. The other branch enhances label semantics with multimodal features: templates extend the categories into prompts, which improves the reliability of the features, and the CLIP model provides multimodal association features between images and prompts. The final labels of an image are generated by a weighted fusion of the results from the two branches. We tested our model on three popular datasets: MSCOCO2014, VOC2007 and NUS-WIDE. DSCN outperformed state-of-the-art methods, demonstrating the effectiveness of our approach.
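The two steps of the abstract that can be shown without the full network are the template-based prompt construction and the weighted fusion of the two branches. The following is a minimal sketch of those steps only, not the authors' implementation. The template wording, the convex-combination form of the fusion, the weight `alpha`, the decision threshold, and all function names are assumptions made for illustration; the abstract specifies none of them.

```python
from typing import List

# Hypothetical template: the abstract does not give the actual wording.
TEMPLATE = "a photo of a {}"


def build_prompts(categories: List[str], template: str = TEMPLATE) -> List[str]:
    """Extend each category name into a text prompt using a template."""
    return [template.format(name) for name in categories]


def fuse_scores(spatial: List[float], multimodal: List[float],
                alpha: float = 0.5) -> List[float]:
    """Weighted fusion of per-label scores from the two branches.

    alpha weights the spatial branch and (1 - alpha) the multimodal branch.
    Both the convex-combination form and the default alpha are assumptions.
    """
    if len(spatial) != len(multimodal):
        raise ValueError("both branches must score the same label set")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * s + (1.0 - alpha) * m for s, m in zip(spatial, multimodal)]


def predict_labels(categories: List[str], fused: List[float],
                   threshold: float = 0.5) -> List[str]:
    """Keep every category whose fused score reaches the (assumed) threshold."""
    return [c for c, p in zip(categories, fused) if p >= threshold]


# Toy per-label probabilities standing in for the outputs of the two branches.
categories = ["dog", "person", "car"]
prompts = build_prompts(categories)
fused = fuse_scores([0.9, 0.4, 0.1], [0.7, 0.8, 0.2], alpha=0.5)
labels = predict_labels(categories, fused)  # ['dog', 'person']
```

In this toy example "person" is below the threshold in the first branch alone but is kept after fusion because the second branch scores it highly, which is the kind of complementary behaviour a two-branch design aims for.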

Data availability and access

The datasets analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

The authors would like to thank the anonymous reviewers for providing helpful comments.

Funding

This study received no external funding.

Author information

Contributions

Liming Hu: Writing-Original draft preparation, Conceptualization, Methodology, Software, Visualization. Mingxuan Chen: Conceptualization, Methodology, Writing-Reviewing and Editing. Anjie Wang: Writing-Reviewing and Editing, Validation. Zhijun Fang: Validation, Supervision and Project administration.

Corresponding author

Correspondence to Zhijun Fang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, L., Chen, M., Wang, A. et al. Dual-stream multi-label image classification model enhanced by feature reconstruction. Multimedia Systems 30, 281 (2024). https://doi.org/10.1007/s00530-024-01493-8

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-024-01493-8
