Abstract
Multi-label image classification (MLIC) is a practical yet challenging task in computer vision. Compared with traditional single-label image classification, MLIC must model not only the dependencies between images and labels but also the spatial relationships within images and the internal dependencies among labels. In this paper, we propose the Dual-Stream Classification Network (DSCN) for multi-label image classification. One branch captures additional spatial information by segmenting the image: a feature reconstruction layer based on a self-attention mechanism recovers the boundary information lost after segmentation, and a transformer encoder captures the dependencies between the image and the labels. The other branch enhances label semantics with multimodal features by using templates to extend categories into prompts, improving the reliability of the features; the CLIP model then provides multimodal association features between images and prompts. The final labels are produced by a weighted fusion of the outputs of the two branches. We evaluated our model on three popular datasets, MS-COCO 2014, VOC 2007, and NUS-WIDE, where DSCN outperforms state-of-the-art methods, demonstrating the effectiveness of our approach.
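To give a concrete picture of the dual-stream design and the weighted fusion described above, the sketch below shows one possible PyTorch realization. It is an illustration under stated assumptions, not the authors' implementation: the module names (DualStreamClassifier, visual_head), the feature dimensions, the mean pooling, and the fixed fusion weight alpha are illustrative choices, and the CLIP image and prompt embeddings are assumed to be precomputed.

```python
# Minimal sketch of the dual-stream idea (illustrative only; module names,
# dimensions, and the fixed fusion weight are assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualStreamClassifier(nn.Module):
    def __init__(self, num_classes: int, dim: int = 512, alpha: float = 0.5):
        super().__init__()
        # Branch 1 (assumed): patch/region features -> self-attention
        # "reconstruction" layer -> transformer encoder -> per-class logits.
        self.reconstruct = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.visual_head = nn.Linear(dim, num_classes)
        # Weight for combining the two branches (could also be learned).
        self.alpha = alpha

    def forward(self, patch_feats, clip_image_feat, clip_prompt_feats):
        # patch_feats:       (B, N, dim)  features of image regions/patches
        # clip_image_feat:   (B, dim)     CLIP image embedding (precomputed)
        # clip_prompt_feats: (C, dim)     CLIP embeddings of per-class prompts
        # Branch 1: recover cross-region (boundary) context, then encode.
        recon, _ = self.reconstruct(patch_feats, patch_feats, patch_feats)
        encoded = self.encoder(recon)                      # (B, N, dim)
        visual_logits = self.visual_head(encoded.mean(1))  # (B, C)

        # Branch 2: cosine similarity between image and prompt embeddings.
        img = F.normalize(clip_image_feat, dim=-1)
        txt = F.normalize(clip_prompt_feats, dim=-1)
        prompt_logits = img @ txt.t()                      # (B, C)

        # Weighted fusion of the two streams gives the final label scores.
        return self.alpha * visual_logits + (1 - self.alpha) * prompt_logits


# Example usage with random tensors (shapes are assumptions):
# model = DualStreamClassifier(num_classes=80)
# logits = model(torch.randn(4, 49, 512), torch.randn(4, 512), torch.randn(80, 512))
```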




Data availability and access
The datasets analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
The authors would like to thank the anonymous reviewers for providing helpful comments.
Funding
This study received no external funding.
Author information
Contributions
Liming Hu: Writing - original draft, Conceptualization, Methodology, Software, Visualization. Mingxuan Chen: Conceptualization, Methodology, Writing - review and editing. Anjie Wang: Writing - review and editing, Validation. Zhijun Fang: Validation, Supervision, Project administration.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, L., Chen, M., Wang, A. et al. Dual-stream multi-label image classification model enhanced by feature reconstruction. Multimedia Systems 30, 281 (2024). https://doi.org/10.1007/s00530-024-01493-8
DOI: https://doi.org/10.1007/s00530-024-01493-8