Abstract
Recently, computer vision foundation models such as CLIP and ALIGN have shown impressive generalization capabilities on various downstream tasks, but their ability to handle long-tailed data remains to be proven. In this work, we present a novel framework based on pre-trained visual-linguistic models for long-tailed recognition (LTR), termed VL-LTR, and conduct empirical studies on the benefits of introducing the text modality into long-tailed recognition tasks. Compared to existing approaches, the proposed VL-LTR has the following merits: (1) it learns not only visual representations from images but also corresponding linguistic representations from noisy class-level text descriptions collected from the Internet; (2) it effectively uses the learned visual-linguistic representations to improve visual recognition performance, especially for classes with few image samples. We also conduct extensive experiments and set new state-of-the-art performance on widely used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which outperforms the previous best method by over 17 points and is close to the prevailing performance obtained by training on the full ImageNet dataset. Code is available at https://github.com/ChangyaoTian/VL-LTR.
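To make the class-wise alignment idea concrete, below is a minimal PyTorch-style sketch of contrastive alignment between image features and class-level text features. It is an illustrative approximation under our own assumptions (hypothetical encoder outputs and a single pooled text embedding per class), not the paper's actual implementation:

import torch
import torch.nn.functional as F

def class_wise_alignment_loss(image_feats: torch.Tensor,
                              class_text_feats: torch.Tensor,
                              labels: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Contrast each image against all class-level text embeddings.

    image_feats:      (B, D) embeddings from a visual encoder (hypothetical).
    class_text_feats: (C, D) one pooled embedding per class, e.g. averaged
                      over noisy Internet descriptions of that class (assumed).
    labels:           (B,) ground-truth class index for each image.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    # Similarity of every image to every class's text representation.
    logits = image_feats @ class_text_feats.t() / temperature  # (B, C)
    # Pull each image toward its own class description, push away the rest.
    return F.cross_entropy(logits, labels)

At inference time, the same image-to-class-text similarities can serve as classification logits (or be fused with a conventional visual classifier head), which is where the text modality can help classes with few training images.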
C. Tian et al.—Contributed equally.
C. Tian—This work was done when Changyao Tian was an intern at SenseTime Research.