Abstract
In this paper, we develop an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection module, and a cross-modality decoder. We first pre-train Grounding DINO on large-scale datasets, including object detection data, grounding data, and caption data, and then evaluate the model on both open-set object detection and referring object detection benchmarks. Grounding DINO performs remarkably well in all settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO zero-shot detection benchmark (in this paper, ‘zero-shot’ refers to scenarios where the training split of the test dataset is not used during training) and sets a new record on the ODinW zero-shot benchmark with a mean AP of 26.1. We release some checkpoints and inference code at https://github.com/IDEA-Research/GroundingDINO.
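To make the fusion pipeline concrete, below is a minimal PyTorch sketch of the language-guided query selection step mentioned in the abstract. It is an illustration under assumed tensor shapes and hypothetical function names, not the released implementation: each image token is scored by its highest similarity to any text token, and the top-scoring tokens are selected to initialize the decoder queries.

```python
# Conceptual sketch of language-guided query selection (not the official code).
import torch

def language_guided_query_selection(image_feats, text_feats, num_queries=900):
    """Select image-token positions used to initialize decoder queries.

    image_feats: (num_image_tokens, d) image features after the feature enhancer
    text_feats:  (num_text_tokens, d)  text features after the feature enhancer
    """
    # Similarity between every image token and every text token.
    logits = image_feats @ text_feats.t()        # (num_image_tokens, num_text_tokens)
    # Score each image token by its best-matching text token.
    token_scores = logits.max(dim=-1).values     # (num_image_tokens,)
    # Keep the top-k image tokens as query initializations.
    return torch.topk(token_scores, k=num_queries).indices

# Toy usage with random features; real features come from the feature enhancer.
img = torch.randn(10000, 256)
txt = torch.randn(12, 256)
query_idx = language_guided_query_selection(img, txt, num_queries=900)
print(query_idx.shape)  # torch.Size([900])
```

In the paper, the selected positions initialize only the positional part of the decoder queries while the content queries remain learnable; the sketch above illustrates the selection step itself.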
This work was done when Shilong Liu, Feng Li, Hao Zhang, Jie Yang, and Qing Jiang were interns at IDEA.
Notes
- 1.
We view open-set object detection, open-world object detection, and open-vocabulary object detection as the same task in this paper. To avoid confusion, we always use the term open-set object detection.
- 2.
We use the terms Referring Expression Comprehension (REC) and Referring (Object) Detection interchangeably in this paper.
- 3.
The mapping between O365 and COCO categories is not exact; we made some approximations during evaluation.
- 4.
We used the officially released code and checkpoints at https://github.com/microsoft/GLIP.
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Computer Vision and Pattern Recognition (2017)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4974–4983 (2019)
Chen, Q., et al.: Group DETR: fast DETR training with group-wise one-to-many assignment (2022)
Dai, X., et al.: Dynamic head: unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7373–7382 (2021)
Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: TransVG: end-to-end visual grounding with transformers. arXiv preprint (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dong, N., Zhang, Y., Ding, M., Lee, G.H.: Boosting long-tailed object detection via step-wise learning on smooth-tail data (2023). https://arxiv.org/abs/2305.12833
Du, Y., Fu, Z., Liu, Q., Wang, Y.: Visual grounding with transformers (2021)
Gan, Z., Chen, Y.C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. In: Neural Information Processing Systems (2020)
Gao, P., et al.: Clip-adapter: better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021)
Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of DETR with spatially modulated co-attention. arXiv preprint arXiv:2101.07448 (2021)
Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint (2021)
Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5356–5364 (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Jia, D., et al.: DETRs with hybrid matching (2022)
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR-modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790 (2021)
Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset: https://github.com/openimages (2017)
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. (2017)
Kuo, W., Bertsch, F., Li, W., Piergiovanni, A., Saffar, M., Angelova, A.: Findit: generalized localization with natural language queries (2022)
Kuznetsova, A., et al.: The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint (2018)
Li, C., et al.: Elevater: a benchmark and toolkit for evaluating language-augmented visual models (2022)
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: DN-DETR: accelerate DETR training by introducing query denoising. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13619–13627 (2022)
Li, L.H., et al.: Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021)
Li, M., Sigal, L.: Referring transformer: a one-step approach to multi-task visual grounding. arXiv preprint (2021)
Li, Y., et al.: GLIGEN: open-set grounded text-to-image generation. In: CVPR (2023)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. In: International Conference on Computer Vision (2017)
Liu, S., et al.: DAB-DETR: dynamic anchor boxes are better queries for DETR. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=oMI9PjOb9Jl
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Meng, D., et al.: Conditional DETR for fast training convergence. arXiv preprint arXiv:2108.06152 (2021)
Miao, P., Su, W., Wang, L., Fu, Y., Li, X.: Referring expression comprehension via cross-level multi-modal fusion. arXiv preprint arXiv:2204.09957 (2022)
Minderer, M., et al.: Simple open-vocabulary object detection with vision transformers (2022)
Ordonez, V., Kulkarni, G., Berg, T.L.: Im2text: describing images using 1 million captioned photographs. In: Neural Information Processing Systems (2011)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. Int. J. Comput. Vis. (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Meeting of the Association for Computational Linguistics (2015)
Shao, S., et al.: Objects365: a large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8430–8439 (2019)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Meeting of the Association for Computational Linguistics (2018)
Liu, S., et al.: DQ-DETR: dual query detection transformer for phrase extraction and grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence (2023)
Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (2019)
Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM (2016)
Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor DETR: query design for transformer-based detector. In: National Conference on Artificial Intelligence (2021)
Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. arXiv preprint arXiv:2106.09018 (2021)
Yao, L., et al.: DetCLIPv2: scalable open-vocabulary object detection pre-training via word-region alignment (2023)
Yao, L., et al.: Detclip: dictionary-enriched visual-concept paralleled pre-training for open-world detection (2022)
Yu, L., et al.: Mattnet: modular attention network for referring expression comprehension. In: Computer Vision and Pattern Recognition (2018)
Yuan, L., et al.: Florence: a new foundation model for computer vision (2022)
Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Open-vocabulary DETR with conditional matching (2022)
Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14393–14402 (2021)
Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-end object detection (2022)
Zhang, H., et al.: GLIPv2: unifying localization and vision-language understanding (2022)
Zhao, T., Liu, P., Lu, X., Lee, K.: OmDet: language-aware object detection with large-scale vision-language multi-dataset pre-training (2022)
Zhong, Y., et al.: Regionclip: region-based language-image pretraining (2022)
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022)
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR 2021: The Ninth International Conference on Learning Representations (2021)
Acknowledgement
We thank the authors of GLIP [24]: Liunian Harold Li, Pengchuan Zhang, and Haotian Zhang, for their helpful discussions and instructions. We also thank Tiancheng Zhao, the author of OmDet [58], and Jianhua Han, the author of DetCLIP [51], for answering questions about their model details. We thank He Cao of The Hong Kong University of Science and Technology for his help with diffusion models.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, S. et al. (2025). Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_3
DOI: https://doi.org/10.1007/978-3-031-72970-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6
eBook Packages: Computer Science (R0)