ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Song, Hwanjun; Sun, Deqing; Chun, Sanghyuk; Jampani, Varun; Han, Dongyoon; Heo, Byeongho; Kim, Wonjae; Yang, Ming-Hsuan

arXiv:2110.03921 (cs)

[Submitted on 8 Oct 2021 (v1), last revised 29 Nov 2021 (this version, v2)]

Abstract: Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at this https URL

Comments:	This is a revised version on Nov. 29
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2110.03921 [cs.CV]
	(or arXiv:2110.03921v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2110.03921

Submission history

From: Hwanjun Song
[v1] Fri, 8 Oct 2021 06:32:05 UTC (795 KB)
[v2] Mon, 29 Nov 2021 11:07:13 UTC (817 KB)
