Richer Information Transformer for Object Detection

Published: 06 March 2023
DOI: 10.1145/3578741.3578763
Abstract

While Convolutional Neural Networks (CNNs) have long dominated computer vision tasks such as object detection and instance segmentation, Vision Transformers (ViTs) have recently shown promising performance on these tasks. CNNs efficiently reduce local redundancy by convolving within a small neighborhood, but their limited receptive field makes it hard to capture global dependencies. ViTs, in contrast, effectively capture long-range dependencies via the self-attention mechanism, at the cost of computational complexity that is quadratic in the input image size. In this paper, we propose a module composed of several groups of convolutions and activation functions that compensates for the comprehensive information ViTs lack when extracting features, so that convolution and transformer achieve complementary advantages. We also introduce a channel attention module to capture channel information, which is weakened by the frequent channel manipulations in the self-attention computation. Without pretraining, our model achieves 40.3 box AP and 37.1 mask AP on the COCO object detection task, surpassing the state-of-the-art Swin Transformer backbone by +8.8 and +6.7 respectively under similar FLOPs settings.
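The abstract describes two ingredients grafted onto a ViT backbone: a grouped convolution-and-activation module that restores the local detail self-attention misses, and a squeeze-and-excitation-style channel attention that re-weights channels after the attention computation. Below is a minimal, hypothetical PyTorch sketch of how such a hybrid block could be wired; the class names (`HybridBlock`, `ConvEnhancer`, `ChannelAttention`), group counts, and the exact ordering of the branches are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: self-attention for global context, a grouped-conv
# branch for local detail, and squeeze-and-excitation-style channel
# attention to re-weight channels after the attention computation.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel re-weighting."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),              # squeeze: global spatial average
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),                         # excitation: per-channel gates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)                     # re-weight channels


class ConvEnhancer(nn.Module):
    """Groups of convs + activations to inject local detail (assumed wiring)."""
    def __init__(self, dim: int, groups: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=groups),
            nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1, groups=groups),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)                   # residual keeps attention features


class HybridBlock(nn.Module):
    """Self-attention for global dependency, then conv + channel attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvEnhancer(dim)
        self.ca = ChannelAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        x = x + attn_out.transpose(1, 2).reshape(b, c, h, w)
        x = self.conv(x)                                   # local detail
        return self.ca(x)                                  # channel re-weighting


if __name__ == "__main__":
    block = HybridBlock(dim=64)
    feat = torch.randn(2, 64, 32, 32)
    print(block(feat).shape)  # torch.Size([2, 64, 32, 32])
```

The residual connections keep the attention features intact while the conv branch adds local detail, mirroring the complementary-advantage framing in the abstract; the channel gate then recovers per-channel emphasis after the attention step's channel manipulations.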

Published In

MLNLP '22: Proceedings of the 2022 5th International Conference on Machine Learning and Natural Language Processing
December 2022, 406 pages
ISBN: 9781450399067
DOI: 10.1145/3578741
Publisher: Association for Computing Machinery, New York, NY, United States
