research-article

RPViT: Vision Transformer Based on Region Proposal

Authors:

Qianxiang Wang,

Guangyu Gao

ICIGP '22: Proceedings of the 2022 5th International Conference on Image and Graphics Processing

Pages 220 - 225

https://doi.org/10.1145/3512388.3512421

Published: 28 March 2022

Abstract

Vision Transformers continually absorb characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys the position and scale features of the image at the very start of the network. In this paper, we propose a vision transformer based on region proposal (RPViT), which obtains this inductive bias in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting local regions with a traditional region proposal algorithm and rescaling objects of different scales to the same scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.
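The two mechanisms named in the abstract, rescaling proposed regions to a common size and orthogonalizing several class tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the box format (x0, y0, x1, y1), the align-corners sampling convention, and the use of modified Gram-Schmidt as the orthogonalization are all assumptions made for illustration. The region proposal step itself is not shown; the boxes are taken as given.

```python
import numpy as np


def bilinear_resize(region: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize an (H, W, C) array to (out_h, out_w, C) by bilinear interpolation.

    Assumption: align-corners sampling, so output corners map to input corners.
    """
    in_h, in_w = region.shape[:2]
    # Fractional sample positions expressed in input pixel coordinates.
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    # Interpolation weights, shaped to broadcast over (out_h, out_w, C).
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = region[y0][:, x0] * (1 - wx) + region[y0][:, x1] * wx
    bottom = region[y1][:, x0] * (1 - wx) + region[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def crop_and_resize(image: np.ndarray, boxes, size: int) -> np.ndarray:
    """Crop each (x0, y0, x1, y1) box (end-exclusive) and rescale it to size x size."""
    patches = [
        bilinear_resize(image[y0:y1, x0:x1], size, size)
        for (x0, y0, x1, y1) in boxes
    ]
    return np.stack(patches)


def orthogonalize_tokens(tokens: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Make the rows of a (K, D) matrix orthonormal with modified Gram-Schmidt.

    Assumption: K <= D and the rows are linearly independent.
    """
    ortho = np.zeros_like(tokens, dtype=float)
    for k in range(tokens.shape[0]):
        v = tokens[k].astype(float)
        # Remove the components along every previously fixed token.
        for j in range(k):
            v = v - np.dot(ortho[j], v) * ortho[j]
        ortho[k] = v / (np.linalg.norm(v) + eps)
    return ortho


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 96, 3))
    # Hypothetical proposals of different sizes; a real pipeline would obtain
    # them from a region proposal algorithm.
    boxes = [(0, 0, 32, 32), (10, 5, 90, 60), (40, 20, 56, 36)]
    patches = crop_and_resize(image, boxes, size=16)
    class_tokens = orthogonalize_tokens(rng.normal(size=(4, 32)))
    print(patches.shape, class_tokens.shape)
```

In this sketch every region, whatever its original extent, becomes a patch of the same resolution, which is one simple way to read the abstract's claim of scale invariance. Orthonormal class tokens start out mutually uncorrelated, so each one can in principle attend to a different candidate object.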

Supplementary Material

Presentation slides (presentation_of_rpvit.pptx, 1.60 MB)


Published In

ICIGP '22: Proceedings of the 2022 5th International Conference on Image and Graphics Processing

January 2022

391 pages

ISBN:9781450395465

DOI:10.1145/3512388

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICIGP 2022

ICIGP 2022: 2022 the 5th International Conference on Image and Graphics Processing

January 7 - 9, 2022

Beijing, China
