Transformers in Vision: A Survey

Published: 13 September 2022

Abstract

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as long short-term memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited to operate as set functions. Furthermore, the straightforward design of Transformers allows multiple modalities (e.g., images, videos, text, and speech) to be processed with similar building blocks, and it scales well to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and three-dimensional analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques, both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future work. We hope this effort will spark further interest in the community in solving current challenges toward the application of Transformer models in computer vision.
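The core operation behind the abstract's claims about long-range dependencies and parallel processing, self-attention, can be made concrete with a short sketch. The NumPy snippet below is an illustrative example rather than code from the survey; the function name, variable names, and dimensions are assumptions. It computes scaled dot-product self-attention for a single head: each token's output is a softmax-weighted mixture of the value vectors of all tokens, so every position can attend to every other position in one parallel matrix operation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x: (n, d_model) sequence of n token embeddings.
    w_q, w_k, w_v: (d_model, d_k) projection matrices (illustrative only).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise similarities, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over all positions
    return weights @ v                              # each output mixes the values of all tokens

# Toy usage (assumed setup): 4 tokens with 8-dimensional embeddings, one head of width 8.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
print(self_attention(tokens, wq, wk, wv).shape)     # (4, 8)
```

In a full Transformer block this single-head operation is repeated across several heads and combined with a feed-forward layer, residual connections, and layer normalization.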

Published In

ACM Computing Surveys, Volume 54, Issue 10s
January 2022, 831 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3551649

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2022
Online AM: 06 January 2022
Accepted: 07 December 2021
Revised: 04 December 2021
Received: 02 March 2021
Published in CSUR Volume 54, Issue 10s

Author Tags

  1. Self-attention
  2. transformers
  3. bidirectional encoders
  4. deep neural networks
  5. convolutional networks
  6. self-supervision
  7. literature survey

Qualifiers

  • Survey
  • Refereed
