Article

SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

Authors:

Yanzhi WangAuthors Info & Claims

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI

Pages 620 - 640

https://doi.org/10.1007/978-3-031-20083-0_37

Published: 23 October 2022 Publication History

Abstract

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Considering the computation complexity, the internal data pattern of ViTs, and the edge device deployment, we propose a latency-aware soft token pruning framework, SPViT, which can be set up on vanilla Transformers of both flatten and hierarchical structures, such as DeiTs and Swin-Transformers (Swin). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens chosen by the selector module into a package token rather than discarding them completely. SPViT is bound to the trade-off between accuracy and latency requirements of specific edge devices through our proposed latency-aware training strategy. Experiment results show that SPViT significantly reduces the computation cost of ViTs with comparable performance on image classification. Moreover, SPViT can guarantee the identified model meets the latency specifications of mobile devices and FPGA, and even achieve the real-time execution of DeiT-T on mobile devices. For example, SPViT reduces the latency of DeiT-T to 26 ms (26%−41% superior to existing works) on the mobile device with 0.25%−4% higher top-1 accuracy on ImageNet. Our code is released at https://github.com/PeiyanFlying/SPViT.

References

[1]

Amini, A., Periyasamy, A.S., Behnke, S.: T6d-direct: transformers for multi-object 6d pose direct regression. arXiv preprint arXiv:2109.10948 (2021)

[2]

Bao, H., Dong, L., Piao, S., Wei, F.: BEit: BERT pre-training of image transformers. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=p-BhZSz59o4

[3]

Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, and Zagoruyko S Vedaldi A, Bischof H, Brox T, and Frahm J-M End-to-end object detection with transformers Computer Vision – ECCV 2020 2020 Cham Springer 213-229

Digital Library

[4]

Chang, S.E., et al.: Mix and match: a novel fpga-centric deep neural network quantization framework. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 208–220. IEEE (2021)

[5]

Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 782–791 (2021)

[6]

Chen, B., et al.: Psvit: better vision transformer via token pooling and attention sharing. arXiv preprint arXiv:2108.03428 (2021)

[7]

Chen, C.F.R., Fan, Q., Panda, R.: Crossvit: cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 357–366 (2021)

[8]

Chen, H., et al.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310 (2021)

[9]

Chen, M., Peng, H., Fu, J., Ling, H.: Autoformer: searching transformers for visual recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12270–12280 (2021)

[10]

Chen, P., Chen, Y., Liu, S., Yang, M., Jia, J.: Exploring and improving mobile level vision transformers. arXiv preprint arXiv:2108.13015 (2021)

[11]

Chen, T., Chen, X., Ma, X., Wang, Y., Wang, Z.: Coarsening the granularity: towards structurally sparse lottery tickets. In: Proceedings of the International Conference on Machine Learning (ICML) (2022)

[12]

Chen, T., Cheng, Y., Gan, Z., Yuan, L., Zhang, L., Wang, Z.: Chasing sparsity in vision transformers: an end-to-end exploration. In: Advances in Neural Information Processing Systems (2021)

[13]

Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: a language modeling framework for object detection. arXiv preprint arXiv:2109.10852 (2021)

[14]

Chen, X., Hsieh, C.J., Gong, B.: When vision transformers outperform resnets without pre-training or strong data augmentations. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=LtKcMgGOeLt

[15]

Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8126–8135 (2021)

[16]

Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems (2021). https://openreview.net/forum?id=0lz69oI5iZP

[17]

Chu, C., et al.: Pim-prune: fine-grain dcnn pruning for crossbar-based process-in-memory architecture. In: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2020)

[18]

Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

[19]

Dai, Z., Cai, B., Lin, Y., Chen, J.: Up-detr: unsupervised pre-training for object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1601–1610 (2021)

[20]

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009).

[21]

Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: Transvg: end-to-end visual grounding with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1769–1779 (2021)

[22]

Dosovitskiy, A., et al.: An image is worth 16

\times

16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=YicbFdNTTy

[23]

El-Nouby, A., Neverova, N., Laptev, I., Jégou, H.: Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644 (2021)

[24]

El-Nouby, A., et al.: XCit: Cross-covariance image transformers. In: Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems (2021). https://openreview.net/forum?id=kzPtpIpF8o

[25]

Fang, H., Mei, Z., Shrestha, A., Zhao, Z., Li, Y., Qiu, Q.: Encoding, model, and architecture: systematic optimization for spiking neural network in fpgas. In: 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1–9. IEEE (2020)

[26]

Fang, H., Shrestha, A., Zhao, Z., Qiu, Q.: Exploiting neuron and synapse filter dynamics in spatial temporal learning of deep spiking neural network. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. IJCAI 2020 (2021)

[27]

Fang, H., Taylor, B., Li, Z., Mei, Z., Li, H.H., Qiu, Q.: Neuromorphic algorithm-hardware codesign for temporal pattern learning. In: 2021 58th ACM/IEEE Design Automation Conference (DAC), pp. 361–366. IEEE (2021)

[28]

Fayyaz, M., et al.: Ats: adaptive token sampling for efficient vision transformers. arXiv preprint arXiv:2111.15667 (2021)

[29]

Gao, P., Lu, J., Li, H., Mottaghi, R., Kembhavi, A.: Container: context aggregation network. arXiv preprint arXiv:2106.01401 (2021)

[30]

Gong, Y., et al.: A privacy-preserving-oriented dnn pruning and mobile acceleration framework. In: Proceedings of the 2020 on Great Lakes Symposium on VLSI, pp. 119–124 (2020)

[31]

Graham, B., et al.: Levit: a vision transformer in convnet’s clothing for faster inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12259–12269 (2021)

[32]

Guo, C., et al.: Accelerating sparse dnn models without hardware-support via tile-wise sparsity. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE (2020)

[33]

Guo MH, Cai JX, Liu ZN, Mu TJ, Martin RR, and Hu SM Pct: point cloud transformer Comput. Visual Media 2021 7 2 187-199

[34]

Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: Advances in Neural Information Processing Systems (2021)

[35]

Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. In: International Conference on Computer Vision (ICCV) (2021)

[36]

Hou, Z., et al.: Chex: channel exploration for cnn model compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12287–12298 (2022)

[37]

Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

[38]

Hudson, D.A., Zitnick, C.L.: Generative adversarial transformers. In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021 (2021)

[39]

Jia, D., et al.: Efficient vision transformers via fine-grained manifold distillation. arXiv preprint arXiv:2107.01378 (2021)

[40]

Jiang, Z., et al.: All tokens matter: token labeling for training better vision transformers. arXiv preprint arXiv:2104.10858 (2021)

[41]

Kim, B., Lee, J., Kang, J., Kim, E.S., Kim, H.J.: Hotr: end-to-end human-object interaction detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 74–83 (2021)

[42]

Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International Conference on Machine Learning, pp. 3519–3529. PMLR (2019)

[43]

Li, B., et al.: Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3187–3199 (2020)

[44]

Li, Y., Fang, H., Li, M., Ma, Y., Qiu, Q.: Neural network pruning and fast training for drl-based uav trajectory planning. In: 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 574–579. IEEE (2022)

[45]

Li, Z., et al.: Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6197–6206 (2021)

[46]

Liang, Y., GE, C., Tong, Z., Song, Y., Wang, J., Xie, P.: EVit: expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=BjyvwnXXVn_

[47]

Liu, N., et al.: Lottery ticket preserves weight correlation: is it desirable or not? In: International Conference on Machine Learning (ICML), pp. 7011–7020. PMLR (2021)

[48]

Liu, Y., Sangineto, E., Bi, W., Sebe, N., Lepri, B., De Nadai, M.: Efficient training of visual transformers with small-size datasets. arXiv preprint arXiv:2106.03746 (2021)

[49]

Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision (ICCV) (2021)

[50]

Lu, Z., Liu, H., Li, J., Zhang, L.: Efficient transformer for single image super-resolution. arXiv preprint arXiv:2108.11084 (2021)

[51]

Ma, X., et al.: PCONV: the missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 34, pp. 5117–5124 (2020)

[52]

Ma, X., et al.: Non-structured dnn weight pruning-is it beneficial in any platform? In: IEEE Transactions on Neural Networks and Learning Systems (TNNLS) (2021)

[53]

Ma, X., et al.: An image enhancing pattern-based sparsity for real-time inference on mobile devices. In: Proceedings of the European conference on computer vision (ECCV). pp. 629–645. Springer (2020).

Digital Library

[54]

Ma, X., et al.: Effective model sparsification by scheduled grow-and-prune methods. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021)

[55]

Ma, X., et al.: Blcr: Towards real-time dnn execution with block-based reweighted pruning. In: International Symposium on Quality Electronic Design (ISQED), pp. 1–8. IEEE (2022)

[56]

Ma, X., et al.: Tiny but accurate: a pruned, quantized and optimized memristor crossbar framework for ultra efficient dnn implementation. In: 2020 25th Asia and South Pacific design automation conference (ASP-DAC), pp. 301–306. IEEE (2020)

[57]

Ma, X., et al.: Sanity checks for lottery tickets: Does your winning ticket really win the jackpot? In: Advances in Neural Information Processing Systems (NeurIPS) 34 (2021)

[58]

Mao, M., et al.: Dual-stream network for visual recognition. In: Advances in Neural Information Processing Systems (2021)

[59]

Meinhardt, T., Kirillov, A., Leal-Taixe, L., Feichtenhofer, C.: Trackformer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702 (2021)

[60]

Misra, I., Girdhar, R., Joulin, A.: An end-to-end transformer model for 3d object detection. In: ICCV (2021)

[61]

Niu, W., et al.: A compression-compilation framework for on-mobile real-time bert applications. arXiv preprint arXiv:2106.00526 (2021)

[62]

Niu, W., et al.: Grim: A general, real-time deep learning inference framework for mobile devices based on fine-grained structured weight sparsity. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2021)

[63]

Niu, W., et al.: Patdnn: achieving real-time dnn execution on mobile devices with pattern-based weight pruning. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 907–922 (2020)

[64]

Pan, B., Jiang, Y., Panda, R., Wang, Z., Feris, R., Oliva, A.: Ia-red

^{2}

: Interpretability-aware redundancy reduction for vision transformers. In: Advances in Neural Information Processing Systems (2021)

[65]

Pan, Z., Zhuang, B., Liu, J., He, H., Cai, J.: Scalable vision transformers with hierarchical pooling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 377–386 (2021)

[66]

Prillo, S., Eisenschlos, J.: Softsort: a continuous relaxation for the argsort operator. In: International Conference on Machine Learning, pp. 7793–7802. PMLR (2020)

[67]

Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10428–10436 (2020)

[68]

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810 (2021)

[69]

Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: efficient vision transformers with dynamic token sparsification. In: Advances in Neural Information Processing Systems (2021)

[70]

Ren, A., et al.: Admm-nn: an algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 925–938 (2019)

[71]

Renggli, C., Pinto, A.S., Houlsby, N., Mustafa, B., Puigcerver, J., Riquelme, C.: Learning to merge tokens in vision transformers. arXiv preprint arXiv:2202.12015 (2022)

[72]

Rumi, M.A., Ma, X., Wang, Y., Jiang, P.: Accelerating sparse cnn inference on gpus with performance-aware weight pruning. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 267–278 (2020)

[73]

Ryoo, M.S., Piergiovanni, A., Arnab, A., Dehghani, M., Angelova, A.: Tokenlearner: what can 8 learned tokens do for images and videos? In: Advances in Neural Information Processing Systems (2021)

[74]

Sanh, V., Wolf, T., Rush, A.M.: Movement pruning: adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683 (2020)

[75]

Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519–16529 (2021)

[76]

Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)

[77]

Tan, Z., et al.: Pcnn: pattern-based fine-grained regular pruning towards optimizing cnn accelerators. In: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2020)

[78]

Tang, Y., et al.: Patch slimming for efficient vision transformers (2021)

[79]

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., J’egou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)

[80]

Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

[81]

Wang, H., Zhang, Z., Han, S.: Spatten: efficient sparse attention architecture with cascade token and head pruning. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 97–110. IEEE (2021)

[82]

Wang, P., et al.: Kvt: k-nn attention for boosting vision transformers. arXiv preprint arXiv:2106.00515 (2021)

[83]

Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: IEEE ICCV (2021)

[84]

Wu, B., et al.: Visual transformers: token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677 (2020)

[85]

Wu, H., et al.: Cvt: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22–31 (2021)

[86]

Wu, K., Peng, H., Chen, M., Fu, J., Chao, H.: Rethinking and improving relative position encoding for vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10033–10041 (2021)

[87]

Xu, C., et al.: You only group once: efficient point-cloud processing with token representation and relation inference module. arXiv preprint arXiv:2103.09975 (2021)

[88]

Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers. arXiv preprint arXiv:2104.06399 (2021)

[89]

Xu, Y., et al.: Evo-vit: slow-fast token evolution for dynamic vision transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence (2022)

[90]

Xue, F., Wang, Q., Guo, G.: Transfer: learning relation-aware facial expression representations with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3601–3610 (2021)

[91]

Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. arXiv preprint arXiv:2103.17154 (2021)

[92]

Yang, C., Wu, Z., Zhou, B., Lin, S.: Instance localization for self-supervised detection pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3987–3996 (2021)

[93]

Yang, F., Yang, H., Fu, J., Lu, H., Guo, B.: Learning texture transformer network for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5791–5800 (2020)

[94]

Yang, G., Tang, H., Ding, M., Sebe, N., Ricci, E.: Transformer-based attention networks for continuous pixel-wise prediction. In: ICCV (2021)

[95]

Yu, H., Wu, J.: A unified pruning framework for vision transformers. arXiv preprint arXiv:2111.15127 (2021)

[96]

Yu, Q., Xia, Y., Bai, Y., Lu, Y., Yuille, A., Shen, W.: Glance-and-gaze vision transformer. In: Advances in Neural Information Processing Systems (2021)

[97]

Yu, S., et al.: Unified visual transformer compression. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=9jsZiUgkCZP

[98]

Yuan, G., et al.: Tinyadc: Peripheral circuit-aware weight pruning framework for mixed-signal dnn accelerators. In: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 926–931. IEEE (2021)

[99]

Yuan, G., et al.: Improving dnn fault tolerance using weight pruning and differential crossbar mapping for reram-based edge ai. In: 2021 22nd International Symposium on Quality Electronic Design (ISQED), pp. 135–141. IEEE (2021)

[100]

Yuan, G., et al.: An ultra-efficient memristor-based dnn framework with structured weight pruning and quantization using admm. In: 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6. IEEE (2019)

[101]

Yuan, G., et al.: Mest: accurate and fast memory-economic sparse training framework on the edge. In: Advances in Neural Information Processing Systems (NeurIPS) 34 (2021)

[102]

Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., Wu, W.: Incorporating convolution designs into visual transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 579–588 (2021)

[103]

Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 558–567 (2021)

[104]

Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986 (2021)

[105]

Yue, X., Sun, S., Kuang, Z., Wei, M., Torr, P.H., Zhang, W., Lin, D.: Vision transformer with progressive sampling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 387–396 (2021)

[106]

Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers. arXiv preprint arXiv:2106.04560 (2021)

[107]

Zhang, T., et al.: A unified dnn weight pruning framework using reweighted optimization methods. In: 2021 58th ACM/IEEE Design Automation Conference (DAC), pp. 493–498. IEEE (2021)

[108]

Zhang, T., et al.: Structadmm: achieving ultrahigh efficiency in structured pruning for dnns. In: IEEE Transactions on Neural Networks and Learning Systems (TNNLS) (2021)

[109]

Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268 (2021)

[110]

Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)

[111]

Zhou, D., et al.: Refiner: refining self-attention for vision transformers (2021)

[112]

Zhu, M., Han, K., Tang, Y., Wang, Y.: Visual transformer pruning. In: KDD 2021 Workshop on Model Mining (2021)

Cited By

Wang YLi PYang YSalakhutdinov RKolter ZHeller KWeller AOliver NScarlett JBerkenkamp F(2024)Visual transformer with differentiable channel selectionProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3694153(50840-50858)Online publication date: 21-Jul-2024
https://dl.acm.org/doi/10.5555/3692070.3694153
Jiang WZhang JWang DZhang QWang ZDu BLarson K(2024)LeMeViTProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence10.24963/ijcai.2024/103(929-937)Online publication date: 3-Aug-2024
https://dl.acm.org/doi/10.24963/ijcai.2024/103
Mahmud TYaman BLiu CMarculescu D(2024)PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster InferenceComputer Vision – ECCV 202410.1007/978-3-031-73337-6_7(110-128)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-73337-6_7
Show More Cited By

Recommendations

Towards efficient vision transformer inference: a first study of transformers on mobile devices
HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications

Convolution neural networks (CNNs) have long been dominating the model choice in on-device intelligent mobile applications. Recently, we are witnessing the fast development of vision transformers, which are notable for the use of the self-attention ...
Regularizing self-attention on vision transformers with 2D spatial distance loss
Abstract
Recently, the vision transformer (ViT) achieved remarkable results on computer vision-related tasks. However, ViT lacks the inductive biases present on CNNs, such as locality and translation equivariance. Overcoming this deficiency usually comes ...
TT-ViT: Vision Transformer Compression Using Tensor-Train Decomposition
Computational Collective Intelligence
Abstract
Inspired by Transformer, one of the most successful deep learning models in natural language processing, machine translation, etc. Vision Transformer (ViT) has recently demonstrated its effectiveness in computer vision tasks such as image ...

Comments

Information & Contributors

Information

Published In

cover image Guide Proceedings

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI

Oct 2022

800 pages

ISBN:978-3-031-20082-3

DOI:10.1007/978-3-031-20083-0

Editors:
Shai Avidan
Tel Aviv University, Tel Aviv, Israel
,
Gabriel Brostow
University College London, London, UK
,
Moustapha Cissé
Google AI, Accra, Ghana
,
Giovanni Maria Farinella
University of Catania, Catania, Italy
,
Tal Hassner
Facebook (United States), Menlo Park, CA, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 October 2022

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wang YLi PYang YSalakhutdinov RKolter ZHeller KWeller AOliver NScarlett JBerkenkamp F(2024)Visual transformer with differentiable channel selectionProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3694153(50840-50858)Online publication date: 21-Jul-2024
https://dl.acm.org/doi/10.5555/3692070.3694153
Jiang WZhang JWang DZhang QWang ZDu BLarson K(2024)LeMeViTProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence10.24963/ijcai.2024/103(929-937)Online publication date: 3-Aug-2024
https://dl.acm.org/doi/10.24963/ijcai.2024/103
Mahmud TYaman BLiu CMarculescu D(2024)PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster InferenceComputer Vision – ECCV 202410.1007/978-3-031-73337-6_7(110-128)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-73337-6_7
Xu KWang ZChen CGeng XLin JYang XWu MLi XLin W(2024)LPViT: Low-Power Semi-structured Pruning for Vision TransformersComputer Vision – ECCV 202410.1007/978-3-031-73209-6_16(269-287)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-73209-6_16
Haurum JEscalera STaylor GMoeslund T(2024)Agglomerative Token ClusteringComputer Vision – ECCV 202410.1007/978-3-031-72998-0_12(200-218)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72998-0_12
Zhang DLiang DTan ZYe XZhang CWang JBai X(2024)Make Your ViT-Based Multi-view 3D Detectors Faster via Token CompressionComputer Vision – ECCV 202410.1007/978-3-031-72970-6_4(56-72)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72970-6_4
Liu ZZhang ZKhaki SYang STang HXu CKeutzer KHan S(2024)Sparse Refinement for Efficient High-Resolution Semantic SegmentationComputer Vision – ECCV 202410.1007/978-3-031-72855-6_7(108-127)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72855-6_7
Xiong YChen HHao TLin ZHan JZhang YWang GBao YDing G(2024)PYRA: Parallel Yielding Re-activation for Training-Inference Efficient Task AdaptationComputer Vision – ECCV 202410.1007/978-3-031-72673-6_25(455-473)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72673-6_25
Kim MHan DKim THan B(2024)Leveraging Temporal Contextualization for Video Action RecognitionComputer Vision – ECCV 202410.1007/978-3-031-72664-4_5(74-91)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72664-4_5
Huang KZou HXi YWang BXie ZYu L(2024)IVTP: Instruction-Guided Visual Token Pruning for Large Vision-Language ModelsComputer Vision – ECCV 202410.1007/978-3-031-72643-9_13(214-230)Online publication date: 29-Sep-2024
https://dl.acm.org/doi/10.1007/978-3-031-72643-9_13
Show More Cited By

View Options

View options

Media

Figures

Other

Tables

View Table of Contents