Lightweight Deep Learning for Resource-Constrained Environments: A Survey

Published: 24 June 2024
    Abstract

    Over the past decade, deep learning has come to dominate many domains of artificial intelligence, including natural language processing, computer vision, and biomedical signal processing. While model accuracy has improved remarkably, deploying these models on lightweight devices such as mobile phones and microcontrollers is constrained by limited resources. In this survey, we provide comprehensive design guidance tailored to these devices, covering the design of lightweight models, compression methods, and hardware acceleration strategies. The principal goal of this work is to explore methods and concepts for circumventing hardware constraints without compromising model accuracy. In addition, we examine two notable future directions for lightweight deep learning: deployment techniques for TinyML and for Large Language Models. While both directions undoubtedly hold potential, they also present significant challenges that invite research into as-yet unexplored areas.
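    As a concrete illustration of one compression method in the family the survey covers, the sketch below applies post-training dynamic quantization to a small PyTorch model. This is a minimal, hypothetical example for orientation only; the network and its sizes are assumptions and are not taken from the article.

        import torch
        import torch.nn as nn

        # A small stand-in network; any module with Linear (or LSTM) layers
        # can be quantized the same way. Purely illustrative, not from the survey.
        model = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )
        model.eval()

        # Post-training dynamic quantization: weights are stored as int8,
        # activations are quantized on the fly at inference time.
        quantized = torch.quantization.quantize_dynamic(
            model, {nn.Linear}, dtype=torch.qint8
        )

        # The quantized model is a drop-in replacement for inference.
        x = torch.randn(1, 128)
        with torch.no_grad():
            print(quantized(x).shape)  # torch.Size([1, 10])

    Dynamic quantization of this kind typically shrinks the weight footprint by roughly 4x (float32 to int8) without retraining, which makes it one of the simplest entry points to the compression techniques discussed in the survey.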


    Published In

    ACM Computing Surveys, Volume 56, Issue 10
    October 2024, 954 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/3613652
    Editors: David Atienza, Michela Milano

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 June 2024
    Online AM: 11 May 2024
    Accepted: 02 April 2024
    Revised: 02 March 2024
    Received: 15 December 2022
    Published in CSUR Volume 56, Issue 10

    Author Tags

    1. Lightweight model
    2. efficient transformer
    3. model compression
    4. quantization
    5. tinyML
    6. large language models

    Qualifiers

    • Survey

    Funding Sources

    • National Science and Technology Council, Taiwan
    • National Key Fields Industry-University Cooperation and Skilled Personnel Training Act
    • Ministry of Education (MOE) and industry partners in Taiwan
