DOI: 10.1145/3495243.3560545

research-article
Mandheling: mixed-precision on-device DNN training with DSP offloading

Published: 14 October 2022

Abstract

This paper proposes Mandheling, the first system that enables highly resource-efficient on-device training by orchestrating mixed-precision training with on-chip Digital Signal Processor (DSP) offloading. Mandheling fully exploits the DSP's advantages in integer-based numerical computation through four novel techniques: (1) a CPU-DSP co-scheduling scheme that situationally mitigates the overhead of DSP-unfriendly operators; (2) a self-adaptive rescaling algorithm that reduces the overhead of dynamic rescaling in backward propagation; (3) a batch-splitting algorithm that improves DSP cache efficiency; and (4) a DSP compute-subgraph reuse mechanism that eliminates the graph preparation overhead on the DSP. We have fully implemented Mandheling and demonstrated its effectiveness through extensive experiments. The results show that, compared to the state-of-the-art DNN engines TFLite and MNN, Mandheling reduces per-batch training time by 5.5X and energy consumption by 8.9X on average. In end-to-end training tasks, Mandheling reduces convergence time by up to 10.7X and energy consumption by 13.1X, with only a 1.9%--2.7% accuracy loss relative to the FP32-precision setting.
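To make the second technique concrete, the sketch below illustrates the general idea behind self-adaptive rescaling for INT8 gradients: instead of recomputing a quantization scale on every backward step, a cached scale is reused until the gradient's dynamic range drifts past a threshold. This is a minimal plain-Python/NumPy sketch of the general technique only; the names (AdaptiveRescaler, drift_threshold) are illustrative assumptions, not Mandheling's actual API or algorithm.

    # Illustrative sketch only -- not Mandheling's implementation.
    import numpy as np

    def choose_scale(x):
        # Pick a per-tensor scale so the max magnitude maps near INT8's range.
        return max(float(np.abs(x).max()), 1e-8) / 127.0

    def quantize_int8(x, scale):
        # Symmetric per-tensor quantization of a float tensor to INT8.
        return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

    class AdaptiveRescaler:
        """Reuse a cached gradient scale until the observed dynamic range
        drifts past a threshold, avoiding a rescale on every backward step."""
        def __init__(self, drift_threshold=2.0):
            self.scale = None
            self.drift_threshold = drift_threshold

        def scale_for(self, grad):
            fresh = choose_scale(grad)
            if (self.scale is None
                    or fresh > self.scale * self.drift_threshold
                    or fresh < self.scale / self.drift_threshold):
                self.scale = fresh  # rescale only on significant drift
            return self.scale

    # Usage: quantize a gradient tensor for integer-only backward kernels.
    rescaler = AdaptiveRescaler()
    grad = np.random.randn(256).astype(np.float32)
    q_grad = quantize_int8(grad, rescaler.scale_for(grad))

Under such a policy, the max-reduction and requantization cost is paid only when the gradient's range actually shifts, which is the kind of backward-pass rescaling overhead the paper targets.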



      Published In

      MobiCom '22: Proceedings of the 28th Annual International Conference on Mobile Computing And Networking
      October 2022
      932 pages
      ISBN:9781450391818
      DOI:10.1145/3495243

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 October 2022


      Author Tags

      1. DSP offloading
      2. deep learning
      3. mobile device

      Qualifiers

      • Research-article

      Funding Sources

      • the National Key R&D Program of China
      • Beijing Nova Program
      • the National Natural Science Foundation of China
      • Young Elite Scientists Sponsorship Program by CAST
      • the PKU-Baidu Fund Project
      • NSFC
      • the National Natural Science Fund for the Excellent Young Scientists Fund Program (Overseas)

      Conference

      ACM MobiCom '22

      Acceptance Rates

      Overall Acceptance Rate 440 of 2,972 submissions, 15%

      Article Metrics

      • Downloads (Last 12 months)324
      • Downloads (Last 6 weeks)23
      Reflects downloads up to 26 Jan 2025


      Cited By

      • (2024) FAMOS. Proceedings of the 33rd USENIX Conference on Security Symposium, 10.5555/3698900.3698917, pp. 289-306. Online publication date: 14-Aug-2024.
      • (2024) TinyTrain. Proceedings of the 41st International Conference on Machine Learning, 10.5555/3692070.3693103, pp. 25812-25843. Online publication date: 21-Jul-2024.
      • (2024) FwdLLM. Proceedings of the 2024 USENIX Annual Technical Conference, 10.5555/3691992.3692028, pp. 579-596. Online publication date: 10-Jul-2024.
      • (2024) More is different. Proceedings of the 2024 USENIX Annual Technical Conference, 10.5555/3691992.3692009, pp. 285-302. Online publication date: 10-Jul-2024.
      • (2024) On-device Training: A First Overview on Existing Systems. ACM Transactions on Sensor Networks, 10.1145/3696003, 20(6):1-39. Online publication date: 14-Sep-2024.
      • (2024) Artificial Intelligence of Things: A Survey. ACM Transactions on Sensor Networks, 10.1145/3690639, 21(1):1-75. Online publication date: 30-Aug-2024.
      • (2024) Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-tuning. Proceedings of the 53rd International Conference on Parallel Processing, 10.1145/3673038.3673043, pp. 762-771. Online publication date: 12-Aug-2024.
      • (2024) AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, 10.1145/3666025.3699339, pp. 295-308. Online publication date: 4-Nov-2024.
      • (2024) PieBridge: Fast and Parameter-Efficient On-Device Training via Proxy Networks. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, 10.1145/3666025.3699327, pp. 126-140. Online publication date: 4-Nov-2024.
      • (2024) WiP: Efficient LLM Prefilling with Mobile NPU. Proceedings of the Workshop on Edge and Mobile Foundation Models, 10.1145/3662006.3662066, pp. 33-35. Online publication date: 3-Jun-2024.
