Research article | Open access

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Published: 11 June 2024

Abstract

    Edge-Cloud collaboration for deep learning inference has been actively studied as a way to enhance inference performance by leveraging both Edge and Cloud resources. However, traditional Edge-Cloud collaboration based on model partitioning or confidence scores is not suitable in the era of LLMs (large language models), because of their autoregressive generation and their generality across diverse tasks. This paper proposes dynamic token-level Edge-Cloud collaboration for LLMs. An SLM (small language model) such as TinyLlama resides on the Edge device and, through token-level interaction with the Cloud-side LLM during inference, approaches LLM quality at a controllable cost close to that of the SLM alone. Evaluation results show that our method achieves LLM-comparable quality on the GSM8K task while using only 25.8% of the LLM cost.
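
    The core mechanism the abstract describes is a decoding loop in which the on-device SLM generates most tokens and escalates to the Cloud LLM only when needed, so the LLM-call ratio becomes an explicit cost knob. The following Python sketch illustrates one plausible instantiation, confidence-gated token routing; the paper's actual policy and interfaces are not given in the abstract, and slm_step, llm_step, conf_threshold, and the toy stand-ins are illustrative assumptions.

        # Hypothetical sketch of token-level Edge-Cloud collaborative decoding.
        # Assumption: the Edge SLM returns a (token, probability) pair per step
        # and defers to the Cloud LLM when its confidence is below a threshold;
        # the paper's actual routing policy may differ.
        import random

        def collaborative_decode(slm_step, llm_step, prompt_ids,
                                 max_new_tokens=128, conf_threshold=0.5, eos_id=2):
            """Decode with the on-device SLM, escalating low-confidence tokens
            to the Cloud LLM; returns (token ids, LLM-call ratio)."""
            ids = list(prompt_ids)
            llm_calls = 0
            for _ in range(max_new_tokens):
                token, prob = slm_step(ids)    # cheap Edge inference
                if prob < conf_threshold:      # SLM unsure: one Cloud round trip
                    token, _ = llm_step(ids)
                    llm_calls += 1
                ids.append(token)
                if token == eos_id:
                    break
            generated = len(ids) - len(prompt_ids)
            return ids, llm_calls / max(generated, 1)

        # Toy stand-ins so the sketch runs end to end; a real deployment would
        # wrap e.g. TinyLlama locally and an RPC call to a served Cloud LLM.
        random.seed(0)
        slm = lambda ids: (random.randint(3, 99), random.random())
        llm = lambda ids: (random.randint(3, 99), 1.0)
        _, ratio = collaborative_decode(slm, llm, [1], max_new_tokens=20)
        print(f"LLM call ratio: {ratio:.2f}")

    In this framing, the 25.8% figure reported for GSM8K would correspond to the share of decoding cost spent on the Cloud LLM, and tightening or loosening the gate (here, conf_threshold) trades quality against that cost.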

    References

    [1]
    Mario Almeida, Stefanos Laskaridis, Stylianos I Venieris, Ilias Leontiadis, and Nicholas D Lane. 2022. DynO: Dynamic onloading of deep neural networks from cloud to device. ACM Transactions on Embedded Computing Systems 21, 6 (2022), 1--24.
    [2]
    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024).
    [3]
    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318 (2023).
    [4]
    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
    [5]
    Yuxuan Chen, Rongpeng Li, Zhifeng Zhao, Chenghui Peng, Jianjun Wu, Ekram Hossain, and Honggang Zhang. 2023. NetGPT: A native-AI network architecture beyond provisioning personalized generative services. arXiv preprint arXiv:2307.06148 (2023).
    [6]
    Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. 2024. Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding. arXiv preprint arXiv:2402.12374 (2024).
    [7]
    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. 2023. MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886 (2023).
    [8]
    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. 2024. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint arXiv:2402.03766 (2024).
    [9]
    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
    [10]
    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022).
    [11]
    Ziyan Fu, Ju Ren, Yunxin Liu, Ting Cao, Deyu Zhang, Yuezhi Zhou, and Yaoxue Zhang. 2022. Hyperion: A Generic and Distributed Mobile Offloading Framework on OpenCL. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, SenSys 2022, Boston, Massachusetts, November 6-9, 2022, Jeremy Gummeson, Sunghoon Ivan Lee, Jie Gao, and Guoliang Xing (Eds.). ACM, 607--621. https://doi.org/10.1145/3560905.3568511
    [12]
    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all you need. arXiv preprint arXiv:2306.11644 (2023).
    [13]
    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog (2023).
    [14]
    Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W Mahoney, Amir Gholami, and Kurt Keutzer. 2024. Speculative decoding with big little decoder. Advances in Neural Information Processing Systems 36 (2024).
    [15]
    Stefanos Laskaridis, Stylianos I Venieris, Mario Almeida, Ilias Leontiadis, and Nicholas D Lane. 2020. SPINN: synergistic progressive inference of neural networks over device and cloud. In Proceedings of the 26th annual international conference on mobile computing and networking. 1--15.
    [16]
    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 19274--19286.
    [17]
    En Li, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. Edge AI: On-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19, 1 (2019), 447--457.
    [18]
    Min Li, Yu Li, Ye Tian, Li Jiang, and Qiang Xu. 2021. AppealNet: An Efficient and Highly-Accurate Edge/Cloud Collaborative Architecture for DNN Inference. In 58th ACM/IEEE Design Automation Conference, DAC 2021, San Francisco, CA, USA, December 5-9, 2021. IEEE, 409--414. https://doi.org/10.1109/DAC18074.2021.9586176
    [19]
    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023).
    [20]
    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157--173.
    [21]
    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. 2023. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781 (2023).
    [22]
    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium.
    [23]
    Tal Schuster, Adam Fisch, Tommi S. Jaakkola, and Regina Barzilay. 2021. Consistent Accelerated Inference via Confident Adaptive Transformers. CoRR abs/2104.08803 (2021). arXiv:2104.08803 https://arxiv.org/abs/2104.08803
    [24]
    Jiawei Shao and Jun Zhang. 2020. Bottlenet++: An end-to-end approach for feature compression in device-edge co-inference systems. In 2020 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 1--6.
    [25]
    Jiawei Shao and Jun Zhang. 2020. Communication-computation trade-off in resource-constrained edge inference. IEEE Communications Magazine 58, 12 (2020), 20--26.
    [26]
    Benjamin Spector and Chris Re. 2023. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623 (2023).
    [27]
    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 (2024).
    [28]
    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
    [29]
    Xuezhi Wang and Denny Zhou. 2024. Chain-of-Thought Reasoning Without Prompting. arXiv preprint arXiv:2402.10200 (2024).
    [30]
    Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the Eighteenth European Conference on Computer Systems. 233--248.
    [31]
    Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2023. LLMCad: Fast and scalable on-device large language model inference. arXiv preprint arXiv:2309.04255 (2023).
    [32]
    Zichuan Xu, Liqian Zhao, Weifa Liang, Omer F Rana, Pan Zhou, Qiufen Xia, Wenzheng Xu, and Guowei Wu. 2020. Energy-aware inference offloading for DNN-driven applications in mobile edge clouds. IEEE Transactions on Parallel and Distributed Systems 32, 4 (2020), 799--814.
    [33]
    Min Xue, Huaming Wu, Ruidong Li, Minxian Xu, and Pengfei Jiao. 2021. EosDNN: An efficient offloading scheme for DNN inference acceleration in local-edge-cloud collaborative environments. IEEE Transactions on Green Communications and Networking 6, 1 (2021), 248--264.
    [34]
    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385 (2024).

    Published In

    EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models
    June 2024
    44 pages
    ISBN: 9798400706639
    DOI: 10.1145/3662006
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Conference

    MobiSys '24
