DOI: 10.1145/3662006.3662066 (short paper)

WiP: Efficient LLM Prefilling with Mobile NPU

Published: 11 June 2024

Abstract

Large language models (LLMs) underpin a wide range of Natural Language Processing (NLP) tasks, motivating their deployment on mobile devices for on-device inference. A major obstacle, however, is the long waiting latency of the prefill stage, especially for long prompts. This paper introduces mllm-NPU, the first system to accelerate on-device LLM prefilling with on-chip Neural Processing Units (NPUs). Despite the impressive compute capability of NPUs, applying them directly to LLM prefilling falls short. To this end, mllm-NPU incorporates two key techniques: (1) chunk-wise CPU-NPU co-scheduling, which works around the NPU's static compute graphs and INT8-only acceleration; and (2) dynamic outlier inference, which recovers the accuracy lost to static activation quantization.
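The chunk-wise idea can be sketched as follows: a static-graph NPU is compiled for one fixed sequence length, so a variable-length prompt is split into fixed-size chunks and the same compiled graph is invoked once per chunk. This is a minimal illustrative sketch, not the paper's implementation; the chunk size and the `run_chunk` callback are hypothetical stand-ins.

```python
import numpy as np

CHUNK = 256  # hypothetical fixed chunk length the static NPU graph is compiled for

def chunked_prefill(prompt_ids, run_chunk):
    """Prefill a variable-length prompt on a static-shape accelerator by
    splitting it into fixed-size chunks and padding the final chunk."""
    results = []
    for start in range(0, len(prompt_ids), CHUNK):
        chunk = np.asarray(prompt_ids[start:start + CHUNK])
        mask = np.zeros(CHUNK, dtype=bool)
        mask[:len(chunk)] = True                 # mark real (non-padded) tokens
        padded = np.zeros(CHUNK, dtype=chunk.dtype)
        padded[:len(chunk)] = chunk              # static input shape for the NPU graph
        results.append(run_chunk(padded, mask))  # one invocation of the fixed graph
    return results
```

Here `run_chunk` stands in for one execution of the compiled NPU graph; in a real system the KV cache produced by earlier chunks would be carried across calls, and the CPU could co-execute work the NPU cannot handle.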

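Dynamic outlier handling addresses a known weakness of static INT8 activation quantization: a few outlier channels dominate the dynamic range and ruin the quantization scale for everything else. A hedged sketch of the general outlier-splitting idea follows; the threshold and function name are illustrative, not taken from the paper.

```python
import numpy as np

def split_outliers_int8(x, threshold=6.0):
    """Quantize activations to INT8 while routing outlier channels around
    the quantizer (the threshold is an illustrative value, not the paper's)."""
    outlier = np.abs(x).max(axis=0) > threshold   # channels too large for INT8
    inliers = x[:, ~outlier]
    scale = max(float(np.abs(inliers).max()), 1e-8) / 127.0
    q = np.clip(np.round(inliers / scale), -127, 127).astype(np.int8)
    return q, scale, x[:, outlier], outlier       # INT8 part + float outlier part
```

Under this scheme the INT8 portion can run on the NPU while the small float outlier slice is handled on the CPU (or in higher precision); merging the two partial results recovers the full computation.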


Published In

EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models
June 2024
44 pages
ISBN: 9798400706639
DOI: 10.1145/3662006

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Large language model
  2. Mobile device
  3. NPU

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

MOBISYS '24
