short-paper

WiP: Efficient LLM Prefilling with Mobile NPU

Authors:

Xuanzhe Liu

EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models

Pages 33 - 35

https://doi.org/10.1145/3662006.3662066

Published: 11 June 2024

Abstract

Large language models (LLMs) play a crucial role in various Natural Language Processing (NLP) tasks, prompting their deployment on mobile devices for inference. However, on-device inference suffers from high waiting latency, especially for long prompts. This paper introduces mllm-NPU, the first system enabling efficient on-device LLM prefilling acceleration using on-chip Neural Processing Units (NPUs). Despite the impressive compute capabilities of NPUs, applying them directly to LLM prefilling often falls short. To this end, mllm-NPU incorporates two key techniques: (1) chunk-wise CPU-NPU co-scheduling, which addresses the static compute graphs and INT8-only acceleration of NPUs; and (2) dynamic outlier inference, which addresses the accuracy loss caused by static activation quantization.
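The abstract gives no implementation details, so the following is only a minimal Python sketch of the two ideas under stated assumptions; it is not mllm-NPU's code. It assumes that (a) a static compute graph accepts exactly `chunk_size` tokens, so a prompt is split into padded fixed-size chunks, and (b) an "outlier" is an activation that a fixed INT8 scale would clip, with the clipped remainder handled in a floating-point side path. All names (`split_into_chunks`, `split_outliers`, `dot_with_outliers`, `pad_id`, `scale`) are hypothetical, and plain Python arithmetic stands in for the NPU and the CPU.

```python
from typing import List, Tuple


def split_into_chunks(
    tokens: List[int], chunk_size: int, pad_id: int = 0
) -> List[Tuple[List[int], int]]:
    """Split a variable-length prompt into fixed-size, padded chunks.

    Returns (padded_chunk, valid_length) pairs, so that a graph built for
    exactly `chunk_size` tokens could be reused for every chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        piece = tokens[start:start + chunk_size]
        valid = len(piece)
        chunks.append((piece + [pad_id] * (chunk_size - valid), valid))
    return chunks


def split_outliers(
    x: List[float], scale: float, qmax: int = 127
) -> Tuple[List[int], List[float]]:
    """Quantize with a fixed (static) scale and keep what clipping loses.

    `q` holds the INT8 codes. `residual` holds the part of each value that
    falls outside [-scale * qmax, scale * qmax]; it is zero for in-range
    values. Ordinary rounding error of in-range values is not recovered.
    """
    limit = scale * qmax
    q, residual = [], []
    for v in x:
        clipped = max(-limit, min(limit, v))
        q.append(round(clipped / scale))
        residual.append(v - clipped)
    return q, residual


def dot_with_outliers(w: List[float], x: List[float], scale: float) -> float:
    """Dot product as an INT8 main path plus a sparse float outlier path."""
    q, residual = split_outliers(x, scale)
    # Assumption: this sum stands in for the quantized NPU computation.
    int8_path = scale * sum(wi * qi for wi, qi in zip(w, q))
    # Assumption: this sum stands in for a float CPU path over outliers only.
    float_path = sum(wi * ri for wi, ri in zip(w, residual) if ri != 0.0)
    return int8_path + float_path
```

For example, with `scale = 0.25`, `w = [1.0, 2.0, 3.0]` and `x = [0.5, -1.0, 40.0]`, the value 40.0 is clipped to 31.75, so the INT8 path alone contributes 93.75. Adding the floating-point residual gives 118.5, which equals the exact dot product.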

Published In

EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models

June 2024

44 pages

ISBN: 9798400706639

DOI: 10.1145/3662006

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Funding Sources

National Natural Science Foundation of China

Conference

MOBISYS '24

Sponsor:

SIGMOBILE

MOBISYS '24: The 22nd Annual International Conference on Mobile Systems, Applications and Services

June 3 - 7, 2024

Tokyo, Minato-ku, Japan
