WiP: Efficient LLM Prefilling with Mobile NPU
Publisher: Association for Computing Machinery, New York, NY, United States
Qualifiers: Short-paper, Research, Refereed limited