Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things
Abstract
1. Introduction
- To obtain efficient medical LLMs for healthcare applications, we fine-tune an LLM based on the MoE architecture, named MedMixtral 8x7B, using medical datasets to meet individuals’ personalized healthcare needs.
- To deploy MedMixtral 8x7B on IoMT devices, we propose a novel offloading strategy that enables MedMixtral 8x7B to run in the IoMT with lower resource requirements, thereby enhancing privacy protection for users.
- To assess the impact of latency on LLM inference speed, we analyze both the local and interconnection communication models. We highlight the critical role of latency in the inference process and propose several strategies to reduce it, including increasing disk speed, storing more model weights in the CPU’s RAM rather than on disk, and opting for wireless communications (a brief numerical sketch follows this list).
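The following minimal sketch illustrates the latency trade-off behind the second strategy above (keeping more layers in the CPU’s RAM instead of on disk). It is not the paper’s implementation; the layer size, link speeds, and layer counts are illustrative placeholder values chosen only to show the shape of the calculation.

```python
# Minimal sketch of the per-layer loading latency trade-off discussed above.
# All numbers are illustrative placeholders, not measurements from the paper.

def local_latency(n_layers_total: int,
                  n_layers_on_disk: int,
                  layer_weight_gb: float,
                  disk_to_ram_gbps: float,
                  ram_to_vram_gbps: float) -> float:
    """Estimate local communication latency (seconds) for one forward pass.

    Layers kept in the CPU's RAM only pay the RAM -> VRAM transfer, while
    layers offloaded to disk additionally pay the disk -> RAM transfer.
    """
    n_layers_in_ram = n_layers_total - n_layers_on_disk
    ram_path = n_layers_in_ram * layer_weight_gb / ram_to_vram_gbps
    disk_path = n_layers_on_disk * layer_weight_gb * (
        1.0 / disk_to_ram_gbps + 1.0 / ram_to_vram_gbps)
    return ram_path + disk_path

# Storing more weights in RAM (fewer layers on disk) reduces latency,
# as does a faster disk.
print(local_latency(32, 24, 1.5, 2.0, 12.0))  # many layers on disk
print(local_latency(32, 8, 1.5, 2.0, 12.0))   # more layers kept in RAM
```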
2. Related Work
2.1. Large Language Models
2.2. LLM Efficient Inference Offloading Methods
2.3. AI Used in Communication
3. Methods
3.1. Communication Model Design
3.1.1. Interconnection Communication Model
3.1.2. Local Communication Model
3.1.3. Total Communication Latency
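To make the structure of these subsections concrete, here is a minimal sketch of how the latency components listed in the notation table might compose. It assumes the standard Shannon capacity form, thermal noise power kTB, and linearly additive delay terms; the subscripted symbol names are ours and may differ from the paper’s exact formulation.

```latex
% A minimal composition of the latency terms from the notation table
% (subscripted names are ours; noise power is assumed thermal, N = kTB).
\begin{align}
  C &= B \log_2\!\bigl(1 + \mathrm{SNR}\bigr),
      \qquad \mathrm{SNR} = \frac{S}{kTB},\\
  L_{\text{inter}} &= \frac{D}{C} + d_{\text{prop}} + d_{\text{proc}}
      + d_{\text{queue}} + d_{\text{wait}},\\
  L_{\text{local}} &= \frac{W_{\text{disk}}}{v_{\text{disk}\to\text{RAM}}}
      + \frac{W_{\text{disk}} + W_{\text{RAM}}}{v_{\text{RAM}\to\text{VRAM}}},\\
  L_{\text{total}} &= L_{\text{inter}} + L_{\text{local}}.
\end{align}
```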
3.2. MedMixtral 8x7B
- Question1: Hello doctor, We have a 9-month-old baby girl. She has five brown marks, two of them are 2 cm in size and the other three are 0.5 cm. They are irregular in shape. The initial two marks appeared when she was 3 months old, and now that she is 9 months old, three more have appeared that are comparatively light and 0.5 cm in size. Are these birthmarks? Will they fade with time? Do we need to take any precautions so that they do not increase or get added further? Can we also provide any medication to cure them?
- Answer1: Hello, The brown marks you are describing are commonly known as birthmarks. The two larger ones that appeared when your baby was three months old are likely to be called “cafe-au-lait” spots, while the smaller, lighter ones may be called freckles. Cafe-au-lait spots are harmless and may fade slightly with time, but they do not typically disappear completely. Freckles, on the other hand, may darken with sun exposure but can also be lightened with topical treatments. While these birthmarks are generally not a cause for concern, it is always a good idea to monitor any changes in their size, shape, or color. If you notice any new or unusual growths, or if the existing birthmarks become significantly larger or darker, it is recommended to consult a healthcare professional. In terms of precautions, it is important to protect your baby’s skin from the sun, as UV radiation can darken the birthmarks and make them more noticeable. You can do this by dressing your baby in lightweight, long-sleeved clothing and a wide-brimmed hat, and by using a broad-spectrum sunscreen with an SPF of at least 30 on any exposed skin. As for medication, there is currently no cure or effective treatment for birthmarks. However, if the birthmarks are causing any discomfort, such as itching or irritation, your healthcare provider may recommend a topical cream or ointment to alleviate the symptoms. I hope this information is helpful. Please let me know if you have any further questions or concerns.
- Question2: Hello doctor, I have been going to a dentist about a tooth that is sore. They started a root canal but then did not finish. I have gone back several times and every time they open up the tooth and insert medicine. My tooth is in agony right now. I am just wondering if this is a standard procedure during a root canal? Should not the dentist remove the nerves so I am not in so much pain? Or is it the standard procedure to fight the infection first and then remove the nerves?
- Answer2: Hello, During a root canal procedure, the dentist typically removes the nerve tissue first in order to alleviate the pain caused by the infection. However, in some cases where the infection is severe, the dentist may choose to administer medication and apply a temporary filling to the tooth to eliminate the infection before completing the root canal. This is often done to make the procedure more manageable for the patient and to ensure the best possible outcome. The fact that your dentist has been repeatedly opening your tooth and applying medication suggests that they are trying to eliminate the infection before completing the root canal. While this approach can be somewhat uncomfortable, it is not uncommon and is often necessary to ensure the success of the procedure. I would recommend speaking with your dentist about your concerns and asking for more information about the approach they are taking. They may be able to provide you with additional pain relief measures or offer further insight into the procedure and what to expect.
- Question3: Hello doctor, I have been diagnosed with mild hereditary spherocytosis. My eyes are still yellow. Is it unlikely that the hematologist who diagnosed me three years ago misdiagnosed me? I do not doubt he knows his stuff; it is more that I am getting worried. I lost weight a couple of months ago. I went very yellow in the eyes. I have always felt tired. I had my gallbladder out four years ago, I bruise easily, I get little red skin dots at times, my menstrual cycles can be irregular, and my spleen is slightly enlarged.
- Answer3: Hello, While it is unlikely that your hematologist misdiagnosed you, the symptoms you have described could be indicative of other conditions such as liver disease or a more severe form of hereditary spherocytosis. The jaundice you mentioned, along with weight loss, fatigue, easy bruising, and enlarged spleen, are all potential symptoms of these conditions. It would be best to consult with your hematologist and possibly a gastroenterologist or liver specialist to rule out any other potential causes for your symptoms. They may want to perform additional tests, such as a liver function test or a bone marrow biopsy, to further evaluate your condition. It is important to remember that this advice is intended to be general in nature, and specific causes may not apply to your situation. Always consult with a healthcare professional for an accurate diagnosis. I hope this information helps, and please let me know if you have any further questions or concerns.
- Question4: Good evening, doctor. I have a severe toothache. It hurts so bad. It is difficult for me to eat. The pain is like a sharp knife cutting my teeth. What is the cause of it?
- Answer4: Good evening, I’m sorry to hear that you’re experiencing tooth pain. The description you provided suggests that the pain is severe and sharp, which could indicate a few possible causes. The most common cause of tooth pain is tooth decay, which can cause a sharp pain when eating or drinking something sweet, hot or cold. Another possible cause is an infection or abscess in the tooth or gum, which can cause a persistent, throbbing pain. In some cases, tooth pain can also be caused by gum disease, injury, or a cracked tooth. It is important to see a dentist as soon as possible to determine the cause of the pain and to receive appropriate treatment. In the meantime, you can try taking over-the-counter pain relievers to help manage the pain. However, it is not recommended to apply pain relievers directly to the tooth or gums, as this can burn the gum tissue.
3.3. Efficient Inference Offloading
Efficient Inference Offloading Algorithm
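As a rough illustration of the kind of placement decision an offloading algorithm has to make, the sketch below greedily assigns model layers to the GPU’s VRAM, then the CPU’s RAM, and leaves the remainder on disk. It is not the paper’s algorithm; the budgets and per-layer weight size are assumed, illustrative values.

```python
# Hypothetical greedy layer-placement sketch for inference offloading.
# NOT the paper's algorithm: it simply fills the GPU's VRAM first, then the
# CPU's RAM, and leaves the remaining layers on disk, using assumed sizes.

from dataclasses import dataclass

@dataclass
class Placement:
    layers_in_vram: int
    layers_in_ram: int
    layers_on_disk: int

def place_layers(n_layers: int, layer_weight_gb: float,
                 vram_budget_gb: float, ram_budget_gb: float) -> Placement:
    """Greedily assign model layers to VRAM, then RAM, then disk."""
    in_vram = min(n_layers, int(vram_budget_gb // layer_weight_gb))
    remaining = n_layers - in_vram
    in_ram = min(remaining, int(ram_budget_gb // layer_weight_gb))
    on_disk = remaining - in_ram
    return Placement(in_vram, in_ram, on_disk)

# Example: a 32-layer MoE model with ~1.5 GB per layer on an IoMT gateway
# with 8 GB of VRAM and 16 GB of RAM (illustrative numbers only).
print(place_layers(32, 1.5, 8.0, 16.0))
```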
4. Results
4.1. MedMixtral 8x7B
4.2. Offloading Strategy
4.3. Communication Latency
5. Discussion and Limitations
5.1. Hallucinations in LLMs within Medical Settings
5.2. Expansion of the HealthCareMagic-100k Dataset
5.3. Improving Evaluation Methods
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Symbol | Description |
|---|---|
| | Channel bandwidth |
| | Channel capacity |
| | Propagation delay |
| | Processing delay |
| | Queueing delay |
| | Waiting delay |
| | Data volume |
| k | Boltzmann constant |
| | Latency from the CPU’s RAM to the GPU’s VRAM |
| | Interconnection communication latency |
| | Latency from the disk to the CPU’s RAM |
| | Latency from the disk to the GPU’s VRAM |
| | Local communication latency |
| | Set of IoMT devices |
| | Model weight for one layer |
| | Speed from the CPU’s RAM to the GPU’s VRAM |
| | Speed from the disk to the CPU’s RAM |
| | Signal power |
| | Signal-to-noise ratio |
| T | Absolute temperature in kelvin |
| | Total communication latency |
| | Number of model layers on device n |
| | Number of model layers offloaded to the disk on device n |
| | White Gaussian noise |
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| ChatGPT | 0.837 | 0.8445 | 0.8406 |
| Mixtral 8x7B | 0.821 | 0.8434 | 0.8320 |
| MedMixtral 8x7B | 0.838 | 0.8447 | 0.8413 |
| Llama 8B | 0.799 | 0.8384 | 0.8179 |
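As a quick sanity check on the metric, the reported F1 scores agree with the harmonic mean of precision and recall up to the rounding of the reported values; a minimal verification sketch (not from the paper) is shown below.

```python
# Verify that F1 matches the harmonic mean of precision and recall
# for each model, up to rounding of the reported figures.
scores = {
    "ChatGPT":         (0.837, 0.8445, 0.8406),
    "Mixtral 8x7B":    (0.821, 0.8434, 0.8320),
    "MedMixtral 8x7B": (0.838, 0.8447, 0.8413),
    "Llama 8B":        (0.799, 0.8384, 0.8179),
}
for name, (p, r, f1) in scores.items():
    harmonic = 2 * p * r / (p + r)
    print(f"{name}: reported F1 = {f1:.4f}, 2PR/(P+R) = {harmonic:.4f}")
```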