SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Mudvari, Akrit; Jiang, Yuang; Tassiulas, Leandros

arXiv:2410.10759 (cs)

[Submitted on 14 Oct 2024 (v1), last revised 16 Oct 2024 (this version, v2)]

Abstract: Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in daily life because of their ability to understand and generate human-like text. Their applications include natural language understanding, information retrieval and search, translation, chatbots, virtual assistance, and many more. However, LLMs are massive in terms of parameter count. In addition, the self-attention mechanism of the underlying Transformer architecture has quadratic complexity in both computation and memory with respect to the input sequence length. For these reasons, LLM inference is resource-intensive, and its throughput is limited, especially for longer sequences. In this report, we design a collaborative inference architecture between a server and its clients to alleviate the throughput limit. The design accounts for the resources available on both sides, i.e., the computation and communication costs. We develop a dynamic programming-based algorithm that optimally allocates computation between the server and the client device to increase server throughput without violating the service level agreement (SLA). Our experiments show that the workload can be distributed efficiently, allowing a reduction of roughly one third in the server workload while achieving a 19 percent improvement over a greedy method. As a result, we demonstrate that server throughput improves in an environment with different types of LLM inference requests.
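The abstract does not spell out the cost model or the recurrence, so the following is only an illustrative sketch of how a dynamic program of this kind might look, not the authors' algorithm. It assumes a single request whose layers run in sequence, integer per-layer run times in milliseconds on each side, a transfer cost whenever the activation crosses the link, and a request that starts and ends on the client. The function name `plan_split` and all of its parameters are hypothetical.

```python
from math import inf

CLIENT, SERVER = 0, 1


def plan_split(client_ms, server_ms, transfer_ms, sla_ms):
    """Place each layer on the client or the server (illustrative only).

    Minimizes total server compute time subject to end-to-end latency
    <= sla_ms. transfer_ms[i] is the assumed cost of moving the input of
    layer i across the link; transfer_ms[n] is the cost of returning the
    final output to the client. Returns (server_load_ms, placement), or
    None when no placement meets the SLA.
    """
    n = len(client_ms)
    # cost[loc][t]: least server compute so far, with the activation
    # sitting at `loc` and exactly t ms of latency consumed.
    cost = [[inf] * (sla_ms + 1) for _ in range(2)]
    cost[CLIENT][0] = 0
    back_pointers = []
    for i in range(n):
        nxt = [[inf] * (sla_ms + 1) for _ in range(2)]
        back = [[None] * (sla_ms + 1) for _ in range(2)]
        for prev in (CLIENT, SERVER):
            for t in range(sla_ms + 1):
                if cost[prev][t] == inf:
                    continue
                for loc in (CLIENT, SERVER):
                    run = client_ms[i] if loc == CLIENT else server_ms[i]
                    hop = transfer_ms[i] if loc != prev else 0
                    t2 = t + run + hop
                    if t2 > sla_ms:
                        continue  # this prefix already violates the SLA
                    load = cost[prev][t] + (run if loc == SERVER else 0)
                    if load < nxt[loc][t2]:
                        nxt[loc][t2] = load
                        back[loc][t2] = (prev, t)
        cost = nxt
        back_pointers.append(back)

    # The result must end up on the client, so a server-side finish
    # pays one last transfer.
    best = None
    for loc in (CLIENT, SERVER):
        hop = transfer_ms[n] if loc == SERVER else 0
        for t in range(sla_ms + 1 - hop):
            if cost[loc][t] < inf and (best is None or cost[loc][t] < best[0]):
                best = (cost[loc][t], loc, t)
    if best is None:
        return None

    # Walk the back pointers to recover the per-layer placement.
    load, loc, t = best
    placement = []
    for i in reversed(range(n)):
        placement.append("client" if loc == CLIENT else "server")
        loc, t = back_pointers[i][loc][t]
    placement.reverse()
    return load, placement
```

With three layers costing 4 ms each on the client and 1 ms each on the server, and 2 ms per transfer, a 12 ms SLA lets the client run everything, while a 7 ms SLA forces every layer onto the server. The table has O(n · sla_ms) entries, so the latency granularity directly controls the cost of planning.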

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Cite as:	arXiv:2410.10759 [cs.DC]
	(or arXiv:2410.10759v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2410.10759

Submission history

From: Akrit Mudvari
[v1] Mon, 14 Oct 2024 17:38:41 UTC (726 KB)
[v2] Wed, 16 Oct 2024 16:31:37 UTC (726 KB)
