Online Speculative Decoding

Liu, Xiaoxuan; Hu, Lanxiang; Bailis, Peter; Stoica, Ion; Deng, Zhijie; Cheung, Alvin; Zhang, Hao


arXiv:2310.07177v1 (cs)

[Submitted on 11 Oct 2023 (this version), latest version 10 Jun 2024 (v4)]


Abstract: Speculative decoding is a pivotal technique for accelerating the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited by the low predictive accuracy of the draft model, particularly when inputs are diverse and the capability gap between the draft and target models is large. We introduce online speculative decoding (OSD) to address this challenge. The main idea is to continually update one or more draft models on observed user query data, using the abundant excess computational power in an LLM serving cluster. Because LLM inference is memory-bound, the surplus computational power in a typical LLM serving cluster can be repurposed for online retraining of draft models, making the training cost-neutral. Because the query distribution of an LLM service is relatively simple, retraining on it enables the draft model to predict the target model's outputs more accurately, particularly on data drawn from that distribution. As the draft model evolves online, it aligns with the query distribution in real time, mitigating distribution shifts. We develop a prototype of online speculative decoding based on online knowledge distillation and evaluate it using both synthetic and real query data on several popular LLMs. The results show a substantial increase in the token acceptance rate, by 0.1 to 0.65, which translates into a 1.22x to 3.06x latency reduction.
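The link the abstract draws between token acceptance rate and latency can be illustrated with a minimal sketch. It uses the standard simplified analysis of speculative decoding, which is not stated in this abstract and is an assumption here: each of the k drafted tokens is accepted independently with probability alpha, and one draft forward pass costs a fraction c of one target forward pass. The function names and the parameter values (k = 4, c = 0.05) are hypothetical, and the output is not meant to reproduce the paper's reported 1.22x to 3.06x figures.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes (simplification) that each of the k drafted tokens is accepted
    independently with probability alpha. The target model always
    contributes one token itself, so the result lies between 1 and k + 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the one token from the target.
        return float(k + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    c is the assumed cost of one draft forward pass relative to one target
    forward pass. One step costs k draft passes plus one target pass.
    """
    if c < 0.0:
        raise ValueError("c must be non-negative")
    return expected_tokens_per_step(alpha, k) / (k * c + 1.0)


if __name__ == "__main__":
    # Hypothetical settings: 4 drafted tokens, draft pass at 5% of target cost.
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, 4, 0.05):.2f}x")
```

Under these assumptions the speedup grows monotonically with alpha, which is why raising the acceptance rate through online retraining of the draft model would be expected to reduce latency.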

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2310.07177 [cs.AI]
	(or arXiv:2310.07177v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2310.07177

Submission history

From: Xiaoxuan Liu
[v1] Wed, 11 Oct 2023 04:03:42 UTC (908 KB)
[v2] Tue, 17 Oct 2023 18:02:19 UTC (908 KB)
[v3] Fri, 7 Jun 2024 00:14:47 UTC (2,243 KB)
[v4] Mon, 10 Jun 2024 01:36:31 UTC (2,243 KB)
