Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Ankner, Zachary; Parthasarathy, Rishab; Nrusimha, Aniruddha; Rinard, Christopher; Ragan-Kelley, Jonathan; Brandon, William

arXiv:2402.05109 (cs)

[Submitted on 7 Feb 2024 (v1), last revised 7 Oct 2024 (this version, v2)]

Abstract: To combat the memory bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. To perform speculative decoding, a small draft model proposes candidate continuations of the input sequence that are then verified in parallel by the base model. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate tokens in the candidate continuation independently of any preceding tokens in the candidate continuation. In this work, we propose Hydra heads: a sequentially-dependent drop-in replacement for standard draft heads that significantly improves the accuracy of draft head speculation. We further explore the design space of Hydra head training objectives and architectures, and propose a carefully tuned Hydra head recipe, which we call Hydra++, that improves decoding throughput by up to 1.31x and 2.70x compared to Medusa decoding and autoregressive decoding respectively. Overall, Hydra heads are a simple and well-motivated intervention on standard draft heads that significantly improve the end-to-end speed of draft head-based speculative decoding. We make our code publicly available at this https URL.
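
The draft-then-verify loop and the difference between sequentially independent and sequentially dependent drafting can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' code: `base_next` is a made-up deterministic rule standing in for the base model's greedy next token, `draft_independent` and `draft_dependent` are hypothetical stand-ins for the two kinds of draft heads (the dependent one is idealized to be perfectly accurate), and verification is written as a loop even though a real system would check all candidates in a single parallel forward pass.

```python
from typing import List

VOCAB = 10  # toy vocabulary size (assumption)


def base_next(tokens: List[int]) -> int:
    """Toy stand-in for the base model's greedy next token (made-up rule)."""
    return (tokens[-1] * 3 + tokens[-2] + 1) % VOCAB


def draft_independent(prefix: List[int], k: int) -> List[int]:
    """Hypothetical independent heads: head i sees only the verified prefix,
    never the tokens drafted by heads 0..i-1."""
    return [(prefix[-1] * 3 + prefix[-2] + 1 + i) % VOCAB for i in range(k)]


def draft_dependent(prefix: List[int], k: int) -> List[int]:
    """Hypothetical dependent heads: head i also conditions on the tokens
    drafted before it. Idealized here to reproduce the base rule exactly."""
    tokens = list(prefix)
    draft = []
    for _ in range(k):
        nxt = base_next(tokens)
        draft.append(nxt)
        tokens.append(nxt)
    return draft


def verify(prefix: List[int], draft: List[int]) -> List[int]:
    """Accept the longest run of draft tokens that matches the base model's
    greedy choices, then append one token from the base model itself."""
    tokens = list(prefix)
    accepted = []
    for candidate in draft:
        expected = base_next(tokens)
        if candidate != expected:
            accepted.append(expected)  # base model's correction ends the step
            return accepted
        accepted.append(candidate)
        tokens.append(candidate)
    accepted.append(base_next(tokens))  # all drafts accepted: bonus token
    return accepted


prefix = [1, 2]
print(verify(prefix, draft_independent(prefix, 3)))  # [8, 7]
print(verify(prefix, draft_dependent(prefix, 3)))    # [8, 7, 0, 8]
```

In both cases the accepted tokens agree with what greedy autoregressive decoding of the toy base rule would produce; the only difference is how many tokens are gained per verification step, which is the quantity that more accurate drafting aims to increase.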

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2402.05109 [cs.LG]
	(or arXiv:2402.05109v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.05109

Submission history

From: Zachary Ankner
[v1] Wed, 7 Feb 2024 18:58:50 UTC (144 KB)
[v2] Mon, 7 Oct 2024 16:21:29 UTC (181 KB)
