FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design

Nayak, Nandeeka; Wu, Xinrui; Odemuyiwa, Toluwanimi O.; Pellauer, Michael; Emer, Joel S.; Fletcher, Christopher W.

arXiv:2406.10491 (cs)

[Submitted on 15 Jun 2024 (v1), last revised 31 Oct 2024 (this version, v3)]

Abstract: Attention for transformers is a critical workload that has recently received significant "attention" as a target for custom acceleration. Yet, while prior work succeeds in reducing attention's memory-bandwidth requirements, it creates load imbalance between operators that comprise the attention computation (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is expected to grow over time).
This paper ameliorates these issues, enabling attention with nearly 100% compute utilization, no off-chip memory traffic bottlenecks, and on-chip buffer size requirements that are independent of sequence length. The main conceptual contribution is to use a recently proposed abstraction -- the cascade of Einsums -- to describe, formalize, and taxonomize the space of attention algorithms that appear in the literature. In particular, we show how Einsum cascades can be used to infer non-trivial lower bounds on the number of passes a kernel must take through its input data, which has implications for either required on-chip buffer capacity or memory traffic. We show how this notion can be used to meaningfully divide the space of attention algorithms into several categories and use these categories to inform our design process.
Based on the above characterization, we propose FuseMax -- a novel mapping and binding of attention onto a spatial array-style architecture. On attention, in an iso-area comparison, FuseMax achieves an average 6.7x speedup over the prior state-of-the-art, FLAT, while using 79% of the energy. Similarly, on full end-to-end transformer inference, FuseMax achieves an average 5.3x speedup over FLAT using 83% of the energy.
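
As an informal illustration of the "number of passes" idea in the abstract, the sketch below writes attention as a short sequence of Einsum-style steps in NumPy, in two ways. It is not the FuseMax mapping and is not taken from the paper. The function names, the tile size, and the rank letters (p, m, e, f) are assumptions made for this example, and the usual 1/sqrt(E) scaling is omitted for brevity. The first version materializes the full score matrix and revisits it several times, so its buffer needs grow with the key/value sequence length. The second uses the well-known online (running-max) softmax over tiles, so its working state does not depend on that length.

```python
import numpy as np

def attention_multi_pass(Q, K, V):
    """Attention as a cascade of Einsum-like steps with a stable softmax.

    The score matrix QK (shape P x M) is fully materialized and then
    read again for the max, the exponentiation, and the normalization.
    """
    QK = np.einsum("pe,me->pm", Q, K)      # scores for every query/key pair
    GM = QK.max(axis=1)                    # global max per query row
    SN = np.exp(QK - GM[:, None])          # softmax numerator
    SD = SN.sum(axis=1)                    # softmax denominator
    A = SN / SD[:, None]                   # normalized attention weights
    return np.einsum("pm,mf->pf", A, V)

def attention_one_pass(Q, K, V, tile=4):
    """Same result, visiting K and V once in tiles (online softmax).

    Only per-query running state is kept, so the working set is
    independent of the number of keys M. `tile` is a hypothetical knob.
    """
    P, F = Q.shape[0], V.shape[1]
    running_max = np.full(P, -np.inf)
    running_den = np.zeros(P)
    running_out = np.zeros((P, F))
    for start in range(0, K.shape[0], tile):
        Kt = K[start:start + tile]
        Vt = V[start:start + tile]
        S = np.einsum("pe,me->pm", Q, Kt)              # scores for this tile
        new_max = np.maximum(running_max, S.max(axis=1))
        scale = np.exp(running_max - new_max)          # rescale old partial results
        Pt = np.exp(S - new_max[:, None])              # tile numerator
        running_den = running_den * scale + Pt.sum(axis=1)
        running_out = running_out * scale[:, None] + np.einsum("pm,mf->pf", Pt, Vt)
        running_max = new_max
    return running_out / running_den[:, None]
```

Both functions should agree up to floating-point error on the same inputs; the difference is only in how often, and how much of, the intermediate score data must be held or re-read.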

Comments:	16 pages, 12 figures
Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2406.10491 [cs.AR]
	(or arXiv:2406.10491v3 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2406.10491

Submission history

From: Nandeeka Nayak
[v1] Sat, 15 Jun 2024 04:07:05 UTC (355 KB)
[v2] Tue, 25 Jun 2024 22:22:12 UTC (249 KB)
[v3] Thu, 31 Oct 2024 22:34:20 UTC (614 KB)
