FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design

Nandeeka Nayak^∗, Xinrui Wu^∗∗, Toluwanimi O. Odemuyiwa^∗∗∗, Michael Pellauer^†, Joel S. Emer^†‡, Christopher W. Fletcher^∗
^∗University of California, Berkeley, ^∗∗Tsinghua University, ^∗∗∗University of California, Davis,
^†NVIDIA, ^‡Massachusetts Institute of Technology
{nandeeka, cwfletcher}@berkeley.edu, xr-wu20@mails.tsinghua.edu.cn,
todemuyiwa@ucdavis.edu, mpellauer@nvidia.com, jsemer@mit.edu

Abstract

Attention for transformers is a critical workload that has recently received significant ‘attention’ as a target for custom acceleration. Yet, while prior work succeeds in reducing attention’s memory-bandwidth requirements, it creates load imbalance between attention operators (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is expected to grow over time).

This paper ameliorates these issues, enabling attention with nearly 100% compute utilization, no off-chip memory traffic bottlenecks, and on-chip buffer size requirements that are independent of sequence length. The main conceptual contribution is to use a recently proposed abstraction—the cascade of Einsums—to describe, formalize and taxonomize the space of attention algorithms that appear in the literature. In particular, we show how Einsum cascades can be used to infer non-trivial lower bounds on the number of passes a kernel must take through its input data, which has implications for either required on-chip buffer capacity or memory traffic. We show how this notion can be used to meaningfully divide the space of attention algorithms into several categories and use these categories to inform our design process.

Based on the above characterization, we propose FuseMax—a novel mapping of attention onto a spatial array-style architecture. On attention, in an iso-area comparison, FuseMax achieves an average $6.7\times$ speedup over the prior state-of-the-art FLAT [27] while using $79\%$ of the energy. Similarly, on the full end-to-end transformer inference, FuseMax achieves an average $5.3\times$ speedup over FLAT using $83\%$ of the energy.

I Introduction

Over the past few years, transformers [48] have emerged as the model architecture of choice for a wide range of machine learning applications, from natural language processing [18, 29, 45, 46] to computer vision [19, 33] to speech recognition [5, 25]. This rise has been accompanied by a corresponding wave of proposals for accelerating transformers in both software [13, 15, 16] and hardware [27, 57].

Fortunately, many of the layers (projections, fully connected layers, etc.) used by transformers look very similar to prior generations of machine learning models. Their resource-intensive tensor products can be described and evaluated with existing tensor algebra accelerator modeling tools [28, 35, 40], and many of the other layers (e.g., layer normalization) have negligible impact on performance and can be safely ignored.

However, attention [48]—usually described as a matrix multiplication, a softmax, and then another matrix multiplication—does not fit into either of these boxes. For example, the softmax is both memory intensive (featuring low algorithmic reuse) and compute intensive (featuring exponentiation and division). Furthermore, attention’s characteristics preclude many “free lunches” often used to improve efficiency for other DNN models. For example, because all tensors are a function of the model inputs, there is no opportunity to amortize memory access costs with an increased batch size. Additionally, since none of the operands can be computed before the inputs are given, compression/strength reduction techniques (e.g., quantization [55, 22], sparsity [49, 34, 44], etc.) must be applied dynamically, leading to more complicated algorithms and hardware designs.

To illustrate the difficulty in accelerating attention, consider the state-of-the-art accelerator for attention: FLAT [27]. FLAT uses fusion to reduce attention memory bandwidth bottlenecks on a spatial architecture (e.g., a TPU [26]). Specifically, FLAT maps attention’s matrix multiplications to the 2D spatial array and softmax operations to a separate 1D array. While FLAT’s design does make attention compute bound, it becomes compute bottlenecked in the 1D array (the softmax), causing severe underutilization of the 2D array. While one could add additional PEs to the 1D array, this results in commensurate area costs.

Making matters worse, FLAT requires that the entire vector over which the softmax is performed be buffered on chip. This vector is proportional to the sequence length, which is growing rapidly with time (e.g., Google reports 10 million length sequences in research, which would require hundreds of megabytes to buffer [1]). When the vector/sequence length grows beyond allowable buffer capacity, FLAT is forced to spill, which contributes significantly to attention energy consumption and can even make attention memory-bandwidth bound.

This paper. We address the above challenges by proposing a novel spatial architecture – FuseMax – to accelerate attention, with particular emphasis on removing bottlenecks imposed by the softmax. Our architecture addresses all of the aforementioned issues associated with FLAT. Namely:

• FuseMax is compute bound, but provides almost 100% utilization of both the 2D and 1D arrays throughout the attention operation, without adding additional PEs to the 1D array.
• FuseMax’s on-chip memory requirements are invariant to sequence length and require no extra spills to memory regardless of sequence length.

The technical core of the paper has three parts.

First, Section III demonstrates a novel analysis on kernels that uses the recently proposed cascade of Einsums abstraction [35]. In a nutshell, an Einsum defines an iteration space over tensors and what computation is done on and between tensors at each point in the iteration space. A cascade of Einsums is a sequence of dependent Einsums that can be used to describe and specify a larger kernel.

While prior work [35, 38] provides a precise definition for Einsums, a major contribution in our work is to show how this definition can be leveraged to inform accelerator design. Specifically, we recognize that the cascade makes explicit precisely what dependencies there are between Einsums. We show how this can be used to make non-trivial deductions about a kernel’s allowed fusion granularity and algorithmic minimum per-tensor live footprint. The relationship between the live footprint and the buffer capacity, in turn, has implications for the required data movement.

In more detail, this analysis provides insight into the number of passes an algorithm performs, i.e., the number of times a given element of an input must be revisited after visiting every other element of the input. Normally, one strives to choose a dataflow that exploits maximal reuse in a given element (or tile of elements) to avoid having to frequently reload it. However, some algorithms preclude this strategy. In this work, we describe how to count the number of passes a cascade requires and present two methods for reducing the number of passes. In general, fewer passes are preferable, although, interestingly, we find that decreasing the number of passes can increase the required compute. Given that an Einsum cascade is mapping/scheduling agnostic, this analysis provides insight given any possible scheduling of the cascade onto hardware.

Next, Section IV applies the cascade of Einsums abstraction to describe and formalize the attention kernel. Using the notion of passes introduced in Section III, we taxonomize the space of numerically stable attention proposals that appear in the literature. For example, in a naïve implementation of attention, one must traverse the entire softmax input to build the softmax denominator and only after that can one revisit and scale each input (softmax numerator) by the denominator. We show how transforming the attention cascade reduces the number of passes required. Because this analysis is performed on the cascade of Einsums, our lower bounds on passes hold for all mapping choices, including application of fusion. For example, despite using fusion, FLAT employs a 3-pass cascade and its reliance on large on-chip buffering is a symptom of trying to avoid three passes-worth of DRAM traffic.

Additionally, we find that expressing attention as a cascade of Einsums reveals that optimizations that were previously conflated can actually be applied separately. We specifically call out one that is used by 1-pass algorithms to eliminate the need for a second pass after the final softmax denominator has been calculated. We recognize that this optimization has the added benefit of decreasing the required divisions, which is useful not only for 1-pass cascades but for 2- and 3-pass cascades as well.

Finally, in the last part of the technical core (Section V), we use the insights from Section IV as a starting point to develop a novel mapping for attention that can be lowered to a spatial architecture. We call our architecture FuseMax. FuseMax adopts the 1-pass attention cascade used in FlashAttention-2 [15]. However, despite using the cascade from FlashAttention-2, mapping this cascade to a spatial architecture is non-trivial. In particular, FlashAttention-2 maps the cascade onto a GPU, an architecture that features homogeneous PEs, each with relatively large per-PE storage, and expensive inter-PE communication. Spatial architectures feature opposite characteristics: heterogeneous PEs, each with smaller per-PE storage, and cheap (but restricted) inter-PE communication. Specifically, the networks that connect the PEs within the 2D array allow efficient communication primarily between neighbors. We overcome these differences and demonstrate a novel mapping for the 1-pass cascade that achieves high utilization for entire transformer layers. Our architecture requires only minimal changes to a standard spatial architecture and is performance/energy robust to long sequence lengths (e.g., 1M tokens and beyond).

To summarize, we make the following contributions:

• We show how cascades of Einsums can be used to inform accelerator design, both in terms of reasoning about compute requirements and per-tensor live footprints. We formalize lower bounds on the number of passes a cascade imposes given any possible mapping of the cascade onto hardware.
• We use cascades of Einsums, and the observation about pass lower bounds, to provide a taxonomy and precise specification of numerically stable attention algorithms in the literature. Orthogonally, we show how previously-entangled attention optimizations can be applied across attention algorithms.
• We propose a novel mapping (dataflow) for attention for a spatial architecture, which we call FuseMax, that achieves high utilization for both 2D and 1D array PEs, and has memory traffic requirements that are independent of sequence length.
• We evaluate FuseMax on BERT [18], TrXL [14], T5 [46], and XLM [29] and demonstrate a $6.7\times$ speedup on attention with $79\%$ of the energy and a $5.3\times$ speedup on the full end-to-end inference with $83\%$ of the energy relative to FLAT.

II Background

In this section, we describe the concepts and terminology used in the remainder of the paper.

II-A Tensors

This paper focuses on algebraic computations on tensors, where a tensor is a multidimensional array. A tensor’s rank refers to a specific dimension of the tensor, while the tensor’s shape is the set of valid coordinates for each of the tensor’s ranks. We use the notation $N$ -tensor to denote a tensor with $N$ ranks, where a 0-tensor is a scalar, a 1-tensor is a vector, a 2-tensor is a matrix, etc.

We adopt the format-agnostic fibertree abstraction of tensors, where a tensor is represented as a tree of fibers, as detailed in prior work [53, 47, 37, 24, 52, 42, 35, 50], using the specific version described in Nayak et al. [35, Section 2.1]. In this abstraction, a fiber consists of the set of coordinates for a given rank with common coordinates for all higher-level ranks. Each coordinate is coupled with a payload. The payload may contain a reference to a fiber in the next lower rank, or to a leaf data value.
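To make the fibertree abstraction concrete, the following is a minimal sketch that encodes a 2-tensor as nested Python dictionaries. Each dictionary plays the role of one fiber (coordinate to payload), and a payload is either a lower-rank fiber or a leaf value. The dictionary encoding, the choice to omit zeros, and the helper name `to_fibertree` are assumptions made for this illustration; they are not the representation prescribed by the cited work.

```python
def to_fibertree(dense):
    """Build a nested-dict fibertree from a nested list, omitting zeros.

    Each dict is one fiber: coordinate -> payload. A payload is either a
    fiber of the next lower rank or a leaf data value.
    """
    if not isinstance(dense, list):
        return dense  # leaf data value
    fiber = {}
    for coord, sub in enumerate(dense):
        payload = to_fibertree(sub)
        # Keep only non-empty fibers and non-zero leaves (illustrative choice).
        if payload != {} and payload != 0:
            fiber[coord] = payload
    return fiber


# A 2-tensor with ranks (K, M); the top-level fiber belongs to the K rank.
A = [[0, 2, 0],
     [0, 0, 0],
     [5, 0, 7]]
tree = to_fibertree(A)
print(tree)  # {0: {1: 2}, 2: {0: 5, 2: 7}}
```

Here the $K$-rank fiber holds coordinates 0 and 2, and each of their payloads references an $M$-rank fiber holding the leaf values.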

II-B Traditional Einsums

An Einsum expression defines a computation on a set of tensor operands using an iteration space that specifies the set of points where the computations are performed [35, 38]. For example, we describe matrix-matrix multiplication (GEMM) computation with the following Einsum:

	$\displaystyle Z_{m,n}=A_{k,m}\times B_{k,n}$		(1)

where $A$ and $B$ are input 2-tensors of shape $K\times M$ and $K\times N$ , respectively. $Z$ is an output 2-tensor with shape $M\times N$ . Throughout this paper, the shape of a rank is also the name of that rank (e.g., rank $K$ in $A$ has a shape of $K$ ).

The iteration space of this Einsum is $[0,K)\times[0,M)\times[0,N)$ . Execution of this Einsum must: (1) walk every $(k,m,n)$ point in the iteration space; and, at each point (2) project into the data space of all input tensors, (3) multiply the corresponding data values, and (4) place the result at the corresponding data point in $Z$ . If a value already exists at an $(m,n)$ point in $Z$ (due to computation at a previous $(k,m,n)$ point), reduce the two values together using addition. Note that the Einsum specifies what to compute; it does not indicate the order in which one walks the iteration space. These aspects are left to mapping [10, 40, 35].
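The four execution steps above can be sketched as a plain loop nest. This is only an illustration: the function name `gemm_einsum`, the nested-list tensor layout, and the $k, m, n$ loop order are assumptions, and the loop order in particular is just one of many valid mappings.

```python
def gemm_einsum(A, B, K, M, N):
    """Evaluate Z[m][n] = A[k][m] * B[k][n], reducing over k with addition."""
    Z = [[None] * N for _ in range(M)]
    # (1) Walk every (k, m, n) point; this order is one arbitrary mapping.
    for k in range(K):
        for m in range(M):
            for n in range(N):
                # (2) Project into the inputs and (3) multiply the values.
                product = A[k][m] * B[k][n]
                # (4) Populate Z, reducing with addition if a value exists.
                if Z[m][n] is None:
                    Z[m][n] = product
                else:
                    Z[m][n] += product
    return Z


A = [[1, 2], [3, 4], [5, 6]]  # shape K x M = 3 x 2
B = [[1, 0], [0, 1], [1, 1]]  # shape K x N = 3 x 2
print(gemm_einsum(A, B, K=3, M=2, N=2))  # [[6, 8], [8, 10]]
```

Permuting the three loops changes the order in which the iteration space is walked, but not the result, which is the sense in which the Einsum leaves ordering to the mapping.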

II-C Extended Einsums

Traditional Einsums sufficiently express standard tensor algebra, including the operations supported in Basic Linear Algebra Subprograms (BLAS) [30, 20] and tensor network contractions [2]. However, they cannot handle more complex computations. The recently proposed Extended General Einsums notation (EDGE) [38] extends Einsums to handle graph algorithm computations. We find this abstraction useful for also expressing complex tensor algebra computations and use its notation throughout the paper. We now briefly summarize the portions of EDGE that we leverage.

II-C1 User-Defined Computations

EDGE separates computations into three “actions”: map ( $\bigwedge$ ), reduce ( $\bigvee$ ), and populate ( $=$ ) [38, Section 5]. Map specifies the pair-wise computation between the shared ranks of two tensors, reduce specifies the computation for the reduction step of an Einsum, and default populate ( $=$ ) places a computed value from the right-hand side (RHS) of the Einsum to its location on the left-hand side (LHS).

Each map and reduce action contains two operations: merge and compute. Compute defines the operation to apply between two data values, and can be any user-defined function. Merge specifies which regions of the iteration space to touch; execution will not need to access the data space corresponding to culled points. Together, merge and compute precisely define the computations in an Einsum. Common merge operations include intersection ( $\cap$ ), which touches points with non-zero values in both operands; and union ( $\cup$ ), which touches points where at least one of the operands is non-zero.

The full EDGE specification for GEMM is then:

	$\displaystyle Z_{m,n}=A_{k,m}\cdot B_{k,n}::\bigwedge_{k}\times(\cap)\bigvee_{k}+(\cup),$		(2)

where $\bigwedge_{k}$ specifies a map action between $A$ and $B$ on the $k$ rank and the intersection merge operator ( $\cap$ ) culls $k$ points where at least one operand is zero. The compute operator ( $\times$ ) multiplies the data values of coordinates surviving intersection. The reduce action ( $\bigvee_{k}$ ) on the $k$ rank gathers all non-empty points in the $k$ rank and reduces them using addition ( $+$ ).
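As a rough illustration of how the merge and compute operators in Equation 2 interact, the sketch below evaluates the GEMM on sparse operands stored as dictionaries that hold only non-zero values. The storage layout and the name `edge_gemm` are assumptions for this example; the intersection is applied first on the $k$ rank, and iterating only over stored values culls the remaining zero operands.

```python
def edge_gemm(A, B):
    """A, B: dicts k -> {coord -> non-zero value}. Returns Z as {(m, n): value}."""
    Z = {}
    # Map on k with intersection merge: keep only k present in both operands.
    for k in A.keys() & B.keys():
        for m, a in A[k].items():
            for n, b in B[k].items():
                # Compute (multiply) on surviving points; reduce over k with +.
                Z[(m, n)] = Z.get((m, n), 0) + a * b
    return Z


A = {0: {0: 2}, 2: {1: 3}}  # the k=1 fiber of A is empty
B = {0: {0: 4}, 1: {0: 5}}  # the k=2 fiber of B is empty
print(edge_gemm(A, B))  # {(0, 0): 8}
```

Only $k=0$ survives the intersection, so the data at $k=1$ and $k=2$ is never multiplied.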

In this work, we use three user-defined computations:

1. Maximum ( $\max(\cup)$ ) takes the maximum of two values. Suppose we have the following expression: $Z_{m}=A_{m}\cdot B_{m}::\bigwedge_{m}\max(\cup)$ . The union merge operator ( $\cup$ ) filters out any $m$ coordinates where both operands contain $0$ (and places 0 in the output). The $\max$ compute operator then returns the maximum of the two operands.
2. Divide ( $\div(\leftarrow)$ ) divides two data values. Given the following expression, $Z_{m}=A_{m}\cdot B_{m}::\bigwedge_{m}\div(\leftarrow)$ , the merge operator ( $\leftarrow$ ) only touches $m$ points where there is a non-zero value in the $B$ operand (see [38, Appendix]), and the compute operator divides the data value in $A$ by the data value in $B$ .
3. Exponentiation: we follow the example in EDGE [38, Section 7.4]. The expression $Z_{m}=e^{A_{m}}$ , where $e$ is Euler’s number, applies the exponential function to every element in $A$ . The exponent can also be an Einsum expression: $Z_{m}=e^{A_{m}\cdot B_{m}}::\bigwedge_{m}\times(\cap)$ .

In addition to map and reduce, EDGE enables the expression of user-defined unary operations on tensors. For example, we can express the application of the non-linear, sigmoid function ( $\sigma$ ) on each element of a tensor $A$ as $Z_{m}=\sigma(A_{m})$ .
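The merge behavior of the maximum and divide computations described above can be sketched on sparse 1-tensors stored as dictionaries of non-zero values. The helper names `map_max_union` and `map_div_left` and the dictionary layout are assumptions made for illustration only.

```python
def map_max_union(A, B):
    """Z_m = max(A_m, B_m) with union merge on sparse vectors (dicts)."""
    # Union: touch m where at least one operand is non-zero; absent means 0.
    return {m: max(A.get(m, 0), B.get(m, 0)) for m in A.keys() | B.keys()}


def map_div_left(A, B):
    """Z_m = A_m / B_m, touching only m where B_m is non-zero."""
    return {m: A.get(m, 0) / b for m, b in B.items() if b != 0}


maxed = map_max_union({0: 3, 2: 1}, {0: 5, 1: 2})
divided = map_div_left({0: 6, 1: 4}, {0: 3, 2: 5})
print(sorted(maxed.items()))    # [(0, 5), (1, 2), (2, 1)]
print(sorted(divided.items()))  # [(0, 2.0), (2, 0.0)]
```

Note that the divide never evaluates a point whose divisor is zero, so coordinate 1 of the second example produces no output rather than a division error.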

II-C2 Shorthand Notation

Throughout this paper, we take advantage of EDGE’s shorthand notation [38, Section 6] in the following ways:

• We drop all reduce actions that consist of add and union in the compute and merge operator, respectively ( $\bigvee+(\cup)$ ). Thus, $Z_{m}=A_{k,m}::\bigvee_{k}+(\cup)$ becomes $Z_{m}=A_{k,m}$ .
• We express all map actions using infix notation; that is, $A_{k,m}\cdot B_{k,n}::\bigwedge_{k}\times(\cap)$ becomes $A_{k,m}\times B_{k,n}$ .
• When $\max$ is part of a map action ( $A_{m}\cdot B_{m}::\bigwedge_{m}\max(\cup)$ ), we replace it with the following shorthand: $\max(A_{m},B_{m})$ .
• When $\div$ is part of a map action ( $A_{m}\cdot B_{m}::\bigwedge_{m}\div(\leftarrow)$ ), we replace it with the following: $A_{m}/B_{m}$ .

II-C3 Filtering Rank Expressions

EDGE also enables expressing Einsums that touch only a subset of the data space of their constituent tensors. For example, we may express prefix-sum of a tensor $A_{k}$ with the following Einsum:

	$\displaystyle S_{i+1}=A_{k:k\leq i}$

For each coordinate $i$ , $S_{i+1}$ is built by reducing together the subset of $A$ whose coordinates are $\leq i$ . Note that this definition of prefix-sum computes the entire sum for a given $i$ without iteratively reusing the previous sum.

II-C4 Expressing Iterative Computations

EDGE expresses recursion and iteration through generative/iterative ranks. We use the term standard ranks to differentiate non-iterative ranks from iterative ranks. We can express the iterative prefix-sum as follows:

	$\displaystyle S_{i+1}$	$\displaystyle=S_{i}+A_{i}$		(3)
		$\displaystyle\diamond:i\equiv K$		(4)

Here, $S$ is the iterative tensor that changes on each iteration, with the iterative rank, $i$ , ranging from $0$ to $K$ . Equation 4 indicates the stopping condition for the iterative expression (when $i$ is equal to $K$ ).
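The difference between the filtered prefix-sum of Section II-C3 and the iterative form in Equations 3 and 4 can be sketched as follows. The function names are hypothetical, and the initial value $S_{0}=0$ is an assumption, since the text does not state an initialization here.

```python
def prefix_sum_filtered(A):
    """S[i+1] = sum of A[k] for k <= i, recomputed from scratch for each i."""
    K = len(A)
    S = [0] * (K + 1)
    for i in range(K):
        S[i + 1] = sum(A[k] for k in range(K) if k <= i)
    return S


def prefix_sum_iterative(A):
    """S[i+1] = S[i] + A[i] (Equation 3), stopping when i == K (Equation 4)."""
    K = len(A)
    S = [0] * (K + 1)  # assumes S_0 = 0
    i = 0
    while i != K:
        S[i + 1] = S[i] + A[i]
        i += 1
    return S


A = [1, 2, 3, 4]
print(prefix_sum_filtered(A))   # [0, 1, 3, 6, 10]
print(prefix_sum_iterative(A))  # [0, 1, 3, 6, 10]
```

Both produce the same tensor, but the filtered version performs on the order of $K^{2}$ additions while the iterative version performs $K$.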

II-C5 Cascades of Einsums

TeAAL [35] introduces the concept of cascades of Einsums, which expresses directed acyclic graphs (DAGs) of Einsum expressions as a sequence of sub-Einsums. One can view the unrolled iterative expression in Equation 3 as a cascade:

	$\displaystyle S_{1}$	$\displaystyle=S_{0}+A_{0}$
	$\displaystyle S_{2}$	$\displaystyle=S_{1}+A_{1}$
	$\displaystyle\vdots$
	$\displaystyle S_{K}$	$\displaystyle=S_{K-1}+A_{K-1}$

Finally, we use the EDGE Initialization label to specify computations that initialize tensors, which occur once. We use the EDGE Extended Einsum(s) label to specify the computation that occurs on each iteration of a cascade of Einsums [38] (see Einsum Cascade 5).

II-D Mapping

An Einsum specifies the computation, while a mapping indicates how computation occurs in space and time on an accelerator [10, 40]. Mapping specifications include aspects such as loop order, partitioning, and work scheduling (sequential vs. parallel operations) [35]. Throughout this paper, some mapping choices like partitioning are expressed directly in the cascade of Einsums (e.g., ranks $M1,M0$ result from partitioning the $M$ rank in Einsum Cascade 5).

To understand how mapping interacts with iterative ranks and Einsum cascades, we introduce the concept of an iteration space fibertree, or is-fibertree. The is-fibertree is a special tree where each fiber belongs to a rank in the iteration space of the Einsum.

II-E Tensor Algebra Accelerators

In recent years, the popularity of domain-specific tensor algebra accelerators has increased. A typical accelerator based on a spatial architecture consists of off-chip main memory, an on-chip shared global buffer, various scratchpads, and a 1D and/or 2D processing engine (PE) array where each PE contains compute units [57, 27, 26, 37, 10]. This design minimizes memory transfer latency while maximizing compute utilization [12, 8, 10, 26, 9]. Various tools enable the quick modeling and design space exploration of tensor algebra accelerators, including Timeloop [40] and Accelergy [51], GAMMA [56], and DOSA [23].

III Passes Performed by a Cascade of Einsums

Our first contribution is to demonstrate a novel analysis that can be applied using a cascade of Einsums. The key insight is that cascades of Einsums provide a precise description of the iteration space for each Einsum and the data space for each constituent tensor, enabling us to derive the algorithmic minimum live footprint for each tensor, with implications for the allowed fusion schedules and required buffer capacity/memory traffic. Because this analysis relies only on the cascade of Einsums, it holds for any choice of mapping.

III-A Calculating the Number of Passes

We will apply our analysis to attention in Section IV. To illustrate ideas, we first start with a simple pedagogical example, shown in Cascade 1.

Einsum Cascade 1: An example 2-pass cascade.


	$\displaystyle Y$	$\displaystyle=A_{k}\times B_{k}$		(5)
	$\displaystyle Z$	$\displaystyle=Y\times A_{k}$		(6)

Equation 5 performs a dot product between $A_{k}$ and $B_{k}$ , and Equation 6 multiplies the first equation’s result $Y$ by $A_{k}$ again to produce $Z$ . If we want to minimize data traffic of $A_{k}$ , we need to choose a dataflow for each Einsum that keeps $A_{k}$ stationary and fuses the two Einsums together. In other words, the dataflow must finish using the first element of $A_{k}$ before moving on to the next. However, such a dataflow does not exist for this cascade. Any implementation must visit every element of $A_{k}$ to compute $Y$ before it can revisit any element of $A_{k}$ to compute $Z$ .

We define a pass that a cascade performs over a particular fiber of a particular rank and tensor to be a traversal of every element of that fiber. Each time an element must be revisited after visiting every other element of that fiber, there is an additional pass. For example, Cascade 1 performs two passes over the $K$ rank of $A_{k}$ .

Since an Einsum’s iteration space can also be represented as a fibertree (i.e., an is-fibertree; see Section II-D), we extend our definition of an iteration space for a cascade of Einsums by considering its iteration space to be the sequence of the is-fibertrees for each Einsum. Now, consider a scenario where (1) fibers for a particular rank exist in multiple is-fibertrees, (2) in each is-fibertree, they project to the same tensor, and (3) there is a dependency such that all of the elements of the earlier is-fibertree’s fiber must be read before any element can be read again by the later is-fibertree (for all mappings of the cascade). We refer to such a read-read sequence as creating an additional pass. When there is a sequence of $N$ such read-read dependencies, we say the cascade is an $(N+1)$ -pass cascade. For our example, Cascade 1 requires two passes of the $K$ rank.
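To visualize the two passes of Cascade 1, the sketch below wraps $A$ in a small helper that records the order in which its coordinates are read. The class `CountingFiber` and the function `cascade1` are hypothetical names introduced for this illustration, and the loop structure is just one possible implementation.

```python
class CountingFiber:
    """Wraps a list and records the order in which coordinates are read."""

    def __init__(self, values):
        self.values = values
        self.trace = []

    def __getitem__(self, k):
        self.trace.append(k)
        return self.values[k]


def cascade1(A, B, K):
    Y = 0
    for k in range(K):  # pass 1 over A: build Y (Equation 5)
        Y += A[k] * B[k]
    Z = 0
    for k in range(K):  # pass 2 over A: needs the final Y (Equation 6)
        Z += Y * A[k]
    return Y, Z


A = CountingFiber([1, 2, 3])
Y, Z = cascade1(A, [4, 5, 6], K=3)
print(Y, Z, A.trace)  # 32 192 [0, 1, 2, 0, 1, 2]
```

The trace shows every coordinate of $A$ being read once before any coordinate is read a second time, which is the read-read dependency that makes this a 2-pass cascade.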

III-B Implications of the Number of Passes

The number of passes a cascade performs is relevant because it restricts possible fusion schedules. Einsums within a pass can be fused at will, producing and consuming a tile of the intermediate at a time. Einsums in different passes cannot be fused. Revisiting Cascade 1, Equations 5 and 6 cannot be fused on the $K$ rank. Any implementation must visit all elements of the $K$ fiber of $A$ to produce $Y$ before it can visit any of the elements of that fiber to produce $Z$ .

This analysis also provides a non-trivial lower bound on the tensors’ live footprints. For example, the algorithmic minimum live footprint for tensor $A$ is $K$ . In other words, an architecture must either have enough buffer space to hold an entire $K$ fiber of $A$ or spill and reload that fiber, incurring memory traffic proportional to the shape of $K$ . We note that this analysis is mapping independent. There is no dataflow for this cascade that enables a smaller live footprint.

III-C Reducing the Number of Passes via Reassociation

Given the restrictions that multi-pass cascades place on the allowed dataflows and tensor live footprints, it can be beneficial to manipulate the cascade to reduce the number of passes required. Crucially, these manipulations are functionally equivalent and only change how $Z$ is computed. In this section, we will present two methods for doing so, though we leave a full analysis of the space of pass-reduction approaches to future work.

III-C1 Deferring the Multiplication by $Y$

First, we recognize that, by the distributive property, Equation 6 can be factored to perform the reduction of $A_{k}$ first, before multiplying the result by $Y$ . Doing so, we get the following cascade:

Einsum Cascade 2: A reassociation of Cascade 1 that defers the $Y\times$ to compute $Z$ with 1 pass of the $K$ rank.


$\displaystyle Y$	$\displaystyle=A_{k}\times B_{k}$	(7)
$\displaystyle X$	$\displaystyle=A_{k}$	(8)
$\displaystyle Z$	$\displaystyle=Y\times X$	(9)

Now, because there is no read-after-write dependency between Equations 7 and 8, both Einsums can be included in the same pass. In fact, because Equation 8 reduces away the $K$ rank, Cascade 2 is a 1-pass cascade with respect to this rank. This reassociation actually provides a second benefit over Cascade 1: Equation 9 now only requires one multiplication (as opposed to $K$ multiplications in Equation 6).
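A minimal sketch of the reassociation follows, comparing a direct implementation of Cascade 1 against Cascade 2. The function names and the list-based inputs are assumptions for illustration.

```python
def cascade1_z(A, B):
    """Two passes over A: the second loop needs the final Y."""
    Y = sum(a * b for a, b in zip(A, B))
    return sum(Y * a for a in A)  # K multiplications by Y


def cascade2_z(A, B):
    """One pass over A: Y and X are accumulated together."""
    Y = 0
    X = 0
    for a, b in zip(A, B):  # each element of A is read once
        Y += a * b          # Equation 7
        X += a              # Equation 8
    return Y * X            # Equation 9: a single multiplication


A, B = [1, 2, 3], [4, 5, 6]
print(cascade1_z(A, B), cascade2_z(A, B))  # 192 192
```

Both functions return the same $Z$, but the second touches each element of $A$ only once and multiplies by $Y$ only once.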

III-C2 Iteratively Constructing $Y$ and $Z$

Einsum Cascade 3: A reassociation of Cascade 1 that iteratively constructs $Y$ and $Z$ with 1 pass of the $K$ rank.


Initialization:

	$\displaystyle RY_{i:i=0}=0$		(10)
	$\displaystyle RZ_{i:i=0}=0$		(11)

Extended Einsums:

$\displaystyle RY_{i+1}$	$\displaystyle=RY_{i}+A_{i}\times B_{i}$	(12)
$\displaystyle RZ_{i+1}$	$\displaystyle=RZ_{i}\times\frac{RY_{i+1}}{RY_{i}}+RY_{i+1}\times A_{i}$	(13)
$\displaystyle Z$	$\displaystyle=RZ_{K}$	(14)
	$\displaystyle\diamond:i\equiv K$	(15)

Alternatively, we can iteratively construct $Y$ and $Z$ as we perform the pass through $A_{k}$ . To do so, we will take a similar approach to the prefix-sum (see Sections II-C3 and II-C4) and build intermediate $Y$ s and $Z$ s.

	$\displaystyle RY_{i+1}$	$\displaystyle=A_{k:k\leq i}\times B_{k:k\leq i}$		(16)
	$\displaystyle RZ_{i+1}$	$\displaystyle=RY_{i+1}\times A_{k:k\leq i}$		(17)

Just like with the prefix-sum, this version requires a lot of extra compute, but, because $Y=RY_{K}$ and therefore $Z=RZ_{K}$ , the final result is the same.

We remove this extra work by making the $I$ ranks of $RY_{i+1}$ and $RZ_{i+1}$ iterative. This is shown in Cascade 3. Iterative $RY_{i+1}$ (Equation 12) looks very similar to the iterative prefix-sum. However, computing $RZ_{i+1}$ is a little more complicated. We start by introducing one more intermediate $S_{i}$ , which is the prefix-sum for $A_{k}$ :

	$\displaystyle S_{i}=A_{k:k\leq i-1}$		(18)

Now, we can combine Equations 17 and 18 to write $RZ_{i}$ in terms of this prefix-sum:

	$\displaystyle RZ_{i}=RY_{i}\times S_{i}$		(19)

Dividing both sides by $RY_{i}$ , we derive an alternate definition for $S_{i}$ :

	$\displaystyle S_{i}=\frac{RZ_{i}}{RY_{i}}$

$S_{i+1}$ can also be written using this alternative definition:

	$\displaystyle S_{i+1}=\frac{RZ_{i}}{RY_{i}}+A_{i}$		(20)

We can combine Equations 19 and 20 to compute $RZ_{i+1}$ in terms of $RZ_{i}$ (i.e., iteratively):

	$\displaystyle RZ_{i+1}=RY_{i+1}\times\left(\frac{RZ_{i}}{RY_{i}}+A_{i}\right)$

Distributing $RY_{i+1}$ and performing some reassociation, we get Equation 13.

Cascade 3 is also a 1-pass cascade, performing one pass of the $K$ rank of $A_{k}$ (indexed with the variable $i$ ) and iteratively building $RY_{i+1}$ and $RZ_{i+1}$ . Unfortunately, unlike Cascade 2, Cascade 3 does require extra compute over the original Cascade 1. However, memory bandwidth-limited workloads can afford to trade off extra compute for reduced memory traffic, and Cascade 3 may still provide benefit.
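The iterative construction in Cascade 3 can be sketched as a single loop over $A$. This is an illustration under stated assumptions: the function name `cascade3` is hypothetical, the guard on the division mirrors the divide merge operator (which only touches points with a non-zero divisor, so the first iteration contributes zero from the old $RZ$), and the sketch assumes every intermediate $RY_{i}$ for $i\geq 1$ is non-zero, since otherwise the running sum of $A$ cannot be recovered from $RZ_{i}/RY_{i}$.

```python
import math


def cascade3(A, B, K):
    RY = 0.0  # RY_0 (Equation 10)
    RZ = 0.0  # RZ_0 (Equation 11)
    for i in range(K):  # single pass over A
        RY_next = RY + A[i] * B[i]  # Equation 12
        # Divide only where the divisor is non-zero; otherwise contribute 0.
        ratio = RY_next / RY if RY != 0 else 0.0
        RZ = RZ * ratio + RY_next * A[i]  # Equation 13
        RY = RY_next
    return RZ  # Z = RZ_K (Equation 14)


A, B = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
Z = cascade3(A, B, K=3)
Y = sum(a * b for a, b in zip(A, B))
print(math.isclose(Z, Y * sum(A)))  # True
```

Each iteration adds a division and an extra multiplication compared to Cascade 2, which is the additional compute noted above.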

IV Taxonomizing Attention as Einsum Cascades

Our second contribution is to apply the cascade of Einsums abstraction and the notion of passes to transformer models to describe, taxonomize, and highlight trade-offs in the space of attention implementations. This section first looks at the transformer model as a whole, identifying attention as an important kernel (Section IV-A). We then give an overview of attention and a “straightforward” (but inefficient) algorithm for softmax by writing them as cascades of Einsums (Sections IV-B-IV-C). Finally, we describe how optimizations to softmax can be described by modifying the cascades and provide a taxonomy of the space using the number of passes required by each cascade (Sections IV-D-IV-E).

IV-A Transformers

Figure 1(a): Encoder architecture.

Transformer models generally follow the architecture defined in [48]. In this work, which addresses the impact of long sequence lengths during self-attention, we focus on the encoder architecture. Figure 1(a) gives an overview. The transformer first projects the input (by multiplying it by weight tensors) to form a query, key, and value. Self-attention is made up of three operations: a matrix multiplication of the query and key, a softmax on the result, and another matrix multiplication, which combines the softmax output with the value. The attention output is then deprojected (again, multiplying by a weight tensor), normalized, passed through a two-layer feed-forward neural network (FFN), and normalized once more.

As the sequence length grows, the relative importance of the different operations changes. Figure 1(b) shows that at shorter sequence lengths, the weight-times-activation “linear” layers are a larger fraction of the total required compute, while at long sequence lengths, the attention dominates. In all cases, the additional non-linearities (e.g., the normalization, the ReLU between the FFN layers, etc.) have negligible impact. In the next section, we focus on describing attention more precisely, and use our analysis to understand prior work on efficient implementations.

IV-B Redefining Attention’s “Matrix Multiplications”

In the original transformer paper [48], the kernel was described with the following equation:

	$\displaystyle \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$		(21)

However, this equation says almost nothing about what the inputs $Q$ , $K$ , and $V$ look like or what iteration space needs to be traversed. We clarify these points by rewriting Equation 21 as a cascade of Einsums, with the exception of the softmax, whose cascade we will explore in Section IV-C:

$\displaystyle QK_{m,p}$	$\displaystyle=\frac{1}{\sqrt{E}}\times Q_{e,p}\times K_{e,m}$	(22)
$\displaystyle A_{m,p}$	$\displaystyle=softmax(QK_{m,p})$	(23)
$\displaystyle AV_{f,p}$	$\displaystyle=A_{m,p}\times V_{f,m}$	(24)

Here, Equations 22 and 24 look like matrix multiplications. (In Equation 22, we also substitute $E$ for $d_{k}$ following the notation defined in Section II-B, where the shape of a rank is also its rank name.) Taking Equation 24 as an example, for each point in the iteration space $F\times M\times P$ , we perform a multiplication using elements from two 2-tensors ( $A_{m,p}$ and $V_{f,m}$ ) to produce a 2-tensor output ( $AV_{f,p}$ ), which requires reducing across the inputs’ shared rank $M$ .
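To show which ranks are traversed and reduced in Equations 22 and 24, the following sketch evaluates both Einsums on small nested lists indexed in the same rank order as the subscripts. The function names `qk_einsum` and `av_einsum`, the list layout, and the hand-picked values of $A$ are assumptions for illustration; the softmax of Equation 23 is deliberately left out.

```python
import math


def qk_einsum(Q, K, E, M, P):
    """QK[m][p] = (1/sqrt(E)) * Q[e][p] * K[e][m], reduced over e (Eq. 22)."""
    scale = 1.0 / math.sqrt(E)
    return [[sum(scale * Q[e][p] * K[e][m] for e in range(E))
             for p in range(P)] for m in range(M)]


def av_einsum(A, V, F, M, P):
    """AV[f][p] = A[m][p] * V[f][m], reduced over m (Eq. 24)."""
    return [[sum(A[m][p] * V[f][m] for m in range(M))
             for p in range(P)] for f in range(F)]


Q = [[1.0], [1.0], [1.0], [1.0]]  # shape E x P = 4 x 1
K = [[1.0, 2.0]] * 4              # shape E x M = 4 x 2
V = [[4.0, 8.0]]                  # shape F x M = 1 x 2
QK = qk_einsum(Q, K, E=4, M=2, P=1)
AV = av_einsum([[0.25], [0.75]], V, F=1, M=2, P=1)
print(QK, AV)  # [[2.0], [4.0]] [[7.0]]
```

The first Einsum reduces over the shared $E$ rank and the second over the shared $M$ rank, which is why both resemble matrix multiplications.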

Equations 22-24 can be modified to refer to the full batched, multi-head self-attention [48] by adding $B$ and $H$ ranks to all tensors. This changes the characteristics of the kernel. Adding the $B$ and $H$ ranks means that Equations 22 and 24 behave like many independent matrix multiplications instead of one monolithic matrix multiplication. The challenges with attention, described in Section I, follow clearly from this modification. Because all tensors contain a $B$ rank, the matrix multiplications are all unique to the specific batch’s inputs. Therefore, none of these tensors can be computed before the inputs are given, and there is no data sharing between the different elements in the batch. To simplify notation, we assume the presence of the $B$ and $H$ ranks but omit writing them throughout the rest of the paper.

IV-C Softmax as a Cascade of Einsums

We now apply the same precise notation to the softmax. A softmax [6] over a 1-tensor is traditionally expressed with the following equation:

$\displaystyle A_{m}=\frac{e^{I_{m}}}{\sum_{k}e^{I_{k}}}$	(25)

In the context of attention, this operation becomes two dimensional and can be expressed using the following cascade with input $QK_{m,p}$ :

$\displaystyle SN_{m,p}$	$\displaystyle=e^{QK_{m,p}}$	(26)
$\displaystyle SD_{p}$	$\displaystyle=SN_{m,p}$	(27)
$\displaystyle A_{m,p}$	$\displaystyle=SN_{m,p}/SD_{p}$	(28)

For each point in the iteration space ( $m$ , $p$ ), we exponentiate $QK_{m,p}$ to generate the softmax numerator ( $SN_{m,p}$ in Equation 26), reduce $SN_{m,p}$ with addition to produce the softmax denominator ( $SD_{p}$ in Equation 27), and finally, divide the numerator and denominator to produce the final result ( $A_{m,p}$ in Equation 28).

IV-C1 Improving Numerical Stability

Because $e^{QK_{m,p}}$ can easily become extremely large, the above formulation suffers from overflow. Therefore, practical implementations [3, 41] often prefer the numerically stable variant that replaces Equation 26 with:

	$\displaystyle GM_{p}$	$\displaystyle=QK_{m,p}::\bigvee_{m}\text{max}(\cup)$		(29)
	$\displaystyle SN_{m,p}$	$\displaystyle=e^{QK_{m,p}-GM_{p}}$		(30)

and drop the $\frac{1}{\sqrt{E}}$ term when computing $QK_{m,p}$ . (The $\frac{1}{\sqrt{E}}$ term was introduced to bound the magnitude of $SN_{m,p}$ [48]. Because the numerically stable softmax variant already accomplishes this, the scaling is often omitted [16, 15, 13].) To compute the global maximum $GM_{p}$ (where “global” refers to the maximum over the $M$ fiber), we reduce $QK_{m,p}$ with the operator max (instead of $+$ ). Notice that subtracting $GM_{p}$ from $QK_{m,p}$ in the exponent is equivalent to dividing by $e^{GM_{p}}$ , and because the $\frac{1}{e^{GM_{p}}}$ term appears in both the numerator ( $SN_{m,p}$ via Equation 30) and denominator ( $SD_{p}$ via Equation 27), the result ( $A_{m,p}$ ) stays the same. This construction improves numerical stability by bounding the values of the softmax numerator $SN_{m,p}$ to the range $(0,1]$ .
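
The effect of subtracting the maximum can be checked with a small Python sketch over a single $M$ fiber. The helper names `softmax_naive` and `softmax_stable` and the input values are assumptions for this example; the sketch uses double-precision floats from the standard `math` module, whereas hardware implementations typically use narrower formats that overflow much earlier.

```python
import math


def softmax_naive(qk):
    sn = [math.exp(x) for x in qk]       # Equation 26
    sd = sum(sn)                         # Equation 27
    return [n / sd for n in sn]          # Equation 28


def softmax_stable(qk):
    gm = max(qk)                         # Equation 29
    sn = [math.exp(x - gm) for x in qk]  # Equation 30: exponent is <= 0
    sd = sum(sn)                         # Equation 27
    return [n / sd for n in sn]          # Equation 28


small = [1.0, 2.0, 3.0]
a_naive = softmax_naive(small)
a_stable = softmax_stable(small)

# Same fiber shifted by a large constant: the naive cascade overflows,
# the stable cascade returns the same result as for the unshifted fiber.
large = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(large)
    overflowed = False
except OverflowError:
    overflowed = True
a_large = softmax_stable(large)
```

On inputs where both variants are representable, they agree up to rounding, which matches the cancellation argument above.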

IV-D Optimizing Softmax Compute

We now describe an optimization to attention that reduces compute requirements, specifically division. This optimization was used in FlashAttention-2 [15]. We point out that it can be applied more broadly, i.e., to any cascade we discuss in Section IV-E. Equation 28 requires $M\times P$ divisions. While this is the best we can do for an independent softmax, we note that attention does not use the softmax in isolation [48]. Instead, it subsequently multiplies the result, $A_{m,p}$ , and another tensor, $V_{f,m}$ , per Equation 24, reproduced here:

$\displaystyle AV_{f,p}=A_{m,p}\times V_{f,m}$

To optimize the full attention cascade, we can refactor Equations 28 and 24 by, instead, first combining $SN_{m,p}$ and $V_{f,m}$ (Equation 31) and reducing across the $M$ rank and then performing the division (Equation 32), as follows:

	$\displaystyle SNV_{f,p}$	$\displaystyle=SN_{m,p}\times V_{f,m}$		(31)
	$\displaystyle AV_{f,p}$	$\displaystyle=SNV_{f,p}/SD_{p}$		(32)

This reassociation does $F\times P$ divisions instead of $M\times P$ divisions. Since $M$ is the sequence length and $F$ is an embedding dimension (i.e., $M\gg F$ ), this reassociation reduces the required divisions (by a factor of $\frac{M}{F}$ ).
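
To make the division counts concrete, the following Python sketch computes attention both ways on a small, made-up problem and tallies the divisions. The function names, the shapes ($F=2$, $M=8$, $P=3$), and the input values are assumptions for illustration; it is a sketch of the reassociation, not the FlashAttention-2 implementation.

```python
import math


def stable_numerators(QK, M, p):
    """Equations 29-30 and 27 for one p: returns (SN fiber, SD)."""
    gm = max(QK[m][p] for m in range(M))
    sn = [math.exp(QK[m][p] - gm) for m in range(M)]
    return sn, sum(sn)


def divide_then_multiply(QK, V, F, M, P):
    AV = [[0.0] * P for _ in range(F)]
    divisions = 0
    for p in range(P):
        sn, sd = stable_numerators(QK, M, p)
        a = [n / sd for n in sn]                               # Equation 28
        divisions += M
        for f in range(F):
            AV[f][p] = sum(a[m] * V[f][m] for m in range(M))   # Equation 24
    return AV, divisions


def multiply_then_divide(QK, V, F, M, P):
    AV = [[0.0] * P for _ in range(F)]
    divisions = 0
    for p in range(P):
        sn, sd = stable_numerators(QK, M, p)
        for f in range(F):
            snv = sum(sn[m] * V[f][m] for m in range(M))       # Equation 31
            AV[f][p] = snv / sd                                # Equation 32
            divisions += 1
    return AV, divisions


F, M, P = 2, 8, 3
QK = [[0.1 * (m - p) for p in range(P)] for m in range(M)]
V = [[float((f + 1) * (m % 3)) for m in range(M)] for f in range(F)]
av_a, div_a = divide_then_multiply(QK, V, F, M, P)
av_b, div_b = multiply_then_divide(QK, V, F, M, P)
max_diff = max(abs(av_a[f][p] - av_b[f][p]) for f in range(F) for p in range(P))
print(div_a, div_b, max_diff)
```

With these shapes the first variant performs $M\times P=24$ divisions and the second $F\times P=6$, and the outputs agree up to floating-point rounding.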

IV-E Optimizing Softmax Live Footprint and Memory Traffic

3-pass	2-pass	1-pass
PyTorch [41]	TileFlow [57]	FlashAttention [16]
TensorFlow [3]	Choi et al. [13]	FlashAttention-2 [15]
FLAT [27]	-	-
E.T. [7]	-	-

TABLE I: Classifying prior attention algorithms.

We can also apply the analysis described in Section III to the efficient attention literature. We find that existing approaches to attention can be classified as either 3-pass, 2-pass, or 1-pass cascades, where an $N$ -pass cascade performs $N$ passes of a given $M$ fiber. See Table I. Next, we describe the key ideas of each.

IV-E1 3-Pass Attention Cascades

The 3-pass cascade is the straightforward, numerically stable cascade that we already discussed in Section IV-C1, namely Equations 29-30 followed by Equations 27-28, reproduced in Cascade 4 for clarity.

Einsum Cascade 4: The 3-pass attention cascade.

$\displaystyle GM_{p}$	$\displaystyle=QK_{m,p}::\bigvee_{m}\text{max}(\cup)$	/* Pass 1 */	(33)
$\displaystyle SN_{m,p}$	$\displaystyle=e^{QK_{m,p}-GM_{p}}$	/* Pass 2 */	(34)
$\displaystyle SD_{p}$	$\displaystyle=SN_{m,p}$		(35)
$\displaystyle A_{m,p}$	$\displaystyle=SN_{m,p}/SD_{p}$	/* Pass 3 */	(36)

In Pass 1, we compute $GM_{p}$ ; in Pass 2, we compute $SN_{m,p}$ and $SD_{p}$ ; and in Pass 3, we compute $A_{m,p}$ . Notice that we must finish an entire $M$ fiber of Equation 33 (reading an entire $M$ fiber of $QK_{m,p}$ ) before $GM_{p}$ is ready to start Equation 34 (where we must read the same $M$ fiber of $QK_{m,p}$ again). Similarly, we must finish an entire $M$ fiber of Equation 35 (reading an entire $M$ fiber of $SN_{m,p}$ ) before $SD_{p}$ is ready to start Equation 36 (where we must read the same $M$ fiber of $SN_{m,p}$ again). Regardless of the mapping (including fusion), this cascade must perform three passes, since they are a consequence of the dependencies between Einsums.
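
The dependency structure described above can be sketched in Python for a single $M$ fiber. Each loop below traverses the full fiber, and no loop can begin until the reduction before it has finished. The function name `softmax_3pass` and the explicit pass counter are assumptions added for illustration.

```python
import math


def softmax_3pass(qk_fiber):
    """Cascade 4 on one M fiber; returns (A fiber, number of passes)."""
    passes = 0

    # Pass 1 (Equation 33): GM is only known after the whole fiber is read.
    gm = -math.inf
    for x in qk_fiber:
        gm = max(gm, x)
    passes += 1

    # Pass 2 (Equations 34-35): re-read QK; SD is only known at the end.
    sn = []
    sd = 0.0
    for x in qk_fiber:
        n = math.exp(x - gm)
        sn.append(n)
        sd += n
    passes += 1

    # Pass 3 (Equation 36): re-read the SN fiber and divide by SD.
    a = []
    for n in sn:
        a.append(n / sd)
    passes += 1

    return a, passes


a, passes = softmax_3pass([1.0, 2.0, 3.0])
print(a, passes)
```

Fusing or reordering these loops does not remove a pass, because `gm` and `sd` are both full-fiber reductions that later Einsums consume.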

IV-E2 2-Pass Attention Cascades

We now briefly summarize the 2-pass cascade, deferring details due to space. Rather than computing the global max and then starting the softmax (as in the 3-pass cascade), the 2-pass cascade first partitions the input, computes a per-partition local max and applies it to form a variant of $SN_{m,p}$ whose elements are adjusted by the local max and likewise partitioned. Analogously, each partition gets a local denominator (also adjusted by the same local max). While this is occurring, it builds the global max from the local max values. Next, in a second pass, it uses the global max to correct the per-partition numerators and denominators and compute the softmax output.
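
Since the full 2-pass cascade is deferred, the following Python sketch is only one possible reading of the summary above, for a single $M$ fiber split into partitions of $M0$ elements. It is not the exact cascade of [57] or [13]; the function name `softmax_2pass` and all variable names are assumptions.

```python
import math


def softmax_2pass(qk_fiber, M0):
    """Two passes over one M fiber split into partitions of M0 elements."""
    assert len(qk_fiber) % M0 == 0
    M1 = len(qk_fiber) // M0

    # Pass 1: per-partition local max, local numerators, local denominator,
    # while the global max is built from the local max values.
    local_max, local_num, local_den = [], [], []
    gm = -math.inf
    for m1 in range(M1):
        part = qk_fiber[m1 * M0:(m1 + 1) * M0]
        lm = max(part)
        nums = [math.exp(x - lm) for x in part]
        local_max.append(lm)
        local_num.append(nums)
        local_den.append(sum(nums))
        gm = max(gm, lm)

    # Correction factors touch only M1 values, not the full fiber.
    corr = [math.exp(lm - gm) for lm in local_max]
    sd = sum(d * c for d, c in zip(local_den, corr))

    # Pass 2: correct the per-partition numerators and divide.
    a = []
    for m1 in range(M1):
        for n in local_num[m1]:
            a.append(n * corr[m1] / sd)
    return a


def softmax_reference(qk_fiber):
    gm = max(qk_fiber)
    sn = [math.exp(x - gm) for x in qk_fiber]
    sd = sum(sn)
    return [n / sd for n in sn]


fiber = [0.5, -1.0, 2.0, 3.5, 0.0, 1.5]
a_2pass = softmax_2pass(fiber, M0=2)
a_ref = softmax_reference(fiber)
max_diff = max(abs(x - y) for x, y in zip(a_2pass, a_ref))
```

On this input the sketch matches the 3-pass result up to rounding while traversing the fiber only twice.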

IV-E3 1-Pass Attention Cascades

Prior work proposes multiple different 1-pass cascades [16, 15] that take advantage of the reassociations presented in Section III-C; however, the main ideas are the same. First, modify the cascade to multiply the softmax numerator-times- $V$ and then compute the division (as described in Section IV-D). This reassociation combines the second and third passes of Cascade 4 (see Section III-C1). To ensure numerical stability, we cannot use this strategy to combine the first and second passes, so we instead use the iterative approach (see Section III-C2). Rather than using the per-partition local max to compute the local numerator and denominator, instead keep a running max that represents the max value seen so far. Each time a new running max is computed, adjust previous results (e.g., numerator-times- $V$ , denominator, etc.) with this max.

Next we describe FlashAttention-2’s 1-pass cascade (Cascade 5) because we use it to build FuseMax. Note, despite the evidently increased compute relative to the 3-pass cascade, we will carefully design a mapping in Section V to hide these overheads on a spatial architecture.

Einsum Cascade 5: A 1-pass attention cascade. Note that $M1$ is used as a standard rank (e.g., to access $BQK_{m1,m0,p}$ ) and as an iterative rank (e.g., to access $RM_{m1,p}$ ). Therefore, the stopping condition for all iterative ranks is $m1=M1+1$ (Equation 53).

Initialization:

$\displaystyle BQK_{m1,m0,p}$	$\displaystyle=QK_{m1\times M0+m0,p}$	(37)
$\displaystyle BV_{f,m1,m0}$	$\displaystyle=V_{f,m1\times M0+m0}$	(38)
$\displaystyle RM_{m1:m1=0,p}$	$\displaystyle=-\infty$	(39)
$\displaystyle RD_{m1:m1=0,p}$	$\displaystyle=0$	(40)
$\displaystyle RNV_{f,m1:m1=0,p}$	$\displaystyle=0$	(41)

Extended Einsums:

$\displaystyle LM_{m1,p}$	$\displaystyle=BQK_{m1,m0,p}::\bigvee_{m0}\text{max}(\cup)$	(42)
$\displaystyle RM_{m1+1,p}$	$\displaystyle=max(RM_{m1,p},LM_{m1,p})$	(43)
$\displaystyle SLN_{m1,m0,p}$	$\displaystyle=e^{BQK_{m1,m0,p}-RM_{m1+1,p}}$	(44)
$\displaystyle SLD_{m1,p}$	$\displaystyle=SLN_{m1,m0,p}$	(45)
$\displaystyle SLNV_{f,m1,p}$	$\displaystyle=SLN_{m1,m0,p}\times BV_{f,m1,m0}$	(46)
$\displaystyle PRM_{m1,p}$	$\displaystyle=e^{RM_{m1,p}-RM_{m1+1,p}}$	(47)
$\displaystyle SPD_{m1,p}$	$\displaystyle=RD_{m1,p}\times PRM_{m1,p}$	(48)
$\displaystyle RD_{m1+1,p}$	$\displaystyle=SLD_{m1,p}+SPD_{m1,p}$	(49)
$\displaystyle SPNV_{f,m1,p}$	$\displaystyle=RNV_{f,m1,p}\times PRM_{m1,p}$	(50)
$\displaystyle RNV_{f,m1+1,p}$	$\displaystyle=SLNV_{f,m1,p}+SPNV_{f,m1,p}$	(51)
$\displaystyle AV_{f,p}$	$\displaystyle=RNV_{f,M1,p}/RD_{M1,p}$	(52)
	$\displaystyle\diamond:m1\equiv M1+1$	(53)

We start by expressing the partitioning of both of the inputs $QK_{m,p}$ and $V_{f,m}$ into $M1$ chunks of $M0$ elements each (Equations 37-38). This allows us to perform operations like maximum on individual $M0$ fibers, rather than on the whole tensor (Equation 42). The problem is, of course, that the local maximum is not necessarily the same for all $M0$ fibers and so will not just cancel nicely like the global maximum.

We resolve this by using the running maximum ( $RM_{m1,p}$ ), i.e., the maximum of all inputs seen so far, instead of the local maximum. We recognize that $M1$ can also serve as an iterative rank, and iteratively build up $RM_{m1,p}$ . After initializing $RM_{0,p}$ to $-\infty$ (Equation 39), we compute a new running maximum $RM_{m1+1,p}$ using the running maximum computed in the previous iteration $RM_{m1,p}$ and the new local maximum $LM_{m1,p}$ (Equation 43).

We can now use the running maximum to compute a local numerator $SLN_{m1,m0,p}$ (Equation 44), a local denominator $SLD_{m1,p}$ (Equation 45), and even the dot product result $SLNV_{f,m1,p}$ (Equation 46) using the partitioned $BV_{f,m1,m0}$ (Equation 38).

Now consider the softmax denominator. Eventually, we would like to reduce $SLD_{m1,p}$ into a 0-tensor, but because its values may have been computed with different maximums, we cannot simply use addition. Instead, by introducing a new running denominator $RD_{m1,p}$ with iterative rank $M1$ , we can correct the old denominator $RD_{m1,p}$ to the new running maximum $RM_{m1+1,p}$ and then perform the addition. We must initialize the running denominator at the start of the computation to 0 (Equation 40). Then, at each point $m1$ , the correction factor $PRM_{m1,p}$ allows us to correct the previous running denominator $RD_{m1,p}$ with the new maximum (Equation 48). In other words, $RD_{m1,p}$ is downscaled by $e^{RM_{m1,p}}$ . $SPD_{m1,p}$ “switches” the downscaling factor on $RD_{m1,p}$ to $e^{RM_{m1+1,p}}$ by multiplying $RD_{m1,p}$ by $\frac{e^{RM_{m1,p}}}{e^{RM_{m1+1,p}}}$ ( $PRM_{m1,p}$ ). Once $SLD_{m1,p}$ and $SPD_{m1,p}$ have the same maximum, they can be combined to produce the new running denominator $RD_{m1+1,p}$ (Equation 49). We can do the same to compute the running numerator-times- $V$ (Equations 41, 50-51).

Finally, $AV_{f,p}$ can be computed by dividing the final numerator-times- $V$ by the final denominator. By construction, at this point, $RNV_{f,M1,p}$ and $RD_{M1,p}$ are both downscaled by the same maximum $RM_{M1,p}$ (conveniently, also the global maximum) and can be correctly combined.
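
The walkthrough of Cascade 5 can be sketched in plain Python for a single $p$. Each statement is annotated with the equation it mirrors. The function names, the list-based layout, and the test values are assumptions for illustration, and the loop simply runs over the $M1$ partitions rather than modeling the stopping condition explicitly.

```python
import math


def attention_1pass(qk, V, M1, M0):
    """Cascade 5 for a single p. qk has M1*M0 entries; V is indexed V[f][m]."""
    F = len(V)
    rm = -math.inf                                   # Equation 39
    rd = 0.0                                         # Equation 40
    rnv = [0.0] * F                                  # Equation 41
    for m1 in range(M1):                             # iterative rank M1
        lo, hi = m1 * M0, (m1 + 1) * M0
        bqk = qk[lo:hi]                              # Equation 37
        lm = max(bqk)                                # Equation 42
        rm_next = max(rm, lm)                        # Equation 43
        sln = [math.exp(x - rm_next) for x in bqk]   # Equation 44
        sld = sum(sln)                               # Equation 45
        prm = math.exp(rm - rm_next)                 # Equation 47 (0.0 while rm is -inf)
        rd = sld + rd * prm                          # Equations 48-49
        for f in range(F):
            bv = V[f][lo:hi]                         # Equation 38
            slnv = sum(n * v for n, v in zip(sln, bv))   # Equation 46
            rnv[f] = slnv + rnv[f] * prm             # Equations 50-51
        rm = rm_next
    return [x / rd for x in rnv]                     # Equation 52


def attention_3pass(qk, V):
    """Reference: Cascade 4 followed by Equation 24, for a single p."""
    gm = max(qk)
    sn = [math.exp(x - gm) for x in qk]
    sd = sum(sn)
    return [sum((n / sd) * v for n, v in zip(sn, row)) for row in V]


qk = [0.5, -1.0, 2.0, 3.5, 0.0, 1.5]
V = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
av_1pass = attention_1pass(qk, V, M1=3, M0=2)
av_3pass = attention_3pass(qk, V)
max_diff = max(abs(a - b) for a, b in zip(av_1pass, av_3pass))
print(av_1pass, max_diff)
```

Only `rm`, `rd`, and `rnv` persist across iterations, so the state carried between partitions does not grow with the sequence length.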

V Mapping Attention Onto A Spatial Array

Based on the framework from Section IV, we now describe FuseMax, an efficient mapping of an attention algorithm (specifically the 1-pass cascade in Cascade 5) to a spatial array-style architecture.

The goal when mapping a cascade onto hardware is to fully utilize all available compute units. In our evaluation of prior work (Figure 6 and Section VI-B), we observe that at short sequence lengths, the 2D PE array is under-utilized because it must wait for the 1D PE array to compute the softmax. At longer sequence lengths, both arrays are under-utilized since the workload becomes memory-bandwidth limited.

FuseMax’s mapping addresses these issues to achieve full utilization on both the 1D and 2D PE arrays. First, we decrease the compute performed by the 1D array by (1) applying the division reduction optimization (Section IV-D) and (2) sharing the other operations (sum/max/exp) between the 1D and 2D arrays. Similarly, we ensure that the workload is never memory-bandwidth limited by deeply fusing all Einsums in the cascade to restrict the live footprint to only what can be buffered on-chip. No matter the sequence length, our dataflow is never forced to spill any of its intermediates off-chip.

Architecture. We assume a standard spatial array-style architecture for our mapping. See Figure 2. We set parameters to match the cloud configuration in prior work [27].

Figure 3 shows the evolution of the 2D PE array architecture, from a fixed-dataflow multiply-accumulate TPU PE (Figure 3(a)) to a flexible-dataflow multiply-accumulate PE (Figure 3(b)) to a FuseMax PE (Figure 3(c)). Note, although both the 1D and 2D PE arrays in FuseMax perform exponentiation, we implement exponentiation with 6 sequential multiply-accumulate operations [36, 49] and therefore do not require a dedicated exponentiation unit.

Fusion and Partitioning. Prior attention accelerators [27, 57] explore fusing many of attention’s loop nests together. However, because these accelerators all use multi-pass cascades, the algorithmic minimum live footprint of some tensors (e.g., $QK_{m,p}$ ) is $O(M)$ , meaning that for long sequence lengths, intermediates cannot be buffered on chip.

FuseMax leverages fusion in conjunction with the 1-pass cascade to eliminate the memory traffic of these tensors, regardless of the sequence length. Specifically, we partition on both $M$ and $P$ (forming $M1,M0$ and $P2,P1,P0$ ), and maximally fuse all levels in the attention loopnest as shown in Mapping 1. That is, all Einsums in Cascade 5 are fused except for the last (which is fused to the rest only on $P2$ ).

Mapping 1: The FuseMax mapping as a loopnest. We partition on both $M$ and $P$ and map the innermost ranks $M0$ and $P0$ to the spatial array PEs. ComputeRNVTile forms a tile of $QK_{m,p}$ (i.e., $BQK_{m1,:,p2,p1,:}$ ) and then performs Equations 37-51 from Cascade 5. ComputeAVTile performs Equation 52. Note that each equation (Einsum) represents a loopnest: by writing all equations in ComputeRNVTile under a single loopnest, we mean that we are maximally fusing those loopnests. Outer loops over $B$ and $H$ (if performing batched multihead attention) are not shown.

for p2 ...:
  for m1 ...:
    for p1 ...:
      parallel_for p0 ...:
        parallel_for m0 ...:
          (RNV[:, m1 + 1, p2, p1, p0],
           RD[m1 + 1, p2, p1, p0]) =
              ComputeRNVTile(
                Q[:, p2, p1, p0],
                K[:, m1, m0], V[:, m1, m0])
  for p1 ...:
    parallel_for p0 ...:
      AV[:, p2, p1, p0] =
        ComputeAVTile(
          RNV[:, m1 + 1, p2, p1, p0],
          RD[m1 + 1, p2, p1, p0])
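
As a functional (not performance) illustration of the loop order in Mapping 1, the following Python sketch runs the same loops sequentially, with the `parallel_for` loops serialized. The function names, nested-list layout, and input values are assumptions for this example; the $\frac{1}{\sqrt{E}}$ scaling is omitted, and nothing here models the PE arrays, pipelining, or buffers.

```python
import math


def fusemax_loopnest(Q, K, V, M1, M0, P2, P1, P0):
    """Q[e][p], K[e][m], V[f][m] -> AV[f][p], using the loop order of Mapping 1."""
    E, F = len(Q), len(V)
    AV = [[0.0] * (P2 * P1 * P0) for _ in range(F)]
    for p2 in range(P2):
        # Running state for one p2 tile; its size does not depend on M.
        RM = [[-math.inf] * P0 for _ in range(P1)]
        RD = [[0.0] * P0 for _ in range(P1)]
        RNV = [[[0.0] * F for _ in range(P0)] for _ in range(P1)]
        for m1 in range(M1):
            ms = range(m1 * M0, (m1 + 1) * M0)   # m0 is a parallel_for in Mapping 1
            for p1 in range(P1):
                for p0 in range(P0):             # parallel_for in Mapping 1
                    p = (p2 * P1 + p1) * P0 + p0
                    # ComputeRNVTile: form the QK tile, then Equations 42-51.
                    bqk = [sum(Q[e][p] * K[e][m] for e in range(E)) for m in ms]
                    rm_next = max(RM[p1][p0], max(bqk))
                    sln = [math.exp(x - rm_next) for x in bqk]
                    prm = math.exp(RM[p1][p0] - rm_next)
                    RD[p1][p0] = sum(sln) + RD[p1][p0] * prm
                    for f in range(F):
                        slnv = sum(n * V[f][m] for n, m in zip(sln, ms))
                        RNV[p1][p0][f] = slnv + RNV[p1][p0][f] * prm
                    RM[p1][p0] = rm_next
        for p1 in range(P1):
            for p0 in range(P0):
                p = (p2 * P1 + p1) * P0 + p0
                for f in range(F):               # ComputeAVTile: Equation 52
                    AV[f][p] = RNV[p1][p0][f] / RD[p1][p0]
    return AV


def attention_unfused(Q, K, V):
    E, F, M, P = len(Q), len(V), len(K[0]), len(Q[0])
    AV = [[0.0] * P for _ in range(F)]
    for p in range(P):
        qk = [sum(Q[e][p] * K[e][m] for e in range(E)) for m in range(M)]
        gm = max(qk)
        sn = [math.exp(x - gm) for x in qk]
        sd = sum(sn)
        for f in range(F):
            AV[f][p] = sum(sn[m] * V[f][m] for m in range(M)) / sd
    return AV


E, F, M1, M0, P2, P1, P0 = 2, 2, 3, 2, 2, 2, 2
M, P = M1 * M0, P2 * P1 * P0
Q = [[0.1 * (e + 1) * (p - 3) for p in range(P)] for e in range(E)]
K = [[0.2 * (e + 1) * (m - 2) for m in range(M)] for e in range(E)]
V = [[float(f + m) for m in range(M)] for f in range(F)]
fused = fusemax_loopnest(Q, K, V, M1, M0, P2, P1, P0)
ref = attention_unfused(Q, K, V)
max_diff = max(abs(fused[f][p] - ref[f][p]) for f in range(F) for p in range(P))
```

Because the `m1` loop sits outside the `p1` loop, the running state for every $(p1,p0)$ in a $P2$ tile stays live across `m1` iterations, while no $O(M)$ intermediate is ever materialized.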

Parallelization and Spatial Reduction. While prior work implementing attention in hardware [27, 57] does utilize the 2D spatial array for the tensor products, it fails to do so for the softmax, choosing instead to use the 1D array. However, because there are far fewer total PEs in the 1D array than the 2D array, the softmax becomes a bottleneck. FuseMax improves utilization of the 2D spatial array by using it for both the tensor products and the exponentiation operator in the softmax. FuseMax parallelizes across the $M0$ and $P0$ ranks throughout the attention kernel (see Mapping 1). We set $M0\times P0=\#\;\mathrm{2D\;Array}\;\mathrm{PEs}$ . The large spatial reductions required when parallelizing across the $M0$ rank are easily handled by the low-cost inter-PE communication network.

Pipelining. The dependencies between different Einsums in our cascade necessitate fine-grain pipeline parallelism to achieve high utilization of both the 1D and 2D spatial arrays. Figure 4 shows the waterfall diagram for FuseMax in the steady state. Time is broken into epochs. Each epoch performs the same set of tile-granular operations at specific tile-relative coordinates (given by $a,b,c,d$ in the figure). Across all epochs, the kernel evaluates all tiles and each Einsum in Cascade 5 is mapped to either the 2D or 1D array for all epochs (as shown in the figure).

A major design consideration when pipelining the mapping is how to overcome the latency of fills and drains to/from the spatial array. Consider a tile of $QK_{m,p}$ of shape $M0\times P0$ . Per Equation 22, the iteration space to evaluate this tile is $E\times M0\times P0$ which becomes $E$ cycles on the spatial array. For the networks we evaluate, $E=64$ or $128$ . Assume $E=64$ . This means, assuming an output stationary dataflow, that while each PE performs 64 MACCs, it takes $\sim 256$ cycles to both fill and drain the spatial array. Without careful interleaving, this combination of parameters causes low utilization because, for example, the running max $RM_{m1+1,p1,:}$ cannot be computed until a tile of $QK_{m1,:,p1,:}$ is completed and spatially reduced (drained) to form the local max $LM_{m1,p1,:}$ (Equations 42-43).
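
To give a rough sense of scale, the following toy calculation assumes that a tile's compute and its fill/drain do not overlap at all and treats the quoted $\sim 256$ cycles as the total fill-plus-drain overhead per tile. Both are simplifying assumptions made only for this sketch; the function `naive_utilization` is hypothetical and is not part of the Timeloop model used in the evaluation.

```python
def naive_utilization(compute_cycles, fill_drain_cycles):
    """Fraction of time spent computing if fill/drain is never overlapped."""
    return compute_cycles / (compute_cycles + fill_drain_cycles)


FILL_DRAIN = 256  # approximate cycles to fill and drain the array (from the text)
util_64 = naive_utilization(64, FILL_DRAIN)    # E = 64
util_128 = naive_utilization(128, FILL_DRAIN)  # E = 128
print(util_64, util_128)
```

Under these assumptions the array would compute for only a small fraction of each tile's time, which is why the fill and drain latency needs to be hidden by interleaving.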

We address the above issues with two levels of interleaving. First, we interleave the construction of dependent tiles across epochs. This is reminiscent of software pipelining. For example, in Figure 4 the $d$ -th tile of $BQK$ and $LM$ are completed in Epoch $i$ (as they correspond to a fill followed by a drain and can be easily pipelined). The $RM$ (which has to wait for the drain) for tile $d$ takes place in a later epoch. Instead, Epoch $i$ computes an earlier tile’s running maximum $RM[c]$ .

Second, we interleave the construction of certain tiles within an epoch at a fine (e.g., cycle-by-cycle) granularity. See the notation ‘ $A|B$ ’ in Figure 4. This is to ensure high utilization of both the 2D and 1D PE arrays at all times. To make this clearer, Figure 5 shows the start-up and steady-state interleaving of $SLNV$ and $BQK$ in the 2D array and $SPNV$ and $RNV$ in the 1D array. In each cycle, a given PE in the 2D array computes a value for either $BQK$ or $SLNV$ and this alternates cycle by cycle. Each neighbor-neighbor link in the array is active in every cycle, carrying data for one of the two operation types. By interleaving $SLNV$ with $BQK$ , the 1D PEs can concurrently compute $SPNV$ and $RNV$ .

Putting everything together, as Section VI-B will show, the above enables high utilization of all 2D and 1D array PEs.

VI Evaluation

In this section, we demonstrate how the FuseMax dataflow achieves improvements in both performance and energy relative to the state of the art, for both attention and the end-to-end transformer inference.

VI-A Experimental Set-Up

First, we present the experimental set-up details common to all following subsections.

Workloads. We evaluate all accelerators and configurations using the same transformer models used by FLAT [27]: BERT-Base [18] (BERT), TrXL-wt103 [14] (TrXL), T5-small [46] (T5), and XLM [29]. We omit FlauBERT [31] because it uses the same hyperparameters as TrXL. We also note that though T5 is an encoder-decoder model, we only evaluate the encoder in this work. Following FLAT, we use a batch size $B=64$ for all evaluations.

Modeling with Timeloop and Accelergy. We perform our evaluation using two tools for tensor algebra accelerator modeling and design space exploration: Timeloop [40] and Accelergy [51]. We use these tools to build models of the accelerator architectures at a 45nm technology node and evaluate each Einsum individually. Results from individual Einsums are combined using heuristics presented in prior work for evaluating full cascades [35]. Together, these tools allow us to evaluate execution time, energy, and area for all our designs. We perform floating-point division using the design in Xia et al. [54], scaled down to a 45nm technology node [51].

Unfused Baseline. We build the unfused baseline by combining the costs of three phases: $QK$ (Equation 22), the 3-pass softmax (Cascade 4), and $AV$ (Equation 24). Because this baseline is unfused, each phase can be scheduled independently, but the phases proceed sequentially and require outputs to be written to memory between phases. We use Timeloop to search for efficient mappings to perform $QK$ and $AV$ . Additionally, we model the softmax for the unfused baseline by allowing the accelerator to load the $M$ fibers of the input on-chip one-by-one (spilling if there is not enough space) before performing the compute. We model the memory traffic, compute, and energy required to perform all Einsums required for attention.

FLAT Baseline. Our main baseline is the state-of-the-art attention accelerator FLAT [27]. Though we started with the FLAT authors’ original code, we found and corrected a number of bugs. Through private correspondence with the FLAT authors, we verified the bugs were indeed bugs. We also discovered a couple of larger conceptual errors, which the authors told us to avoid by restricting FLAT to only search through configurations without these issues.

Beyond correcting the FLAT codebase, we created and validated a Timeloop model that reproduces the FLAT authors’ (corrected) code to within $1\%$ error. However, the FLAT codebase does not model the cost to perform the softmax. Specifically, their model ignores the cost of data transfers (between any levels of the memory hierarchy) and uses $2^{30}$ 1D PEs. When comparing FuseMax and FLAT in this work, we augment our Timeloop model to model softmax correctly per the 3-pass cascade implicitly assumed by FLAT.

Hardware parameters. Figure 2 shows the selected hardware parameters. We chose the PE array dimension to match FLAT’s cloud accelerator and the global buffer capacity by normalizing the area. Also following FLAT, we use a 940 MHz frequency. We use Accelergy to model the area of both designs and find that FuseMax is 17% smaller.

VI-B Evaluating Attention

We now evaluate FuseMax to demonstrate the benefits it provides on the attention kernel by comparing it to the two baselines.

Utilization. Figure 6(a) shows the utilization of the 1D PE array when performing attention. We see that, because fused dataflows (FLAT / FuseMax) do not have to wait for the whole $QK$ Einsum to complete to begin the softmax, they achieve high utilization. While FLAT’s utilization drops for sequence lengths $\geq 256\text{K}$ —it becomes memory bandwidth limited because it must spill the $QK$ and $A$ tensors to memory—FuseMax achieves full utilization for all sequence lengths.

Similarly, Figure 6(b) shows the utilization of the 2D PE array. Because of the large amount of compute required for the softmax, both baselines achieve very poor utilization of this array. On the other hand, at long sequence lengths, FuseMax achieves almost 100% utilization. We observe that both baselines do achieve slightly higher utilization on XLM, which can be attributed to the higher intensity caused by a larger embedding dimension ( $E$ / $F$ ).

Speedup. Figure 7 shows that FuseMax achieves an average speedup of $10\times$ over the unfused baseline and $6.7\times$ over FLAT. We note FuseMax achieves lower speedup on XLM only because the baselines are able to achieve higher utilization of the 2D array on this transformer (Figure 6(b)).

Energy. Figure 8 shows that FuseMax uses $77\%$ of the energy of the unfused baseline and $79\%$ of the energy of FLAT. (FLAT reports larger energy savings over the unfused baseline because it only reports energy associated with DRAM traffic during the tensor products.) The energy use of the unfused baseline and FLAT is dominated by the DRAM access energy, the global buffer access energy, and the $QK$ and $AV$ (Equations 22 and 24) compute energy. FuseMax achieves its energy savings by significantly reducing the DRAM access energy.

VI-C Evaluating Transformer Inference

To evaluate the benefits of FuseMax on end-to-end transformer inference, we include the other required linear layers (Section IV-A). We use Timeloop to search for optimal mappings for these linear layers and use the same mappings for all three accelerator configurations. The attention modeling remains the same as Section VI-B.

Speedup. Figure 9 shows the performance improvement achieved by FuseMax. Across the sequence lengths tested, FuseMax achieves an average speedup of $7.6\times$ over the unfused baseline and $5.3\times$ over FLAT. As discussed in Section IV-A, as sequence length grows, attention becomes a larger fraction of the total required compute. Therefore, at 1M tokens, FuseMax achieves an average $10\times$ speedup over the unfused baseline and $7.5\times$ speedup over FLAT.

Energy. Figure 10 shows the energy reduction achieved by FuseMax. Here, we see similar results: as attention becomes a larger fraction of the kernel, the energy reduction increases. FuseMax uses $82\%$ of the unfused baseline’s energy and $83\%$ of FLAT’s energy during end-to-end inference.

VII Related Work

Spatial architectures have been applied successfully to a variety of domains in academia [10, 11, 43, 39] and industry [26, 4]. Beyond FLAT [27] (discussed in the main body of the paper), TileFlow [57] is a framework for modeling and searching for efficient fused dataflows (including for attention) on spatial architectures. Though TileFlow does explore a broader space of dataflows than FLAT, even implementing the 2-pass softmax cascade (Section IV-E2), its dataflows remain softmax-compute limited.

Quantization and sparsity have also been successfully applied to reduce the transformer inference compute and live footprint. We view these schemes as complementary to our work. GPTQ [21], AWQ [32], and LLM.int8() [17] quantize model weights to 4 or 8 bits without significant accuracy degradation. Outlier-aware quantization schemes like GOBO [55] and OliVe [22] quantize both weights and activations to a low-bit precision on specific hardware designs. SpAtten [49] prunes entire tokens and heads, while Sanger [34] and DOTA [44] use quantized or low-rank projected $Q$ and $K$ tensors to estimate which values of $QK$ and $A$ can be safely pruned. All of these algorithms are expressible as cascades of Einsums, and therefore, may be combined with FuseMax to improve performance and energy efficiency, though we leave their specification and implementation to future work.

VIII Conclusion

This paper advanced the state of the art in spatial accelerator design for transformer inference. To do so, we expressed attention and its variants as cascades of Einsums. We used these cascades to reason about attention’s characteristics, independent of its mapping/scheduling. Using these principles, we proposed FuseMax—an accelerator that uses deep fusion and fine-grain pipelining to map attention onto a spatial architecture. FuseMax achieves $\sim 100\%$ utilization of both PE arrays, demonstrating $6.7\times$ speedup over the prior state-of-the-art (FLAT) using $79\%$ of the energy on attention and $5.3\times$ speedup over FLAT using $83\%$ of the energy on end-to-end inference.

Our work shows that cascades of Einsums provide a powerful abstraction for representing and analyzing domain-specific kernels. Future work may explore their application to other attention variants (e.g., those exploiting quantization and sparsity) or even other domains (e.g., fully homomorphic encryption, scientific computing, relational algebra, etc.). Doing so enables mapping-agnostic analysis and may elucidate previously undiscovered cascades and schedules for these algorithms.

References

[1] “Our next-generation model: Gemini 1.5,” https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window.
[2] “Tensor network contractions,” ser. Lecture Notes in Physics, vol. 964. Springer Cham, 2020.
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[4] AWS. (2024) Trainium architecture. [Accessed April 16, 2024]. [Online]. Available: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium.html
[5] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
[6] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in NATO Neurocomputing, 1989. [Online]. Available: https://api.semanticscholar.org/CorpusID:59636530
[7] S. Chen, S. Huang, S. Pandey, B. Li, G. R. Gao, L. Zheng, C. Ding, and H. Liu, “E.T.: Re-thinking self-attention for transformer models on GPUs,” in SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
[8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” ACM Sigplan Notices, vol. 49, no. 4, pp. 269–284, 2014.
[9] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, “A survey of accelerator architectures for deep neural networks,” Engineering, vol. 6, no. 3, pp. 264–274, 2020.
[10] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in ISCA’16.
[11] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks,” 2018.
[12] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A machine-learning supercomputer,” in MICRO’14.
[13] J. Choi, H. Li, B. Kim, S. Hwang, and J. H. Ahn, “Accelerating transformer networks through recomposing softmax layers,” in 2022 IEEE International Symposium on Workload Characterization (IISWC), 2022, pp. 92–103.
[14] A. Conneau and G. Lample, “Cross-lingual language model pretraining,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf
[15] T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” 2023.
[16] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” 2022.
[17] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” ArXiv, vol. abs/2208.07339, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:251564521
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in North American Chapter of the Association for Computational Linguistics, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:52967399
[19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
[20] I. S. Duff, M. A. Heroux, and R. Pozo, “An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum,” ACM Trans. Math. Softw., vol. 28, no. 2, pp. 239–267, 2002. [Online]. Available: https://doi.org/10.1145/567806.567810
[21] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training compression for generative pretrained transformers,” arXiv preprint arXiv:2210.17323, 2022.
[22] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. ACM, Jun. 2023. [Online]. Available: http://dx.doi.org/10.1145/3579371.3589038
[23] C. Hong, Q. Huang, G. Dinh, M. Subedar, and Y. S. Shao, “DOSA: Differentiable model-based one-loop search for DNN accelerators,” in 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. IEEE, Oct. 2023, pp. 209–224.
[24] O. Hsu, M. Strange, R. Sharma, J. Won, K. Olukotun, J. S. Emer, M. A. Horowitz, and F. Kjølstad, “The sparse abstract machine,” in ASPLOS’23, 2023.
[25] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, pp. 3451–3460, Oct. 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3122291
[26] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in ISCA ’17.
[27] S.-C. Kao, S. Subramanian, G. Agrawal, A. Yazdanbakhsh, and T. Krishna, “Flat: An optimized dataflow for mitigating attention bottlenecks,” ser. ASPLOS 2023. New York, NY, USA: Association for Computing Machinery, 2023, pp. 295–310. [Online]. Available: https://doi.org/10.1145/3575693.3575747
[28] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, “Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO. ACM, 2019, pp. 754–768.
[29] G. Lample and A. Conneau, “Cross-lingual language model pretraining,” ArXiv, vol. abs/1901.07291, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:58981712
[30] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear algebra subprograms for fortran usage,” ACM Trans. Math. Softw., vol. 5, no. 3, pp. 308–323, 1979. [Online]. Available: https://doi.org/10.1145/355841.355847
[31] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, and D. Schwab, “Flaubert: Unsupervised language model pre-training for french,” CoRR, vol. abs/1912.05372, 2019. [Online]. Available: http://arxiv.org/abs/1912.05372
[32] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for llm compression and acceleration,” in MLSys, 2024.
[33] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002.
[34] L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, and Y. Liang, “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:239012114
[35] N. Nayak, T. O. Odemuyiwa, S. Ugare, C. Fletcher, M. Pellauer, and J. Emer, “Teaal: A declarative framework for modeling sparse tensor accelerators,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 1255–1270. [Online]. Available: https://doi.org/10.1145/3613424.3623791
[36] P. Nilsson, A. U. R. Shaik, R. Gangarajaiah, and E. Hertz, “Hardware implementation of the exponential function using taylor series,” in 2014 NORCHIP. IEEE, Oct. 2014, pp. 1–4. [Online]. Available: https://doi.org/10.1109/NORCHIP.2014.7004740
[37] T. O. Odemuyiwa, H. Asghari-Moghaddam, M. Pellauer, K. Hegde, P.-A. Tsai, N. Crago, A. Jaleel, J. D. Owens, E. Solomonik, J. Emer, and C. Fletcher, “Accelerating sparse data orchestration via dynamic reflexive tiling,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’23, vol. 3, Mar. 2023, pp. 18–32.
[38] T. O. Odemuyiwa, J. S. Emer, and J. D. Owens, “The EDGE language: Extended general einsums for graph algorithms,” CoRR, vol. abs/2404.11591, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2404.11591
[39] A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, “Efficient spatial processing element control via triggered instructions,” IEEE Micro, vol. 34, no. 3, pp. 120–137, 2014.
[40] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to dnn accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315.
[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: an imperative style, high-performance deep learning library. Red Hook, NY, USA: Curran Associates Inc., 2019.
[42] M. Pellauer, J. Clemons, V. Balaji, N. C. Crago, A. Jaleel, D. Lee, M. O’Connor, A. Parashar, S. Treichler, P. Tsai, S. W. Keckler, and J. S. Emer, “Symphony: Orchestrating sparse and dense tensors with hierarchical heterogeneous processing,” ACM Transactions on Computing Systems, vol. 41, pp. 4:1–4:30, 2023. [Online]. Available: https://doi.org/10.1145/3630007
[43] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable architecture for parallel patterns,” SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 389–402, Jun. 2017. [Online]. Available: http://doi.acm.org/10.1145/3140659.3080256
[44] Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, and Y. Xie, “Dota: detect and omit weak attentions for scalable transformer acceleration,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 14–26. [Online]. Available: https://doi.org/10.1145/3503222.3507738
[45] A. Radford and K. Narasimhan, “Improving language understanding by generative pre-training,” 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:49313245
[46] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 1, Jan. 2020.
[47] V. Sze, Y. Chen, T. Yang, and J. S. Emer, Efficient Processing of Deep Neural Networks, ser. Synthesis Lectures on Computer Architecture. Springer, 2020.
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[49] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, Feb. 2021. [Online]. Available: http://dx.doi.org/10.1109/HPCA51647.2021.00018
[50] J. Won, C. Hong, C. Mendis, J. Emer, and S. Amarasinghe, “Unified convolution framework: A compiler-based approach to support sparse convolutions,” in MLSys’23, 2023.
[51] Y. N. Wu, J. S. Emer, and V. Sze, “Accelergy: An architecture-level energy estimation methodology for accelerator designs,” in ICCAD’19, 2019.
[52] Y. N. Wu, P. Tsai, S. Muralidharan, A. Parashar, V. Sze, and J. S. Emer, “HighLight: Efficient and flexible DNN acceleration with hierarchical structured sparsity,” in IEEE/ACM International Symposium on Microarchitecture, ser. MICRO. ACM, Oct. 2023, pp. 1106–1120. [Online]. Available: https://doi.org/10.1145/3613424.3623786
[53] Y. N. Wu, P.-A. Tsai, A. Parashar, V. Sze, and J. S. Emer, “Sparseloop: An analytical approach to sparse tensor accelerator modeling,” in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Oct. 2022, pp. 1377–1395. [Online]. Available: https://doi.org/10.1109/MICRO56248.2022.00096
[54] J. Xia, W. Fu, M. Liu, and M. Wang, “Low-latency bit-accurate architecture for configurable precision floating-point division,” Applied Sciences, vol. 11, no. 11, 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/11/4988
[55] A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Oct. 2020. [Online]. Available: http://dx.doi.org/10.1109/MICRO50266.2020.00071
[56] G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, “Gamma: Leveraging gustavson’s algorithm to accelerate sparse matrix multiplication,” in ASPLOS’21.
[57] S. Zheng, S. Chen, S. Gao, L. Jia, G. Sun, R. Wang, and Y. Liang, “Tileflow: A framework for modeling fusion dataflow via tree-based analysis,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 1271–1288. [Online]. Available: https://doi.org/10.1145/3613424.3623792
