Local Search For Fast Matrix Multiplication
Institute for Formal Models and Verification, J. Kepler University Linz, Austria
1 Introduction
⋆ Supported by NSF grant CCF-1813993 and AFRL Award FA8750-15-2-0096.
⋆⋆ Supported by the Austrian FWF grants P31571-N32 and F5004.
⋆⋆⋆ Supported by the Austrian FWF grant NFN S11408-N23 and the LIT AI Lab funded by the State of Upper Austria.

Strassen observed 50 years ago that C can also be computed with only 7 multiplications [15]. His scheme proceeds in two steps. In the first step he introduces auxiliary variables M1, . . . , M7 which are defined as the product of certain linear combinations of the entries of A and B. In the second step the entries of C are obtained as certain linear combinations of the Mi:
M1 = (a11 + a22 )(b11 + b22 ) c11 = M1 + M4 − M5 + M7
M2 = (a21 + a22 )(b11 ) c12 = M3 + M5
M3 = (a11 )(b12 − b22 ) c21 = M2 + M4
M4 = (a22 )(b21 − b11 ) c22 = M1 − M2 + M3 + M6
M5 = (a11 + a12 )(b22 )
M6 = (a21 − a11 )(b11 + b12 )
M7 = (a12 − a22 )(b21 + b22 )
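As a quick sanity check, the scheme can be run exactly as stated. The sketch below (plain Python; the function names are ours) computes a 2 × 2 product with the seven Mi and compares it against the naive formula.

```python
# A direct check of Strassen's scheme: 7 multiplications for a 2x2 product.
def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    M1 = (a11 + a22) * (b11 + b22)
    M2 = (a21 + a22) * b11
    M3 = a11 * (b12 - b22)
    M4 = a22 * (b21 - b11)
    M5 = (a11 + a12) * b22
    M6 = (a21 - a11) * (b11 + b12)
    M7 = (a12 - a22) * (b21 + b22)
    return [[M1 + M4 - M5 + M7, M3 + M5],
            [M2 + M4, M1 - M2 + M3 + M6]]

def naive_2x2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == naive_2x2(A, B) == [[19, 22], [43, 50]]
```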
Recursive application of this scheme gave rise to the first algorithm for multi-
plying arbitrary n × n matrices in subcubic complexity. Winograd [16] showed
that Strassen’s scheme is optimal in the sense that there does not exist a sim-
ilar scheme with fewer than 7 multiplications, and de Groote [7] showed that
Strassen’s scheme is essentially unique.
Less is known about 3 × 3 matrices. The naive scheme requires 27 multiplications, and in 1976 Laderman [10] found one with 23. Similarly to Strassen,
he defines M1 , . . . , M23 as products of certain linear combinations of the entries
of A and B. The entries of C = AB are then obtained as linear combinations
of M1 , . . . , M23 . It is not known whether 23 is optimal (the best lower bound
is 19 [2]). It is known however that Laderman’s scheme is not unique. A small
number of intrinsically different schemes have been found over the years. Of par-
ticular interest are schemes in which all coefficients in the linear combinations
are +1, −1, or 0. The only four such schemes (up to equivalence) we are aware
of are due to Laderman, Smirnov [14], Oh et al. [12], and Courtois et al. [6].
While Smirnov and Oh et al. found their multiplication schemes with computer-based search using non-linear numerical optimization methods, Courtois found his multiplication scheme using a SAT solver. This is also what we do here. We present two approaches which allowed us to generate more than 13,000 mutually inequivalent new matrix multiplication schemes for 3 × 3 matrices, using altogether about 35 CPU years. We believe that the new schemes are
of interest to the matrix multiplication community. We therefore make them
publicly available in various formats and grouped by invariants at
http://www.algebra.uni-linz.ac.at/research/matrix-multiplication/.
Comparing the coefficients of all terms $a_{i_1i_2}b_{j_1j_2}c_{k_1k_2}$ in the equations $c_{ij} = \sum_k a_{ik}b_{kj}$ leads to the polynomial equations
$$\sum_{\ell=1}^{23} \alpha^{(\ell)}_{i_1i_2}\,\beta^{(\ell)}_{j_1j_2}\,\gamma^{(\ell)}_{k_1k_2} \;=\; \delta_{i_2j_1}\,\delta_{i_1k_1}\,\delta_{j_2k_2}$$
for $i_1, i_2, j_1, j_2, k_1, k_2 \in \{1, 2, 3\}$. These 729 cubic equations with 621 variables are also known as Brent equations [4]. The $\delta_{uv}$ on the right are Kronecker deltas, i.e., $\delta_{uv} = 1$ if $u = v$ and $\delta_{uv} = 0$ otherwise. Each solution of the system of these
equations corresponds to a matrix multiplication scheme. The equations become
slightly more symmetric if we flip the indices of the γij , and since this is the
variant mostly used in the literature, we will also adopt it from now on.
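To make the indexing concrete, one can check the analogous equations for the 2 × 2 case, where Strassen's seven products take the place of the 23 summands. The sketch below (the dictionary encoding of the coefficients is ours) verifies all 64 equations; it uses the equation exactly as displayed above, i.e., before the γ indices are flipped.

```python
from itertools import product

# Brent equations for the 2x2 case (7 products), checked against Strassen's
# coefficients as read off from M1..M7 and c11..c22 above (1-based indices).
alpha = [{(1, 1): 1, (2, 2): 1}, {(2, 1): 1, (2, 2): 1}, {(1, 1): 1},
         {(2, 2): 1}, {(1, 1): 1, (1, 2): 1}, {(2, 1): 1, (1, 1): -1},
         {(1, 2): 1, (2, 2): -1}]
beta = [{(1, 1): 1, (2, 2): 1}, {(1, 1): 1}, {(1, 2): 1, (2, 2): -1},
        {(2, 1): 1, (1, 1): -1}, {(2, 2): 1}, {(1, 1): 1, (1, 2): 1},
        {(2, 1): 1, (2, 2): 1}]
# gamma[l][(k1, k2)] is the coefficient of M_{l+1} in c_{k1 k2}.
gamma = [{} for _ in range(7)]
for (k1, k2), combo in [((1, 1), {1: 1, 4: 1, 5: -1, 7: 1}),
                        ((1, 2), {3: 1, 5: 1}),
                        ((2, 1), {2: 1, 4: 1}),
                        ((2, 2), {1: 1, 2: -1, 3: 1, 6: 1})]:
    for l, coef in combo.items():
        gamma[l - 1][(k1, k2)] = coef

# All 64 Brent equations: sum_l alpha*beta*gamma = delta*delta*delta.
for i1, i2, j1, j2, k1, k2 in product((1, 2), repeat=6):
    lhs = sum(alpha[l].get((i1, i2), 0) * beta[l].get((j1, j2), 0)
              * gamma[l].get((k1, k2), 0) for l in range(7))
    rhs = (i2 == j1) * (i1 == k1) * (j2 == k2)
    assert lhs == rhs, (i1, i2, j1, j2, k1, k2)
```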
Another view on the Brent equations is as follows. View the $\alpha^{(\ell)}_{i_1i_2}$, $\beta^{(\ell)}_{j_1j_2}$, and $\gamma^{(\ell)}_{k_1k_2}$ as variables, as before, and regard $a_{i_1i_2}$, $b_{j_1j_2}$, $c_{k_1k_2}$ as polynomial indeterminates. Then the task consists of instantiating the variables in such a way that
$$\sum_{\ell=1}^{23}\bigl(\alpha^{(\ell)}_{11}a_{11}+\cdots\bigr)\bigl(\beta^{(\ell)}_{11}b_{11}+\cdots\bigr)\bigl(\gamma^{(\ell)}_{11}c_{11}+\cdots\bigr)\;=\;\sum_{i=1}^{3}\sum_{j=1}^{3}\sum_{k=1}^{3}a_{ij}b_{jk}c_{ki}$$
holds as an equation of polynomials in the indeterminates $a_{i_1i_2}$, $b_{j_1j_2}$, $c_{k_1k_2}$. Expanding the
left hand side and equating coefficients leads to the Brent equations as stated
before (but with indices of γ flipped, as agreed). In other words, expanding
the left hand side, all terms have to cancel except for the terms on the right.
We found it convenient to say that a term $a_{i_1i_2}b_{j_1j_2}c_{k_1k_2}$ has “type m” if $m = \delta_{i_2j_1} + \delta_{j_2k_1} + \delta_{k_2i_1}$. With this terminology, all terms of types 0, 1, 2 have to
cancel each other, and all terms of type 3 have to survive. Note that since all 27
type 3 terms must be produced by the 23 summands on the left, some summands
must produce more than one type 3 term.
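The type counts can be tabulated directly. The following sketch confirms that exactly 27 of the 729 terms have type 3.

```python
from itertools import product
from collections import Counter

# Type of a term a_{i1 i2} b_{j1 j2} c_{k1 k2}:
#   m = delta(i2, j1) + delta(j2, k1) + delta(k2, i1).
types = Counter((i2 == j1) + (j2 == k1) + (k2 == i1)
                for i1, i2, j1, j2, k1, k2 in product((1, 2, 3), repeat=6))

assert sum(types.values()) == 729   # one type per term
assert types[3] == 27               # the type 3 terms, which must survive
assert types[0] == 216              # types 0, 1, 2 must cancel
```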
For solving the Brent equations with a SAT solver, we use Z2 as coefficient domain, so that multiplication translates into ‘and’ and addition translates into ‘xor’. When, for example, the variable $\alpha^{(\ell)}_{i_1i_2}$ is true in a solution of the corresponding SAT instance, this indicates that the term $a_{i_1i_2}$ occurs in $M_\ell$, and likewise for the b-variables. If $\gamma^{(\ell)}_{k_1k_2}$ is true, this means that $M_\ell$ appears in the linear combination for $c_{k_1k_2}$. We call $\alpha^{(\ell)}_{i_1i_2}$, $\beta^{(\ell)}_{j_1j_2}$, and $\gamma^{(\ell)}_{k_1k_2}$ the base variables.
In order to bring the Brent equations into CNF, we use Tseitin transforma-
tion, i.e., we introduce definitions for subformulas to avoid exponential blow-up.
To keep the number of fresh variables low, we do not introduce one new variable
for every cube but only for pairs of literals, i.e., we encode a cube (α ∧ β ∧ γ) as
u ↔ (α ∧ β) and v ↔ (u ∧ γ). In this way, we can reuse u. Furthermore, a sum
v1 ⊕ · · · ⊕ vm with m ≥ 4 is encoded by w ↔ (v1 ⊕ v2 ⊕ v3 ) and v4 ⊕ · · · ⊕ vm ⊕ w,
with the latter sum being encoded recursively. This encoding seems to require the
smallest sum of the number of variables and the number of clauses—a commonly
used optimality heuristic. The scripts we used are available at
https://github.com/marijnheule/matrix-challenges/tree/master/src.
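The two devices just described can be sketched at the clause level. The minimal model below is ours (variable numbering, function names, and the brute-force check are not taken from the linked scripts): pairwise AND definitions for cubes, and chunked XOR definitions for long sums, checked to be equivalence-preserving on the original variables.

```python
from itertools import product

class Fresh:
    """Counter handing out fresh CNF variables (positive integers)."""
    def __init__(self, n):
        self.n = n
    def __call__(self):
        self.n += 1
        return self.n

def and_gate(a, b, fresh):
    """Clauses for u <-> (a AND b); returns (u, clauses)."""
    u = fresh()
    return u, [[-u, a], [-u, b], [u, -a, -b]]

def encode_cube(lits, fresh):
    """Chain a cube through pairwise AND gates,
    e.g. u <-> (alpha AND beta), then v <-> (u AND gamma)."""
    u, clauses = lits[0], []
    for lit in lits[1:]:
        u, cls = and_gate(u, lit, fresh)
        clauses += cls
    return u, clauses

def xor_def(w, vs):
    """Clauses for w <-> (v1 XOR v2 XOR v3), blocking each bad assignment."""
    out = []
    for bits in product((0, 1), repeat=len(vs) + 1):
        *vbits, wbit = bits
        if wbit != sum(vbits) % 2:
            out.append([v if b == 0 else -v for v, b in zip(vs, vbits)]
                       + [w if wbit == 0 else -w])
    return out

def xor_constraint(vs, rhs, fresh):
    """CNF for v1 XOR ... XOR vm = rhs: define w <-> (v1 XOR v2 XOR v3),
    continue with v4 XOR ... XOR vm XOR w; short tails asserted directly."""
    clauses = []
    while len(vs) > 3:
        w = fresh()
        clauses += xor_def(w, vs[:3])
        vs = vs[3:] + [w]
    for bits in product((0, 1), repeat=len(vs)):
        if sum(bits) % 2 != rhs:
            clauses.append([v if b == 0 else -v for v, b in zip(vs, bits)])
    return clauses

def projected_models(cnf, n_orig, n_total):
    """Brute-force models of cnf, projected onto the first n_orig variables."""
    models = set()
    for assign in product((0, 1), repeat=n_total):
        def val(lit):
            return assign[lit - 1] if lit > 0 else 1 - assign[-lit - 1]
        if all(any(val(l) for l in cl) for cl in cnf):
            models.add(assign[:n_orig])
    return models

u, cls = encode_cube([1, 2, 3], Fresh(3))
assert u == 5 and len(cls) == 6  # two AND gates, three clauses each

# The XOR encoding preserves exactly the odd-parity assignments:
fresh = Fresh(5)
cnf = xor_constraint([1, 2, 3, 4, 5], 1, fresh)
assert projected_models(cnf, 5, fresh.n) == \
    {a for a in product((0, 1), repeat=5) if sum(a) % 2 == 1}
```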
The generation of new schemes proceeds in several steps, with SAT solving being
the first and main step. If the SAT solver finds a solution, we next check whether
it is equivalent to any known or previously found solution modulo de Groote’s
symmetry group [7]. If so, we discard it. Otherwise, we next try to simplify the
new scheme by searching for an element in its orbit which has a smaller number
of terms. The scheme can then be used to initiate a new search. In the fourth
step, we use Gröbner bases to lift the scheme from the coefficient domain Z2 to Z.
Finally, we cluster large sets of similar solutions into parameterized families.
In the present paper, we give a detailed description of the first step in this
workflow. The subsequent steps use algebraic techniques unrelated to SAT and
will be described in [9].
3 Solving Methods
The core of a scheme is the pairing of the type 3 terms. Our first method focuses on finding schemes with new cores, while our second method searches for schemes that are similar to an existing one and generally have the same core. For all experiments we used the local search SAT solver yalsat [1], as this solver performed
best on instances from this application. We also tried solving these instances
using CDCL solvers, but the performance was disappointing. We observed a
possible explanation: The runtime of CDCL solvers tends to be exponential in
the average backtrack level (ABL) on unsatisfiable instances. For most formu-
las arising from other applications, ABL is small (< 50), while on the matrix
multiplication instances ABL is large (> 100).
Two of the known schemes, those of Smirnov [14] and Courtois et al. [6], have
the property that each type 3 term occurs exactly once and at most two type 3
terms occur in the same summand. We decided to use this pattern to search for
new schemes: randomly pair four type 3 terms and assign the remaining type 3
terms to the other 19 summands. Only in very rare cases could a random pairing be extended to a valid scheme in reasonable time, say a few minutes. In the other
cases it is not known whether the pairing cannot be extended to a valid scheme
or whether finding such a scheme is very hard.
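The pairing step above can be sketched as follows; the representation of a type 3 term by its three index pairs is ours.

```python
import random

# The 27 type 3 terms a_{i1 i2} b_{i2 j2} c_{j2 i1}, written as their three
# index pairs; the free indices i1, i2, j2 each range over {1, 2, 3}.
terms = [((i1, i2), (i2, j2), (j2, i1))
         for i1 in (1, 2, 3) for i2 in (1, 2, 3) for j2 in (1, 2, 3)]

# Randomly pair four type 3 terms; the remaining 19 terms are assigned
# to one summand each, for 23 summands in total.
random.shuffle(terms)
pairs = [terms[2 * i:2 * i + 2] for i in range(4)]
singles = [[t] for t in terms[8:]]
core = pairs + singles

assert len(core) == 23 and sum(len(s) for s in core) == 27
```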
Since the number of random pairings that could be extended to a valid scheme
was very low, we tried adding streamlining constraints [8] to the formulas. A streamlining constraint is a set of clauses that guides the solver to a solution, but these clauses need not be (and generally are not) implied by the formula. Streamlining
constraints are usually patterns observed in solutions of a given problem, poten-
tially of smaller sizes. We experimented with various streamlining constraints,
such as enforcing that each type 0, type 1, and type 2 term occurs either zero
times or twice in a scheme (instead of an even number of times). The most effec-
tive streamlining constraint that we came up with was observed in the Smirnov
scheme: for each summand that is assigned a single type 3 term, enforce that (i)
one matrix has either two rows, two columns or a row and a column fully assigned
to zero and (ii) another matrix has two rows and two columns assigned to zero,
i.e., the matrix has a single nonzero entry. This streamlining constraint reduced
the runtime from minutes to seconds. Yet some random pairings may only be
extended to a valid scheme that does not satisfy the streamlining constraint.
Fig. 1. Two neighboring schemes with 19 identical summands and 4 different ones.
In contrast, some of the schemes found using the first method have a large
neighborhood. We approximated the size of the neighborhood of a scheme using
the following experiment: Start with a given scheme S and find a neighboring
scheme by randomly fixing 2/3 of the base variables. Once a neighboring scheme
S ′ is found, find a neighboring scheme of S ′ , etc. We ran this experiment on a
machine with 48 cores of the Lonestar 5 cluster of the Texas Advanced Computing Center. We started each experiment using 48 threads, with each thread assigned
a different seed. Figure 2 shows the number of different schemes (after sorting)
found in 1000 seconds when starting with one of the four known schemes and a
scheme that was computed from the streamlining method. Some of these different
schemes are new, while others are equivalent to each other or to known ones. We only guarantee here that they are not identical after sorting the summands.
Observe that the number of different schemes found in 1000 seconds depends a lot on the starting scheme. No different neighboring scheme was found for Laderman’s scheme, only 9 different schemes were found for the scheme of Courtois et al., 94 different schemes were found for the scheme of Oh et al., 561 new schemes were found for Smirnov’s scheme, and 3359 schemes were found using a randomly selected new scheme obtained with the streamlining method.

Fig. 2. The number of different schemes (vertical axis in logscale) found within a period of time (horizontal axis in seconds) during a random walk in the neighborhood of a given scheme. [Plot omitted; one curve each for the Streamlining, Smirnov, Oh et al., Courtois et al., and Laderman starting schemes.]
In view of the large number of solutions we found, it is also interesting to
compare them with each other. For example, if we define the support of a solution
as the number of base variables set to 1, we observe that the support seems to
follow a normal distribution with mean around 160, see Fig. 3 for a histogram.
We can also see that the Laderman scheme differs in many ways from all the other solutions. It is, for example, the only scheme whose core consists of four quadruples of type 3 terms. In 89% of the solutions, the core consists of four pairs of type 3 terms, about 10% of the solutions have three pairs and one quadruple, and less than 1% of the schemes have cores of the form 2-2-2-2-3 or 2-2-2-3-4.
5 Challenges
The many thousands of new schemes that we found may still be just the tip
of the iceberg. However, we also observed that the state-of-the-art SAT solving
techniques are unable to answer several other questions. This section provides
four challenges for SAT solvers, in increasing order of difficulty. For each challenge we
constructed one or more formulas that are available at
https://github.com/marijnheule/matrix-challenges.
The challenges are hard, but they may be doable in the coming years.
[Fig. 3 omitted: histogram of the support of the solutions; horizontal axis: support (roughly 140–180), vertical axis: count (0–800).]
Challenge 1: Local search without streamlining. Our first method combines ran-
domly pairing the type 3 terms with streamlining constraints. The latter were required to limit the search. We expect that local search solvers can be optimized to efficiently solve the formulas without the streamlining constraints. This may result in schemes that are significantly different from the ones we found. We prepared ten satisfiable formulas with hardcoded pairings of type 3
terms. Five of these formulas can be solved using yalsat in a few minutes. All
of these formulas appear hard for CDCL solvers (and many local search solvers).