
Local Search for Fast Matrix Multiplication

Marijn J.H. Heule1⋆, Manuel Kauers2⋆⋆, and Martina Seidl3⋆⋆⋆

1 Department of Computer Science, The University of Texas at Austin, United States
2 Institute for Algebra, J. Kepler University Linz, Austria
3 Institute for Formal Models and Verification, J. Kepler University Linz, Austria

⋆ Supported by NSF grant CCF-1813993 and AFRL Award FA8750-15-2-0096.
⋆⋆ Supported by the Austrian FWF grants P31571-N32 and F5004.
⋆⋆⋆ Supported by the Austrian FWF grant NFN S11408-N23 and the LIT AI Lab funded by the State of Upper Austria.

arXiv:1903.11391v2 [cs.LO] 19 Aug 2019

Abstract. In 1976, Laderman discovered a scheme for computing the product of two 3 × 3 matrices using only 23 multiplications. Since then, several more such schemes have been proposed, but nobody knows how many there are or whether schemes with fewer than 23 multiplications exist. In this paper we present two independent SAT-based methods for finding new schemes using 23 multiplications. Both methods allow computing a few hundred new schemes individually, and many thousands when combined. Local search SAT solvers consistently outperform CDCL solvers in this application.

1 Introduction

Matrix multiplication is a fundamental operation with applications in nearly any area of science and engineering. However, after more than 50 years of work on matrix multiplication techniques (see, e.g., [5,13,3,11]), the complexity of matrix multiplication is still a mystery. Even for small matrices, the problem is not completely understood, and understanding these cases better can provide valuable hints towards more efficient algorithms for large matrices.
The naive way for computing the product C of two 2 × 2 matrices A, B requires 8 multiplications:

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}$$

Strassen observed 50 years ago that C can also be computed with only 7 multi-
plications [15]. His scheme proceeds in two steps. In the first step he introduces
auxiliary variables M1 , . . . , M7 which are defined as the product of certain linear
combinations of the entries of A and B. In the second step the entries of C are
obtained as certain linear combinations of the $M_i$:

$$\begin{aligned}
M_1 &= (a_{11} + a_{22})(b_{11} + b_{22}) & c_{11} &= M_1 + M_4 - M_5 + M_7\\
M_2 &= (a_{21} + a_{22})\,b_{11} & c_{12} &= M_3 + M_5\\
M_3 &= a_{11}\,(b_{12} - b_{22}) & c_{21} &= M_2 + M_4\\
M_4 &= a_{22}\,(b_{21} - b_{11}) & c_{22} &= M_1 - M_2 + M_3 + M_6\\
M_5 &= (a_{11} + a_{12})\,b_{22}\\
M_6 &= (a_{21} - a_{11})(b_{11} + b_{12})\\
M_7 &= (a_{12} - a_{22})(b_{21} + b_{22})
\end{aligned}$$
Recursive application of this scheme gave rise to the first algorithm for multi-
plying arbitrary n × n matrices in subcubic complexity. Winograd [16] showed
that Strassen’s scheme is optimal in the sense that there does not exist a sim-
ilar scheme with fewer than 7 multiplications, and de Groote [7] showed that
Strassen’s scheme is essentially unique.
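Strassen's scheme is easy to check mechanically. The following sketch (Python with numpy; an illustration of ours, not part of the original paper) computes the seven products above and compares the result with the naive product:

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 multiplications (Strassen's scheme)."""
    M1 = (A[0, 0] + A[1, 1]) * (B[0, 0] + B[1, 1])
    M2 = (A[1, 0] + A[1, 1]) * B[0, 0]
    M3 = A[0, 0] * (B[0, 1] - B[1, 1])
    M4 = A[1, 1] * (B[1, 0] - B[0, 0])
    M5 = (A[0, 0] + A[0, 1]) * B[1, 1]
    M6 = (A[1, 0] - A[0, 0]) * (B[0, 0] + B[0, 1])
    M7 = (A[0, 1] - A[1, 1]) * (B[1, 0] + B[1, 1])
    return np.array([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

A = np.random.randint(-9, 10, (2, 2))
B = np.random.randint(-9, 10, (2, 2))
assert np.array_equal(strassen_2x2(A, B), A @ B)
```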
Less is known about 3 × 3 matrices. The naive scheme requires 27 multiplications, and in 1976 Laderman [10] found one with 23. Similarly to Strassen,
he defines M1 , . . . , M23 as products of certain linear combinations of the entries
of A and B. The entries of C = AB are then obtained as linear combinations
of M1 , . . . , M23 . It is not known whether 23 is optimal (the best lower bound
is 19 [2]). It is known however that Laderman’s scheme is not unique. A small
number of intrinsically different schemes have been found over the years. Of par-
ticular interest are schemes in which all coefficients in the linear combinations
are +1, −1, or 0. The only four such schemes (up to equivalence) we are aware
of are due to Laderman, Smirnov [14], Oh et al. [12], and Courtois et al. [6].
While Smirnov and Oh et al. found their multiplication schemes with computer-
based search using non-linear numerical optimization methods, Courtois found
his multiplication scheme using a SAT solver. This is also what we do here. We
present two approaches which allowed us to generate more than 13,000 mutu-
ally inequivalent new matrix multiplication schemes for 3 × 3 matrices, using
about 35 CPU years in total. We believe that the new schemes are
of interest to the matrix multiplication community. We therefore make them
publicly available in various formats and grouped by invariants at
http://www.algebra.uni-linz.ac.at/research/matrix-multiplication/.

2 Encoding and Workflow


To search for multiplication schemes of 3 × 3 matrices having the above form, we define the $M_\ell$ as products of linear combinations of all entries of $A$ and $B$ with undetermined coefficients $\alpha^{(\ell)}_{ij}$, $\beta^{(\ell)}_{ij}$:

$$\begin{aligned}
M_1 &= (\alpha^{(1)}_{11} a_{11} + \cdots + \alpha^{(1)}_{33} a_{33})(\beta^{(1)}_{11} b_{11} + \cdots + \beta^{(1)}_{33} b_{33})\\
&\;\;\vdots\\
M_{23} &= (\alpha^{(23)}_{11} a_{11} + \cdots + \alpha^{(23)}_{33} a_{33})(\beta^{(23)}_{11} b_{11} + \cdots + \beta^{(23)}_{33} b_{33})
\end{aligned}$$

Similarly, we define the $c_{ij}$ as linear combinations of the $M_\ell$ with undetermined coefficients $\gamma^{(\ell)}_{ij}$:

$$c_{11} = \gamma^{(1)}_{11} M_1 + \cdots + \gamma^{(23)}_{11} M_{23}, \quad \ldots, \quad c_{33} = \gamma^{(1)}_{33} M_1 + \cdots + \gamma^{(23)}_{33} M_{23}$$

Comparing the coefficients of all terms $a_{i_1 i_2} b_{j_1 j_2} c_{k_1 k_2}$ in the equations $c_{ij} = \sum_k a_{ik} b_{kj}$ leads to the polynomial equations

$$\sum_{\ell=1}^{23} \alpha^{(\ell)}_{i_1 i_2}\, \beta^{(\ell)}_{j_1 j_2}\, \gamma^{(\ell)}_{k_1 k_2} = \delta_{i_2 j_1}\, \delta_{i_1 k_1}\, \delta_{j_2 k_2}$$

for $i_1, i_2, j_1, j_2, k_1, k_2 \in \{1, 2, 3\}$. These 729 cubic equations with 621 variables are also known as Brent equations [4]. The $\delta_{uv}$ on the right are Kronecker deltas, i.e., $\delta_{uv} = 1$ if $u = v$ and $\delta_{uv} = 0$ otherwise. Each solution of the system of these equations corresponds to a matrix multiplication scheme. The equations become slightly more symmetric if we flip the indices of the $\gamma_{ij}$, and since this is the variant mostly used in the literature, we will also adopt it from now on.
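As a quick illustration, the Brent equations can be checked mechanically for any candidate scheme. The following sketch (Python with numpy; the array layout is our own choice, not from the paper) represents a scheme as three coefficient arrays of shape 23 × 3 × 3 and verifies all 729 equations over the integers; for the $\mathbb{Z}_2$ setting used later, one would compare modulo 2 instead:

```python
import itertools
import numpy as np

def brent_ok(alpha, beta, gamma):
    """Check all 729 Brent equations for coefficient arrays of shape
    (23, 3, 3), using the (unflipped) index convention displayed above.
    With flipped gamma indices, the right-hand side would read
    (i2 == j1) and (j2 == k1) and (k2 == i1) instead."""
    for i1, i2, j1, j2, k1, k2 in itertools.product(range(3), repeat=6):
        lhs = np.sum(alpha[:, i1, i2] * beta[:, j1, j2] * gamma[:, k1, k2])
        rhs = int(i2 == j1 and i1 == k1 and j2 == k2)
        if lhs != rhs:
            return False
    return True
```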
Another view on the Brent equations is as follows. View the $\alpha^{(\ell)}_{i_1 i_2}$, $\beta^{(\ell)}_{j_1 j_2}$, $\gamma^{(\ell)}_{k_1 k_2}$ as variables, as before, and regard $a_{i_1 i_2}$, $b_{j_1 j_2}$, $c_{k_1 k_2}$ as polynomial indeterminates. Then the task consists of instantiating the variables in such a way that

$$\sum_{\ell=1}^{23} (\alpha^{(\ell)}_{11} a_{11} + \cdots)(\beta^{(\ell)}_{11} b_{11} + \cdots)(\gamma^{(\ell)}_{11} c_{11} + \cdots) = \sum_{i=1}^{3} \sum_{j=1}^{3} \sum_{k=1}^{3} a_{ij} b_{jk} c_{ki}$$

holds as an equation of polynomials in the indeterminates $a_{i_1 i_2}$, $b_{j_1 j_2}$, $c_{k_1 k_2}$. Expanding the left-hand side and equating coefficients leads to the Brent equations as stated before (but with the indices of $\gamma$ flipped, as agreed). In other words, after expanding the left-hand side, all terms have to cancel except for the terms on the right. We found it convenient to say that a term $a_{i_1 i_2} b_{j_1 j_2} c_{k_1 k_2}$ has “type $m$” if $m = \delta_{i_2 j_1} + \delta_{j_2 k_1} + \delta_{k_2 i_1}$. With this terminology, all terms of types 0, 1, and 2 have to cancel each other, and all terms of type 3 have to survive. Note that since all 27 type 3 terms must be produced by the 23 summands on the left, some summands must produce more than one type 3 term.
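For illustration, here is the type computation as a sketch (Python; ours, not from the paper), together with a check that exactly 27 terms have type 3:

```python
import itertools

def term_type(i1, i2, j1, j2, k1, k2):
    """Type of the term a_{i1 i2} b_{j1 j2} c_{k1 k2}, i.e.
    delta(i2, j1) + delta(j2, k1) + delta(k2, i1)."""
    return (i2 == j1) + (j2 == k1) + (k2 == i1)

# The type 3 terms are exactly the 27 terms a_{ij} b_{jk} c_{ki}.
type3 = [t for t in itertools.product((1, 2, 3), repeat=6) if term_type(*t) == 3]
assert len(type3) == 27
```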
For solving the Brent equations with a SAT solver, we use $\mathbb{Z}_2$ as coefficient domain, so that multiplication translates into ‘and’ and addition translates into ‘xor’. When, for example, the variable $\alpha^{(\ell)}_{i_1 i_2}$ is true in a solution of the corresponding SAT instance, this indicates that the term $a_{i_1 i_2}$ occurs in $M_\ell$, and likewise for the b-variables. If $\gamma^{(\ell)}_{k_1 k_2}$ is true, this means that $M_\ell$ appears in the linear combination for $c_{k_1 k_2}$. We call the $\alpha^{(\ell)}_{i_1 i_2}$, $\beta^{(\ell)}_{j_1 j_2}$, and $\gamma^{(\ell)}_{k_1 k_2}$ the base variables.
In order to bring the Brent equations into CNF, we use the Tseitin transformation, i.e., we introduce definitions for subformulas to avoid exponential blow-up. To keep the number of fresh variables low, we do not introduce one new variable for every cube but only for pairs of literals, i.e., we encode a cube $(\alpha \wedge \beta \wedge \gamma)$ as $u \leftrightarrow (\alpha \wedge \beta)$ and $v \leftrightarrow (u \wedge \gamma)$. In this way, we can reuse $u$. Furthermore, a sum $v_1 \oplus \cdots \oplus v_m$ with $m \geq 4$ is encoded by $w \leftrightarrow (v_1 \oplus v_2 \oplus v_3)$ and $v_4 \oplus \cdots \oplus v_m \oplus w$, with the latter sum being encoded recursively. This encoding seems to require the smallest sum of the number of variables and the number of clauses, a commonly used optimality heuristic. The scripts we used are available at

https://github.com/marijnheule/matrix-challenges/tree/master/src.
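To make the two encoding rules concrete, here is a minimal sketch (Python; the helper names and the variable numbering are our own, not taken from the scripts linked above):

```python
from itertools import product

clauses = []
counter = [621]   # variables 1..621 are the base variables (our numbering)

def new_var():
    counter[0] += 1
    return counter[0]

def define_and(x, y):
    """Fresh u with u <-> (x and y); u can be shared by all cubes
    containing the same pair (x, y)."""
    u = new_var()
    clauses.extend([[-u, x], [-u, y], [u, -x, -y]])
    return u

def define_cube(a, b, g):
    """Encode the cube (a and b and g) as u <-> (a and b), v <-> (u and g)."""
    return define_and(define_and(a, b), g)

def xor_cnf(lits, parity):
    """Clauses forcing lits[0] xor ... xor lits[k-1] = parity (0 or 1):
    block every sign pattern of the wrong parity."""
    for signs in product((1, -1), repeat=len(lits)):
        if signs.count(-1) % 2 != parity:
            clauses.append([s * x for s, x in zip(signs, lits)])

def encode_sum(vs, parity):
    """v1 xor ... xor vm = parity via w <-> (v1 xor v2 xor v3) and a
    recursive encoding of w xor v4 xor ... xor vm, as described above."""
    vs = list(vs)
    while len(vs) > 3:
        w = new_var()
        xor_cnf(vs[:3] + [w], 0)   # w <-> (v1 xor v2 xor v3)
        vs = [w] + vs[3:]
    xor_cnf(vs, parity)
```

Each Brent equation then becomes one call to encode_sum over the literals returned by define_cube, with parity 1 for the 27 type 3 positions and parity 0 otherwise.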

The generation of new schemes proceeds in several steps, with SAT solving being
the first and main step. If the SAT solver finds a solution, we next check whether
it is equivalent to any known or previously found solution modulo de Groote’s
symmetry group [7]. If so, we discard it. Otherwise, we next try to simplify the
new scheme by searching for an element in its orbit which has a smaller number
of terms. The scheme can then be used to initiate a new search. In the fourth
step, we use Gröbner bases to lift the scheme from the coefficient domain Z2 to Z.
Finally, we cluster large sets of similar solutions into parameterized families.

[Workflow: known schemes → solve → filter → simplify → lift → cluster → new schemes]

In the present paper, we give a detailed description of the first step in this
workflow. The subsequent steps use algebraic techniques unrelated to SAT and
will be described in [9].

3 Solving Methods

The core of a scheme is the pairing of its type 3 terms. Our first method focuses on finding schemes with new cores, while our second method searches for schemes that are similar to an existing one and generally have the same core. For all experiments we used the local search SAT solver yalsat [1], as this solver performed
best on instances from this application. We also tried solving these instances
using CDCL solvers, but the performance was disappointing. We observed a
possible explanation: The runtime of CDCL solvers tends to be exponential in
the average backtrack level (ABL) on unsatisfiable instances. For most formu-
las arising from other applications, ABL is small (< 50), while on the matrix
multiplication instances ABL is large (> 100).

3.1 Random Pairings of Type 3 Terms and Streamlining

Two of the known schemes, those of Smirnov [14] and Courtois et al. [6], have
the property that each type 3 term occurs exactly once and at most two type 3
terms occur in the same summand. We decided to use this pattern to search for
new schemes: randomly form four pairs of type 3 terms and assign the remaining 19 type 3 terms to the other 19 summands. Only in very rare cases could such a random pairing be extended to a valid scheme in reasonable time, say a few minutes. In the other cases, it is not known whether the pairing cannot be extended to a valid scheme or whether finding such a scheme is just very hard.
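A sketch of the pairing step (Python; the representation of type 3 terms as triples $(i, j, k)$ for $a_{ij} b_{jk} c_{ki}$ is our own):

```python
import itertools
import random

# The 27 type 3 terms a_{ij} b_{jk} c_{ki}, identified by (i, j, k).
TYPE3 = list(itertools.product((1, 2, 3), repeat=3))

def random_pairing():
    """Four summands receive two type 3 terms each; the remaining 19
    terms are assigned to one summand each, giving 23 summands."""
    terms = random.sample(TYPE3, len(TYPE3))   # random order
    pairs = [(terms[2 * i], terms[2 * i + 1]) for i in range(4)]
    singles = [(t,) for t in terms[8:]]
    return pairs + singles

assert len(random_pairing()) == 23
```

The resulting assignment is hardcoded in the SAT formula by fixing the corresponding base variables before the solver is started.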
Since the number of random pairings that could be extended to a valid scheme was very low, we tried adding streamlining constraints [8] to the formulas. A streamlining constraint is a set of clauses that guides the solver to a solution; these clauses need not be (and generally are not) implied by the formula. Streamlining
constraints are usually patterns observed in solutions of a given problem, poten-
tially of smaller sizes. We experimented with various streamlining constraints,
such as enforcing that each type 0, type 1, and type 2 term occurs either zero
times or twice in a scheme (instead of an even number of times). The most effec-
tive streamlining constraint that we came up with was observed in the Smirnov
scheme: for each summand that is assigned a single type 3 term, enforce that (i)
one matrix has either two rows, two columns or a row and a column fully assigned
to zero and (ii) another matrix has two rows and two columns assigned to zero,
i.e., the matrix has a single nonzero entry. This streamlining constraint reduced
the runtime from minutes to seconds. Yet some random pairings may only be
extended to a valid scheme that does not satisfy the streamlining constraint.

3.2 Neighborhood Search

The second method is based on neighborhood search: we select a scheme, randomly fix some of the corresponding base variables, and search for an assignment for the remaining base variables. This simple method turned out to be remarkably effective for finding new schemes. The only parameter of the method is the number of base variables to be fixed. The lower the number of fixed base variables, the higher the probability of finding a different scheme, but also the higher the cost of finding an assignment for the remaining base variables. We experimented with various values, and it turned out that fixing 2/3 of the 621 base variables (414 variables) is effective for finding many new schemes in reasonable time.
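A sketch of the fixing step (Python; the clause interface is our own assumption): given a known scheme as an assignment of the 621 base variables, emit unit clauses for a random two thirds of them and leave the rest to the solver:

```python
import random

def neighborhood_units(scheme, n_fixed=414):
    """scheme maps each of the 621 base variables (1..621) to 0 or 1.
    Returns unit clauses fixing a random n_fixed of them; appended to the
    base CNF, every satisfying assignment is a (possibly new) scheme."""
    fixed = random.sample(sorted(scheme), n_fixed)
    return [[v] if scheme[v] else [-v] for v in fixed]
```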
The neighborhood search is able to find many new schemes, but in almost all cases they have the same pairing of type 3 terms. Only in rare cases is the pairing of type 3 terms different. Figure 1 shows such an example: scheme A
has term a13 b31 c11 in summand 22 and term a23 b33 c32 in summand 23, while the
neighboring scheme B has term a13 b33 c31 in summand 22 and terms a13 b31 c11 ,
a23 b33 c32 , and a13 b33 c31 in summand 23.

4 Evaluation and Analysis

The methods presented in Section 3 enabled us to find several hundred solutions individually, but they were particularly effective when combined. The first method allows finding schemes that can be quite different from the known schemes. However, finding a scheme using that method may require a few CPU hours, as most pairings of type 3 terms cannot be extended to a valid scheme that satisfies the streamlining constraints. The second method can find schemes that are very similar to known ones within a second. The neighborhood of the known solutions turned out to be limited to a few hundred new schemes.
1 (a11 + a13 + a21 + a22 + a23 )(b13 )(c22 + c32 )
2 (a11 + a13 + a23 )(b13 + b32 )(c11 + c22 + c31 + c32 )
3 (a11 + a13 )(b32 )(c21 + c22 + c31 + c32 )
4 (a11 + a31 )(b11 + b12 + b13 )(c23 )
5 (a11 + a33 )(b11 + b13 + b32 )(c11 + c23 )
6 (a12 + a13 + a23 )(b13 + b33 )(c11 + c31 )
7 (a12 + a22 + a32 )(b21 + b22 + b23 )(c33 )
8 (a12 + a31 + a32 + a33 )(b22 )(c23 + c33 )
9 (a12 + a33 )(b13 + b21 + b33 )(c11 + c33 )
10 (a12 )(b13 + b23 + b33 )(c31 + c33 )
11 (a21 + a31 + a33 )(b11 )(c12 + c22 )
12 (a21 )(b11 + b12 + b13 )(c22 )
13 (a22 + a31 + a33 )(b13 + b22 )(c12 + c13 + c22 + c33 )
14 (a22 + a32 + a33 )(b21 )(c13 + c33 )
15 (a22 )(b13 + b21 + b22 )(c12 + c13 )
16 (a22 )(b13 + b23 )(c32 + c33 )
17 (a23 )(b31 )(c11 + c12 + c31 + c32 )
18 (a31 + a33 )(b11 + b13 + b22 )(c12 + c13 + c22 + c23 )
19 (a33 )(b11 + b21 + b31 )(c11 + c13 )
20A (a12 )(b22 )(c21 + c23 )
21A (a11 )(b12 + b32 )(c21 + c23 )
22A (a13 + a33 )(b31 + b32 + b33 )(c11 )
23A (a23 )(b31 + b32 + b33 )(c11 + c31 + c32 )
20B (a11 + a12 )(b22 )(c21 + c23 )
21B (a11 )(b12 + b22 + b32 )(c21 + c23 )
22B (a13 + a33 )(b31 + b32 + b33 )(c31 + c32 )
23B (a13 + a23 + a33 )(b31 + b32 + b33 )(c11 + c31 + c32 )

Fig. 1. Two neighboring schemes with 19 identical summands and 4 different ones.

In contrast, some of the schemes found using the first method have a large
neighborhood. We approximated the size of the neighborhood of a scheme using
the following experiment: Start with a given scheme S and find a neighboring
scheme by randomly fixing 2/3 of the base variables. Once a neighboring scheme
S ′ is found, find a neighboring scheme of S ′ , etc. We ran this experiment on a
machine with 48 cores of the Lonestar 5 cluster of the Texas Advanced Computing Center. We started each experiment using 48 threads, with each thread assigned a different seed. Figure 2 shows the number of different schemes (after sorting) found in 1000 seconds when starting with one of the four known schemes and with a scheme that was computed with the streamlining method. Some of these different schemes are new, while others are equivalent to each other or to known ones. We only guarantee here that they are not identical after sorting the summands.
[Figure 2: number of different schemes (log scale, 1 to 10000) found over time (0 to 1000 seconds); one curve per starting scheme: Streamlining, Smirnov, Oh et al., Courtois et al., Laderman.]

Fig. 2. The number of different schemes (vertical axis in logscale) found within a period of time (horizontal axis in seconds) during a random walk in the neighborhood of a given scheme.

Observe that the number of different schemes found in 1000 seconds depends a lot on the starting scheme. No different neighboring scheme was found for

Laderman’s scheme, only 9 different schemes were found for the scheme of Cour-
tois et al., 94 different schemes were found for the scheme of Oh et al., 561 new
schemes were found for Smirnov’s scheme, and 3359 schemes were found using
a randomly selected new scheme obtained with the streamlining method.
In view of the large number of solutions we found, it is also interesting to
compare them with each other. For example, if we define the support of a solution
as the number of base variables set to 1, we observe that the support seems to
follow a normal distribution with mean around 160, see Fig. 3 for a histogram.
We can also see that the Laderman scheme differs in many ways from all the
other solutions. It is, for example, the only scheme whose core consists of four
quadruples of type 3 terms. In 89% of the solutions, the core consists of four pairs
of type 3 terms, about 10% of the solutions have three pairs and one quadruple, and less than 1% of the schemes have cores of the form 2-2-2-2-3 or 2-2-2-3-4.
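Both invariants are cheap to compute from a solution. A sketch (Python, reusing term_type from the sketch in Section 2; the representation is our own):

```python
def support(alpha, beta, gamma):
    """Number of base variables set to 1, for 0/1 arrays of shape (23, 3, 3)."""
    return int(alpha.sum() + beta.sum() + gamma.sum())

def core_shape(summands):
    """Type 3 counts of the summands carrying more than one type 3 term,
    e.g. (2, 2, 2, 2) for four pairs or (4, 4, 4, 4) for Laderman."""
    counts = [sum(1 for t in s if term_type(*t) == 3) for s in summands]
    return tuple(sorted(c for c in counts if c > 1))
```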

5 Challenges
The many thousands of new schemes that we found may still be just the tip
of the iceberg. However, we also observed that the state-of-the-art SAT solving
techniques are unable to answer several other questions. This section provides
four challenges for SAT solvers with increasing difficulty. For each challenge we
constructed one or more formulas that are available at
https://github.com/marijnheule/matrix-challenges.
The challenges are hard, but they may be doable in the coming years.
[Figure 3: histogram of the number of schemes (vertical axis, up to about 800) by support (horizontal axis, roughly 140 to 180).]
Fig. 3. Number of non-equivalent schemes found, arranged by support.

Challenge 1: Local search without streamlining. Our first method combines randomly pairing the type 3 terms with streamlining constraints. The latter were required to limit the search. We expect that local search solvers can be optimized to solve the formulas efficiently without the streamlining constraints. This may result in schemes that are significantly different from the ones we found. We prepared ten satisfiable formulas with hardcoded pairings of type 3 terms. Five of these formulas can be solved using yalsat in a few minutes. All of these formulas appear hard for CDCL solvers (and many local search solvers).

Challenge 2: Prove unsatisfiability of subproblems. We observed that complete SAT solvers performed weakly on our matrix multiplication instances. It seems
therefore unlikely that one could prove any optimality results for the product
of two 3 × 3 matrices using SAT solvers in the near future. A more realistic
challenge concerns proving unsatisfiability of some subproblems. We prepared
ten formulas with 23 multiplications and hardcoded pairings of type 3 terms.
We expect that these formulas are unsatisfiable.

Challenge 3: Avoiding a type 3 term in a summand. All known schemes have the following property: each summand has at least one type 3 term. We do
not know whether there exists a scheme with 23 multiplications such that one
of the summands contains no type 3 term. The challenge problem blocks the
existence of a type 3 term in the last summand and does not have any additional
(streamlining) constraints.

Challenge 4: Existence of a scheme with 22 multiplications. The main challenge concerns finding a scheme with only 22 multiplications. The hardness of this
challenge strongly depends on whether there exists such a scheme. The repository
contains a plain formula for a scheme with 22 multiplications.

Acknowledgments. The authors acknowledge the Texas Advanced Computing Center at The University of Texas at Austin for providing HPC resources that
have contributed to the research results reported within this paper.
References
1. Armin Biere. CaDiCaL, Lingeling, Plingeling, Treengeling and YalSAT Entering
the SAT Competition 2018. In Proc. of SAT Competition 2018 – Solver and Bench-
mark Descriptions, volume B-2018-1 of Department of Computer Science Series of
Publications B, pages 13–14. University of Helsinki, 2018.
2. Markus Bläser. On the complexity of the multiplication of matrices of small for-
mats. Journal of Complexity, 19(1):43–60, 2003.
3. Markus Bläser. Fast Matrix Multiplication. Number 5 in Graduate Surveys. Theory
of Computing Library, 2013.
4. Richard P. Brent. Algorithms for matrix multiplication. Technical report, Depart-
ment of Computer Science, Stanford, 1970.
5. Peter Bürgisser, Michael Clausen, and Mohammad A. Shokrollahi. Algebraic com-
plexity theory, volume 315. Springer Science & Business Media, 2013.
6. Nicolas Courtois, Gregory V. Bard, and Daniel Hulme. A new general-purpose
method to multiply 3 × 3 matrices using only 23 multiplications. CoRR,
abs/1108.2830, 2011.
7. Hans F. de Groote. On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping. Theoretical Computer Science, 7(1):1–24, 1978.
8. Carla Gomes and Meinolf Sellmann. Streamlined constraint reasoning. In Prin-
ciples and Practice of Constraint Programming (CP 2004), pages 274–289, Berlin,
Heidelberg, 2004. Springer Berlin Heidelberg.
9. Marijn J.H. Heule, Manuel Kauers, and Martina Seidl. New ways to multiply 3 × 3 matrices. In preparation.
10. Julian D. Laderman. A noncommutative algorithm for multiplying 3 × 3 ma-
trices using 23 multiplications. Bulletin of the American Mathematical Society,
82(1):126–128, 1976.
11. Joseph M. Landsberg. Geometry and complexity theory, volume 169. Cambridge
University Press, 2017.
12. Jinsoo Oh, Jin Kim, and Byung-Ro Moon. On the inequivalence of bilinear algo-
rithms for 3×3 matrix multiplication. Information Processing Letters, 113(17):640–
645, 2013.
13. Victor Y. Pan. Fast feasible and unfeasible matrix multiplication. CoRR,
abs/1804.04102, 2018.
14. A. V. Smirnov. The bilinear complexity and practical algorithms for matrix mul-
tiplication. Computational Mathematics and Mathematical Physics, 53(12):1781–
1795, 2013.
15. Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik,
13(4):354–356, 1969.
16. Shmuel Winograd. On multiplication of 2 × 2 matrices. Linear algebra and its
applications, 4(4):381–388, 1971.
