Zi-Song Shen,1,2,∗ Feng Pan,1,∗ Yao Wang,3,∗ Yi-Ding Men,2,4 Wen-Biao Xu,2,4 Man-Hong Yung,3 and Pan Zhang1,4,†

1 CAS Key Laboratory for Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
2 School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3 2012 lab, Huawei Technologies Co., Ltd., Shenzhen 518129, China
4 School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China

(Dated: December 13, 2024)
arXiv:2412.09285v1 [cond-mat.stat-mech] 12 Dec 2024
Finding optimal solutions to combinatorial optimization problems is pivotal in both scientific and technological domains, spanning academic research and industrial applications. Considerable effort has been invested in developing accelerated methods that leverage sophisticated models and harness the power of advanced computational hardware. Despite these advancements, a critical challenge persists: the dual demand for both high efficiency and broad generality in solving problems. In this work, we propose a general method, the Free-Energy Machine (FEM), based on the ideas of free-energy minimization in statistical physics, combined with automatic differentiation and gradient-based optimization in machine learning. The algorithm is flexible, solving various combinatorial optimization problems within a unified framework, and is efficient, naturally utilizing massively parallel computational devices such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). We benchmark our algorithm on various problems, including the maximum cut problem, the balanced minimum cut problem, and the maximum k-satisfiability problem, scaled to millions of variables, across synthetic, real-world, and competition problem instances. The findings indicate that our algorithm not only exhibits exceptional speed but also surpasses the performance of state-of-the-art algorithms tailored to individual problems. This highlights that the interdisciplinary fusion of statistical physics and machine learning opens the door to cutting-edge methodologies that will have broad implications across various scientific and industrial landscapes.
...described by the theory of replica symmetry breaking, which uses the organization of fixed points of mean-field solutions to characterize the features of the rugged landscapes [28].

Here, inspired by replica symmetry breaking, we propose a general method based on minimizing variational free energies at a temperature that is gradually annealed from a high value to zero. The free energies are functions of replicas of variational mean-field distributions and are minimized using gradient-based optimizers from machine learning. We refer to our method as the Free-Energy Machine, abbreviated as FEM. The approach incorporates two major features. First, the gradients of the replicas of free energies are computed via automatic differentiation in machine learning, making the method generic and immediately applicable to various COPs. Second, the variational free energies are minimized using well-established optimization techniques, such as Adam [29], developed in the deep-learning community. Significantly, all replicas of mean-field probabilities are updated in parallel, thereby leveraging the computational power of GPUs for efficient execution and facilitating a substantial speed-up in solving large-scale problems. A pictorial illustration of our algorithm is shown in Fig. 1.

We have evaluated FEM on a wide spectrum of combinatorial optimization challenges, each with unique features. This includes tackling the maximum cut (MaxCut) problem, fundamentally represented by two-state Ising spin glasses; addressing the q-way balanced minimum cut (bMinCut) problem, which aligns with Potts glasses and encapsulates COPs involving more than two states; and solving the maximum k-satisfiability (Max k-SAT) problem, indicative of problems characterized by multi-body interactions. We measured FEM's efficacy by comparing it with the leading algorithms tailored to each specific problem. The comparative analysis reveals that the proposed approach not only competes well but in many instances outperforms these specialized, cutting-edge solvers across the board. This demonstrates FEM's exceptional adaptability and superior performance, both in terms of accuracy and efficiency, across a diverse set of COPs.

RESULTS

Free-Energy Machine

Consider a COP characterized by a cost function, i.e. the energy function in physics, E(σ), that we aim to minimize. Here, σ = (σ_1, σ_2, ..., σ_N) represents a candidate solution or a configuration comprising N discrete variables. The energy function encapsulates the interactions among variables, capturing the essence of various COPs. This is depicted in Fig. 1(a), where it is further delineated into four distinct models, each representing different physical scenarios.

1. One of the simplest cases is the QUBO problem (or the Ising problem) [20], with E(σ) = −Σ_{i<j} W_ij σ_i σ_j, where σ_i ∈ {−1, +1}. The existing Ising solvers are tailor-made to address problems confined to the Ising model category.

2. The COPs permitting variables to take multi-valued states, specifically σ_i ∈ {1, 2, ..., q}, yet maintaining two-body interactions, are categorized under the Potts model [24]. This model is defined as E(σ) = −Σ_{i<j} W_ij δ(σ_i, σ_j), where δ(σ_i, σ_j) is the Kronecker function, yielding the value 1 if σ_i = σ_j and 0 otherwise.

3. Another category of COPs includes those with higher-order interactions in the cost function, yet retains binary spin states. An example is the p-spin (commonly with p > 2) Ising glass model [22], characterized by the energy function E(σ) = −Σ_{i_1<i_2<...<i_p} W_{i_1,i_2,...,i_p} σ_{i_1} σ_{i_2} ··· σ_{i_p}, which integrates interactions among p distinct spins. This class of COPs is also known as the polynomial unconstrained binary optimization (PUBO) problem [30]. We note that considerable efforts have been undertaken to extend existing Ising solvers to high-order architectures [30–33].

4. In more general scenarios, COPs can encompass both multi-valued states and many-body interactions, featuring a simultaneous coexistence of interactions across various orders. We term this class of COPs the general model; it poses more challenges for the design of extended Ising machines.

Our proposed approach aims to address all kinds of problems discussed above (also shown in Fig. 1(a)) using the same variational framework. Within this framework, we focus on analyzing the Boltzmann distribution at a specified temperature

P_B(σ, β) = e^{−βE(σ)} / Z,    (1)

where β = 1/T is the inverse temperature and Z = Σ_σ e^{−βE(σ)} is the partition function. It is important to emphasize that we do not impose any constraints on the specific form of E(σ). Consequently, we extend the traditional Ising model formulations by permitting the spin variable to adopt q distinct states and use P_i(σ_i) to represent the marginal probability of the i-th spin taking value σ_i = 1, 2, ..., q, as illustrated in Fig. 1(b). The ground-state configuration σ^GS that minimizes the energy can be achieved in the zero-temperature limit with

σ^GS = argmin_σ E(σ) = argmax_σ lim_{β→∞} P_B(σ, β).    (2)

As illustrated in Fig. 1(b) and (c), accessing the Boltzmann distribution at zero temperature would allow us to calculate the marginal probabilities P_i(σ_i) and determine the configuration based on the probabilities. However, there are two issues in accessing the zero-temperature Boltzmann distribution.

The first issue is that directly accessing the Boltzmann distribution at zero temperature poses significant challenges due to the rugged energy landscape, often described by the concept of replica symmetry breaking in statistical physics [34, 35].
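As a concrete illustration of the four model classes above (illustrative code, not from the paper; the couplings `W` and `J` below are made-up toy examples), the corresponding cost functions can be evaluated directly:

```python
def ising_energy(W, sigma):
    """E(σ) = -Σ_{i<j} W_ij σ_i σ_j with σ_i ∈ {-1, +1} (QUBO/Ising class)."""
    return -sum(w * sigma[i] * sigma[j] for (i, j), w in W.items())

def potts_energy(W, sigma):
    """E(σ) = -Σ_{i<j} W_ij δ(σ_i, σ_j) with σ_i ∈ {1, ..., q} (Potts class)."""
    return -sum(w for (i, j), w in W.items() if sigma[i] == sigma[j])

def pspin_energy(J, sigma):
    """E(σ) = -Σ J_{i1...ip} σ_i1 ··· σ_ip, higher-order couplings over
    index tuples (PUBO / p-spin class)."""
    total = 0.0
    for spins, coupling in J.items():
        term = coupling
        for i in spins:
            term *= sigma[i]
        total -= term
    return total
```

The general model of item 4 simply allows both multi-valued states and tuples of arbitrary order in the same cost function.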
FIG. 1. Illustration of the Free-Energy Machine in solving combinatorial optimization problems. (a). Four distinct models offering representations of different types of combinatorial optimization problems. (b). In a general combinatorial optimization problem (COP), each spin variable is capable of adopting one of q distinct spin states, denoted as 1, 2, ..., q. The likelihood of assuming any given spin state is described by the marginal probability P_i(σ_i). (c). To tackle a COP characterized by the cost function E(σ), the primary computational challenge involves calculating the marginal probability from the zero-temperature Boltzmann distribution. This calculation can be effectively approximated through a gradient-based annealing approach applied to the variational mean-field free energy F_MF. By determining the optimal mean-field distribution P*_MF(σ) that minimizes F_MF, it becomes possible to ascertain the ground state of the COP. This is achieved by identifying the most probable spin state for each spin variable, utilizing the set of marginal probabilities P*_i(σ_i).
To navigate this issue and facilitate a more manageable exploration of the landscape, we employ the strategy of annealing, which deals with the Boltzmann distribution at finite temperature. This temperature is initially set high and is gradually reduced to zero.

The second issue is how to represent the Boltzmann distribution. Exactly computing the Boltzmann distribution belongs to the computational class of #P, so we need to approximate it efficiently. Many approaches have been proposed, including Markov-chain Monte Carlo [3], mean-field and message-passing algorithms [28], and neural-network methods [36, 37]. In this work, we use the variational mean-field distribution P_MF(σ) = Π_i P_i(σ_i) to approximate the Boltzmann distribution P_B(σ, β). The parameters of P_MF can be determined by minimizing the Kullback-Leibler divergence D_KL(P ∥ P_B) = Σ_σ P(σ) ln (P(σ)/P_B(σ, β)), and this is equivalent to minimizing the variational free energy

F_MF = Σ_σ P_MF(σ) E(σ) + (1/β) Σ_σ P_MF(σ) ln P_MF(σ).    (3)

While the mean-field distribution may not boast the expressiveness of, for instance, neural-network ansatzes, our findings indicate that it provides a precise representation of ground-state configurations at zero temperature via the annealing process. Furthermore, a significant advantage of the mean-field variational distribution is the capability for exact computation of the gradients of the mean-field free energy. This stands in stark contrast to variational distributions utilizing neural networks, where gradient computation necessitates stochastic sampling [36].

The pictorial illustration of implementing the FEM algorithm is depicted in Fig. 2. Given a COP defined on a graph, we associate the N spin variables (each spin has q states) with the variational variables represented by the N × q marginal probabilities {P_i(σ_i)} for the mean-field free energy. Then we parameterize the marginal probability P_i(σ_i) using fields {h_i(σ_i)} with a softmax function, P_i(σ_i) = exp[h_i(σ_i)] / Σ_{σ'_i=1}^{q} exp[h_i(σ'_i)] for a q-state variable, as illustrated in Fig. 2(a). This parameterization releases the normalization constraints on the variational variables {P_i(σ_i)} (ensuring the probabilistic interpretation Σ_{σ_i} P_i(σ_i) = 1 during the variational process). Moreover, our approach considers constraints on variables, such as the total number of spins with a particular value, or a global property that a configuration must satisfy.

At a high temperature, there could be just one mean-field distribution that minimizes the variational free energy. However, at a low temperature, there could be many mean-field distributions, each of which has a local free-energy minimum, corresponding to a set of marginal probabilities. Inspired by the one-step replica-symmetry-breaking theory of spin glasses, we use a set of marginal distributions, which we term m replicas of mean-field solutions, with parameters {P^a_i(σ_i) | i = 1, 2, ..., N; a = 1, 2, ..., m}, each of which is updated to minimize the corresponding mean-field free energy; the minimization is carried out using machine-learning optimization techniques. This approach notably enhances both the number of parameters and the expressive capability of the mean-field ansatz.
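The variational objective of Eq. (3), with the softmax parameterization just described, can be sketched for a pairwise Potts-type cost (a minimal NumPy sketch; the function names and the edge-list format are our own):

```python
import numpy as np

def marginals(h):
    """P_i(σ) = exp h_i(σ) / Σ_σ' exp h_i(σ'), computed row-wise (softmax).
    Shifting by the row maximum only improves numerical stability."""
    p = np.exp(h - h.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def potts_free_energy(h, edges, beta):
    """F_MF = U_MF - S_MF / β (Eq. (3)) for a pairwise Potts-type cost
    E(σ) = -Σ_{(i,j)∈E} W_ij δ(σ_i, σ_j).  Under the factorized P_MF the
    expectation of δ(σ_i, σ_j) is the overlap Σ_σ P_i(σ) P_j(σ)."""
    p = marginals(h)                                   # shape (N, q)
    u = -sum(w * (p[i] @ p[j]) for i, j, w in edges)   # internal energy U_MF
    s = -np.sum(p * np.log(p + 1e-12))                 # mean-field entropy S_MF
    return u - s / beta
```

Because P_MF factorizes, both terms of Eq. (3) reduce to sums over single-site marginals, which is what makes the exact gradient computation mentioned above possible.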
FIG. 2. The framework of implementing the Free-Energy Machine. (a). Given a COP with N spin variables, the spin states are associated with N × q variables called the local fields {h_i(σ_i)}. The marginal probabilities for all spin states are calculated from the local fields through the softmax function, and the local fields serve as the genuine variational variables, consistently guaranteeing the probabilistic interpretation of {P_i(σ_i)} (Σ_{σ_i} P_i(σ_i) = 1). The gradients of the mean-field free energy F_MF with respect to the local fields, denoted as {g_{h_i}(σ_i)}, can be computed via automatic differentiation or through the explicit gradient formulations. In the explicit gradient formulations, {g_{h_i}(σ_i)} can be computed via the chain rule using the explicit gradients {g^p_i(σ_i)} (please refer to the Supplementary Materials for details). The well-developed continuous optimizers from the deep-learning community can be employed for the optimization. With the annealing, the local fields are updated to optimize F_MF. (b). The schematic of U_MF and S_MF evolving with the inverse temperature β during the optimization process. β_c is the critical inverse temperature at which the distribution transition occurs. (c). The corresponding evolutions of the marginal probabilities with the annealing, depicted at the three stages marked in (b). Inspired by the one-step replica-symmetry breaking in statistical physics [28], we can update the marginal probabilities of many replicas in parallel. (d). Reading out the optimal solution from the replicas of marginal probabilities at the end of annealing.
The parameters of the replicas of the mean-field distributions are determined by minimizing the variational free energies, and the gradients can be computed using automatic differentiation. This process is very similar to computing gradients of the loss function with respect to the parameters in deep neural networks. It amounts to expanding the computational process as a computational graph and applying the back-propagation algorithm. Thanks to standard deep-learning frameworks such as PyTorch [38] and TensorFlow [39], it can be implemented using just several lines of code, as shown in the Methods section. Remarkably, for different combinatorial optimization problems, we only need to specify the form of the energy expectation Σ_σ P_MF(σ)E(σ) as a function of the marginal probabilities. Beyond leveraging automatic differentiation for gradient computation, we have the option to delineate the explicit gradient formulas for the problem. Utilizing explicit gradient formulations can halve the computational time. Another merit of adopting explicit gradients lies in the possibility of further enhancing our algorithm's stability through additional gradient manipulations, beyond merely employing adaptive learning rates and momentum techniques, as seen in gradient-based optimization methods in machine learning [29].

With gradients computed, we adopt the advanced gradient-based optimization methods developed in the deep-learning community for training neural networks, such as Adam [29], to update the parameters. They can efficiently maintain individual adaptive learning rates for each marginal probability from the first and second moments of the gradients, and require minimal memory overhead, so they are well-suited for updating marginal probabilities in our algorithm.

Fig. 2(b) shows a schematic of the typical evolution of the internal energy U_MF = Σ_σ P_MF(σ)E(σ) and the entropy S_MF = −Σ_σ P_MF(σ) ln P_MF(σ) as a function of β. The corresponding evolutions of {P_i(σ_i)} of the replicas with the annealing are depicted in Fig. 2(c). All mean-field probabilities of the replicas are updated in parallel. Initially, the fields {h_i(σ_i)} associated with spin i are randomly initialized around zero, making the q-state marginal distributions {P_i(σ_i)} approximately uniform at 1/q.
FIG. 3. Benchmarking results for the maximum cut problem. (a). The benchmarking results for solving the MaxCut problem on the complete graph K2000 with 2000 nodes. The graphical representations include squares, solid lines, and circles, which respectively illustrate the maximum, average, and minimum cut values as a function of the total number of annealing steps N_step. For each N_step, we implement R = 1000 mean-field replicas using FEM, showcasing the distribution of cut values derived from these replicas. In parallel, the discrete Simulated Bifurcation Machine (dSBM) was executed for 1000 trials with random initial conditions for each N_step to serve as a comparative benchmark. The best-known cut value for K2000 is 33337, with 99% of the best-known cut value approximately reaching 33004. An inverse-proportional annealing schedule for β (T_max = 1.16, T_min = 6 × 10^−5) was applied. Please refer to the main text and Supplementary Materials for more details. (b). The G-set instances commonly exhibit node-degree distributions that categorize into three distinct types of graphs. (c). Time-to-solution (TTS) benchmarking comparison of FEM and dSBM across G-set instances, which range from G1 to G54 and include graphs with 800 to 2000 nodes. The data for dSBM is referenced from the study in [11]. The G-set instances are visually categorized in the plot by regions marked with different colors, each color representing one of the three distinct graph types identified in these instances. Graph instances that exhibit shorter TTSs are notably highlighted with circles positioned at the lower section of the plot, indicating superior performance in those cases. For additional insights into the methodologies employed and the experimental setup, please refer to the main text and Supplementary Materials.
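The TTS99 statistic used in panel (c) is straightforward to compute (a sketch; the argument names are ours):

```python
import math

def tts99(t_com, p_s):
    """TTS99 = T_com · log(1 - 0.99) / log(1 - P_S), the expected time to
    reach the best-known solution with 99% confidence; when the per-trial
    success probability P_S already exceeds 0.99, it is just T_com."""
    if p_s >= 0.99:
        return t_com
    return t_com * math.log(1.0 - 0.99) / math.log(1.0 - p_s)
```

For example, a solver that succeeds in half of its trials needs roughly 6.6 restarts' worth of time to reach 99% confidence.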
In the first stage indicated in Fig. 2(b), since β is small (i.e. at a high temperature), the entropy term S_MF predominantly governs the energy landscape of F_MF. The uniform distributions of {P_i(σ_i)} indicate that all possible spin configurations emerge with equal importance. Consequently, they maximize S_MF (i.e. minimize F_MF at a fixed β), and the value of S_MF remains around its maximal value in this stage. In the second stage, when β increases to some critical value β_c, U_MF becomes the predominant factor and the distribution transition occurs. As a consequence, the internal energy plays a more important role in minimizing F_MF, leading {P_i(σ_i)} to deviate from the uniform distributions and explore different mean-field solutions. In the third stage, when β is sufficiently large, U_MF gradually converges to a minimum value of F_MF, and {P_i(σ_i)} gradually converges to approximate the zero-temperature Boltzmann distribution, where the ground states are the most probable states to occur.

After the annealing process, as shown in Fig. 2(d), the temperature is decreased to a very low value, and we obtain a configuration for each replica according to the marginal probabilities, as

σ̃^a_i = argmax_{σ_i} P^a_i(σ_i).    (4)

Then we choose the configuration with the minimum energy from all replicas,

σ* = argmin_{σ̃^a} E(σ̃^a).    (5)

We emphasize that all the gradient computation and parameter updating on replicas can be processed in parallel
and is very similar to the computation in deep neural networks: the overall computation only involves batched matrix multiplications and element-wise nonlinear functions. Thus it fully utilizes massively parallel computational devices such as GPUs.

Applications to the Maximum-Cut problem

We begin by evaluating the performance of our algorithm on Quadratic Unconstrained Binary Optimization (QUBO) problems. To illustrate, we select the MaxCut problem as a representative example. This NP-complete problem is widely applied in various fields, including machine learning and data mining [40], the design of electronic circuits [41], and social network analysis [42]. Furthermore, it serves as a prevalent testbed for assessing the efficacy of new algorithms aimed at solving QUBO problems [10, 11, 43]. The optimization task in the MaxCut problem is to determine an optimal partition of the nodes of an undirected weighted graph into q = 2 groups, in such a way that the sum of weights on the edges connecting two nodes belonging to different partitions (i.e. the cut size C) is maximized. Formally, we define the energy function as

E(σ) = −Σ_{(i,j)∈E} W_ij [1 − δ(σ_i, σ_j)],    (6)

where E is the edge set of the graph, W_ij is the weight of edge (i, j), and δ(σ_i, σ_j) stands for the delta function, which takes value 1 if σ_i = σ_j and 0 otherwise. Then the variational mean-field free energy for the MaxCut problem can be written out (see Methods section) and the gradients of F_MF with respect to the variational parameters can be computed via automatic differentiation. In general, writing out the explicit formula for the gradients is not necessary, as the gradients computed using automatic differentiation are numerically equal to the explicit formulations. However, for the purpose of benchmarking, using the explicit formula results in lower computation time. Moreover, in practice, we can further apply normalization and clipping of gradients to enhance the robustness of the optimization process; we refer to the Supplementary Materials for details.

We first benchmark FEM by solving a 2000-spin MaxCut problem named K2000 [44] with all-to-all connectivity, which has been intensively used in evaluating MaxCut solvers [10, 11, 45]. We compare the results obtained by FEM with the discrete Simulated Bifurcation Machine (dSBM) [11], which can be identified as the state-of-the-art solver for the MaxCut problem. Fig. 3(a) shows the cut values we obtained for the K2000 problem with different total annealing steps N_step used to increase β from β_min to β_max. Similar to its role in FEM, the hyperparameter N_step introduced in dSBM controls the total number of annealing steps for the bifurcation parameter. To investigate the distribution of energy over all replicas, we plot in the figure the minimum, average, and maximum cut values as a function of N_step, with R = 1000 mean-field replicas for FEM. The results of FEM are compared with the dSBM algorithm, which we also ran for 1000 trials from random initial conditions. From the results, we can see that the best value of FEM achieves the best-known result for this problem in fewer than 1000 annealing steps, and all the maximum, average, and minimum cut sizes are better than those of dSBM.

We also evaluate our algorithm using the standard benchmark for the MaxCut problem, the G-set [11, 45, 46]. The G-set benchmark contains various graphs, including random, regular, and scale-free graphs based on the distribution of node degrees, as shown in Fig. 3(b). Each problem has a best-known solution which is regarded as the ground truth for evaluating algorithms. A commonly used statistic for quantitatively assessing both the accuracy and the computational speed in the MaxCut problem is the time-to-solution (TTS) [11, 43], which measures the average time to find a good solution by running the algorithm many times (trials) from different initial states. TTS (or TTS99) is formulated as T_com · log(1 − 0.99)/log(1 − P_S), where T_com represents the computation time per trial, and P_S denotes the success probability of finding the optimal cut value over all tested trials. When P_S ≥ 0.99, the TTS is defined simply as T_com. The value of P_S is typically estimated from experimental results comprising numerous trials.

In Fig. 3(c) we present the TTS results obtained by FEM for 54 problem instances in the G-set (G1 to G54, with the number of nodes ranging from 800 to 2000) and compare to the reported data for dSBM using GPUs. We can see that FEM surpasses dSBM in TTS for 33 out of the 54 instances, and notably for all G-set instances of scale-free graphs, FEM achieves better performance than dSBM. The primary reason is that we employed advanced normalization techniques for the optimization; please refer to the Supplementary Materials for more details. Furthermore, we notice that FEM outperforms state-of-the-art neural-network-based methods in combinatorial optimization. For instance, the physics-inspired graph neural network (PI-GNN) approach [48] has been shown to outperform other neural-network-based methods on the G-set dataset. From the data in [48] we see that the results of PI-GNN on the G-set instances (with estimated runtimes on the order of tens of seconds) still show significant discrepancies from the best-known results of the G-set instances, e.g. for G14, G15, and G22, while our method FEM achieves the best-known results for these instances using only several milliseconds.

Applications to the q-way balanced minimum cut problem

Next, we choose the q-way balanced MinCut (bMinCut) problem [49] as the second benchmarking type to evaluate the performance of FEM on directly addressing multi-valued problems featuring the Potts model. The q-way bMinCut problem asks to group the nodes of a graph into q groups with a minimum cut size and balanced group sizes. The requirement to balance group sizes imposes a global constraint on the configurations, rendering the problem more complex
Instance   q    Best known   METIS         KaFFPaE    FEM
add20      2    596(0)       722(0)        597(0)     596(0)
add20      4    1151(0)      1257(0)       1158(0)    1152(0)
add20      8    1678(0)      1819(0.007)   1693(0)    1690(0)
add20      16   2040(0)      2442(0)       2054(0)    2057(0)
add20      32   2356(0)      2669(0.04)    2393(0)    2383(0)
3elt       2    90(0)        90(0)         90(0)      90(0)
3elt       4    201(0)       208(0)        201(0)     201(0)
3elt       8    345(0)       380(0)        345(0)     345(0)
3elt       16   573(0)       636(0.004)    573(0)     573(0)
3elt       32   960(0)       1066(0)       966(0)     963(0)
data       2    189(0)       211(0)        189(0)     189(0)
data       4    382(0)       429(0)        382(0)     382(0)
data       8    668(0)       737(0)        668(0)     669(0)
data       16   1127(0)      1237(0)       1138(0)    1129(0)
data       32   1799(0)      2023(0)       1825(0)    1815(0)
bcsstk33   2    10171(0)     10205(0)      10171(0)   10171(0)
bcsstk33   4    21717(0)     22259(0)      21718(0)   21718(0)
bcsstk33   8    34437(0)     36732(0.001)  34437(0)   34440(0)
bcsstk33   16   54680(0)     58510(0)      54777(0)   54697(0)
bcsstk33   32   77410(0)     83090(0.004)  77782(0)   77504(0)

TABLE I. Benchmarking results for the q-way balanced minimum cut problem, using real-world graph instances from Chris Walshaw's archive [47]. The graph partitioning was tested with the number of groups q set to 2, 4, 8, 16, and 32 across all solvers. The table showcases the minimum cut values obtained by three different solvers. The numbers in parentheses represent the imbalance ϵ (with ϵ = 0 being ideal) satisfying the condition |Π_n| ≤ (1 + ϵ)⌈N/q⌉ for n = 1, 2, ..., q, where N denotes the number of nodes and |Π_n| indicates the size of each group. For each graph, the lowest cut values are emphasized in bold.
than unconstrained problems such as the MaxCut problem. Here we formulate the energy function with a soft-constraint term,

E(σ) = Σ_{(i,j)∈E} W_ij [1 − δ(σ_i, σ_j)] + λ Σ_i Σ_{j≠i} δ(σ_i, σ_j),    (7)

where λ is the parameter of the soft constraint, which controls the degree of imbalance. Based on the energy formulation, the expression of F_MF can be explicitly formulated (see Methods section), and its gradients can be calculated through automatic differentiation or derived analytically. In practice, we also used gradient normalization and gradient clipping to enhance the robustness of the optimization; we refer to the Supplementary Materials for detailed discussions. It is worth noting that the bMinCut problem bears a significant resemblance to the community detection problem [26]. In the latter, the imbalance constraint is often substituted with constraints derived from a random configuration model [26] or a generative model [50, 51], to avoid the trivial solution that puts all the nodes into a single group. Thus FEM can be easily adapted to the community detection problem.

To evaluate the performance of FEM in solving the q-way bMinCut problem, we conduct numerical experiments using four large real-world graphs from Chris Walshaw's archive [47]. These include add20 with 2395 nodes and 7462 edges, data with 2851 nodes and 15093 edges, 3elt, which comprises 4720 nodes and 13722 edges, and bcsstk33 with 8738 nodes and 291583 edges. These graphs have been widely used in benchmarking q-way bMinCut solvers, e.g. by D-Wave for benchmarking their quantum annealing hardware [49]. However, their work only presents results for q = 2 partitioning, owing to the constraints of the quantum hardware. Here, we focus on the perfectly balanced problem, which requires all group sizes to be the same. We evaluate the performance of FEM by partitioning the graphs into q = 2, 4, 8, 16, 32 groups. For comparison, we utilized two state-of-the-art solvers tailored to the q-way bMinCut problem: METIS [52] and KAHIP (alongside its variant KaFFPaE, specifically engineered for balanced partitioning) [53], the latter being the winner of the 10th DIMACS challenge. The benchmarking results are shown in Tab. I, where we can observe that the results obtained by FEM consistently and considerably outperform METIS in all the problems and for all numbers of groups q. Moreover, in some instances METIS failed to find a perfectly balanced solution, while the results found by FEM are perfectly balanced in all cases. We observe that FEM performs comparably to KaFFPaE for small numbers of groups q, and significantly outperforms KaFFPaE for large q. We have also evaluated the performance of FEM on extensive random graphs comprising up to one million nodes. The outcomes are depicted in Fig. 4. As observed from the figure, FEM achieves significantly lower cut values than METIS across the same collection of graphs when partitioned into q = 4, 8, and 16 groups. Notably, this performance advantage is maintained as the number of nodes increases to one million. The comparisons demonstrate FEM's exceptional scalability in solving large-scale q-way bMinCut problems.

Since the bMinCut model finds many real-world applications in parallel computing and distributed systems, data clustering, and bioinformatics [54, 55], we then apply FEM to address a challenging real-world problem of chip verification [56]. To identify and correct design defects in a chip before manufacturing, operators or computing units need to be deployed on a hardware platform consisting of several processors (e.g. FPGAs) for logic and function verification. Due to the limited capacity of a single FPGA and the restricted communication bandwidth among FPGAs, a large number of operators need to be uniformly distributed across the available FPGAs, while minimizing the communication volume among operators on different FPGAs. The schematic illustration is shown in Fig. 5(a). This scenario resembles load balancing in parallel computing and minimiz-
FIG. 4. The scalability test of FEM in the q-way bMinCut problems on the Erdös-Rényi random graphs. The generated random
graphs contain a number of nodes ranging from 1,000 to 1,000,000, each with an average degree of 5. We benchmark the scalability of FEM
with METIS over these generated random graphs. The number of partitions we evaluate are q = 2, 4, 8. Each box plot contains 50 points,
representing the cut value per node, obtained from 50 runs of the algorithms. The number of replicas of FEM is set to R = 50 (in each run)
throughout the experiments.
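Random instances of the kind described in the caption can be generated along these lines; the sketch below is our own plain-Python illustration (the paper's actual generator and random seeds are not specified here), sampling an Erdős-Rényi graph G(n, p) with p chosen so that the expected average degree matches a target value.

```python
import random

def erdos_renyi_edges(n, avg_degree, seed=0):
    """Sample the edge list of G(n, p) with p = avg_degree / (n - 1),
    so that the expected average degree equals avg_degree."""
    p = avg_degree / (n - 1)
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

edges = erdos_renyi_edges(1000, 5.0)
print(2 * len(edges) / 1000)  # empirical average degree, close to 5
```

By concentration of the binomial edge count, the empirical average degree for n = 1000 lies within a few percent of the target.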
[Figure 5: panel (a) illustrates operators/computing units deployed onto a board of FPGAs; panel (b) shows the edge cut (communication volume) for q = 4, 8, 16 FPGAs: METIS 43,860 / 73,223 / 107,935 versus Free-Energy Machine 34,142 / 53,748 / 83,650.]
FIG. 5. The application of FEM in large-scale FPGA-chip verification tasks. (a). In the chip verification task, a vast number of operators
or computing units with logical interconnections need to be uniformly deployed onto a hardware platform consisting of several FPGAs. The
operators are partitioned into several groups corresponding to different FPGAs, and the communication volume between these groups of
operators should be minimized. This can be modeled as a balanced minimum cut problem. (b). The results of FEM, along with a comparison
to METIS, are presented for a large-scale real-world dataset consisting of 1,495,802 operators and 3,380,910 logical operator connections
deployed onto (q = 4, 8, 16) FPGAs (see Supplementary Materials for details).
ing the edge cut while maintaining balanced partitions, and can be modeled as a balanced minimum cut problem [57]. In this work, we address a large-scale real-world chip verification task that deploys 1,495,802 operators (viewed as nodes) and 3,380,910 logical operator connections (viewed as edges) onto q = 4, 8, 16 FPGAs, and we apply FEM to solve it. Since the dataset contains many locally connected structures among operators, we first coarsen the entire graph before partitioning it. Unlike the matching method used in the coarsening phase of METIS [52], we apply the Louvain algorithm [58] to identify community structures; nodes within the same community are coarsened together. The results are shown in Fig. 5(b), along with comparative results provided by METIS (see Supplementary Materials for more details; we did not include results for KaFFPaE, as its open-source implementation [59] runs very slowly and exceeds acceptable time limits on large-scale graphs). From the figure, we observe that the edge cut given by FEM is 22.2%, 26.6%, and 22.5% smaller than that of METIS for 4, 8, and 16 FPGAs, respectively, which significantly reduces the amount of communication among FPGAs and shortens the chip verification time.

Application to the Max k-SAT problem

Lastly, we evaluate FEM on COPs with higher-order spin interactions, using the constraint Boolean satisfiability (SAT) problem. In this problem, M logical clauses, denoted C_1, C_2, ..., C_M, are applied to N Boolean variables. Each clause is a disjunction of literals, namely C_m = l_1 ∨ l_2 ∨ ... ∨ l_{k_m} (k_m is the number of literals in clause C_m). A literal can be a Boolean variable σ_i or its negation ¬σ_i. The clauses are collectively expressed in conjunctive normal form (CNF). For example, the CNF formula C_1 ∧ C_2 ∧ C_3 = (σ_1 ∨ ¬σ_2 ∨ ¬σ_3) ∧ (¬σ_1 ∨ σ_4) ∧ (σ_2 ∨ ¬σ_3 ∨ ¬σ_4) is composed of 4 Boolean variables and 3 clauses. Note that a clause is satisfied if at least one of its literals is true, and is unsatisfied if no literal is true. The SAT problem is thus a typical many-body interaction problem with higher-order spin interactions. The decision version of the SAT problem asks whether there exists an assignment of the Boolean variables that satisfies all clauses simultaneously (i.e., the CNF formula is true). The optimization version is the maximum satisfiability (MaxSAT) problem, which asks for an assignment of the variables that maximizes the number of satisfied clauses. When each clause comprises exactly k literals (i.e., k_m = k), the problem is the k-SAT problem, one of the earliest recognized NP-complete problems (when k ≥ 3) [62, 63]. These problems are pivotal in computational complexity theory. We benchmark FEM on the Max k-SAT problem, which is NP-hard for any k ≥ 2. In our framework, the energy function for the Max k-SAT problem is formulated as the number of unsatisfied clauses,

E(\sigma) = \sum_{m=1}^{M} \prod_{i \in \partial m} \left[ 1 - \delta(W_{mi}, \sigma_i) \right], \qquad (8)

where σ_i ∈ {0, 1} is a Boolean variable, ∂m denotes the set of Boolean variables that appear in clause C_m, W_{mi} = 0 if the literal of variable i is negated in clause C_m, and W_{mi} = 1 otherwise. Note that, in the case of Boolean SAT, W_{mi} ∈ {0, 1} corresponds to the two states of the spin variables. The energy function can be generalized to any MaxSAT problem, where clauses may vary in the number of literals, and to non-Boolean SAT problems where W_{mi} has q > 2 states. The expression for F_MF can be found in the Methods section; for its explicit gradients, please refer to Supplementary Materials.

We assess the performance of FEM using the dataset of the MaxSAT 2016 competition [61]. The competition problems encompass four distinct categories: "s2" and "s3" by Abrame-Habet, and "HG3" and "HG4" with high-girth sets. The "s2" category consists of Max 2-SAT problems with N ∈ [120, 200] and M ∈ [1200, 2600]. The "s3" category includes Max 3-SAT problems with N ∈ [70, 110] and M ∈ [700, 1500]. For the "HG3" category, the Max 3-SAT problems feature N ∈ [250, 300] and M ∈ [1000, 1200]. Lastly, the "HG4" category contains Max 4-SAT problems with N ∈ [100, 150] and M ∈ [900, 1350]. Fig. 6 shows the benchmarking results for all 454 competition instances in the four categories (see Supplementary Materials for the experimental details).

In Fig. 6(a), we illustrate the quality of solutions found by FEM using the energy difference ∆E between FEM and the best-known results for all problem instances. To provide a comprehensive comparison, we also present the documented results for the competition problems achieved by a state-of-the-art solver, the continuous-time dynamical heuristic (Max-CTDS), as reported in Ref. [60]. The results in Fig. 6(a) show that FEM found the optimal solution in 448 out of 454 problem instances; for the 6 instances where FEM did not achieve the optimal solution, it found a solution with an energy gap of 1 to the best-known solution. We also see that FEM outperforms Max-CTDS on all instances.

In Fig. 6(b), we list the computational time of FEM on a GPU for each instance of the MaxSAT 2016 competition problems and compare it with the computation time of the special-purpose incomplete MaxSAT solvers (on CPU) in the 2016 competition [61]. For each instance, we chart only the minimum computational time an incomplete solver needed to reach the best-known result, as documented across all incomplete MaxSAT solvers [61]; note that the fastest incomplete solver can differ from instance to instance. The data presented in the figure clearly demonstrate that FEM consistently and significantly outperforms the quickest incomplete MaxSAT solvers from the competition. On average, FEM achieves a computational time of 0.074 seconds across all instances, with a standard deviation of 0.077 seconds; the computational time ranges from as short as 0.018 seconds for the "s2v200c1400-2.cnf" instance to as long as 1.17 seconds for the "HG-3SAT-V250-C1000-14.cnf" instance. A key factor contributing to the rapid computation time of FEM is its ability to leverage the extensive parallel processing capabilities of GPUs, which accelerate computations by approximately tenfold compared with CPU processing. Nevertheless, even when running on a CPU, FEM significantly outpaces most of the competitors in the SAT competition: Max-CTDS demands an average of 4.35 hours to approximate an optimal assignment across all instances, as reported in [60], whereas FEM on a CPU completes the same task in just a few seconds on average. Our benchmarking results demonstrate that FEM surpasses contemporary leading solvers in both accuracy and computational speed on the problems of the MaxSAT 2016 competition.

DISCUSSION

We have presented a general and high-performance approach for solving COPs using a unified framework in-
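To make the energy function of Eq. (8) concrete, here is a minimal plain-Python sketch. The clause encoding as (variable index, W_mi) pairs and the helper name `unsat_count` are our own illustrative choices, not the paper's implementation; the clause list mirrors the three-clause CNF example discussed in the text.

```python
def unsat_count(clauses, sigma):
    """Energy of Eq. (8): the number of unsatisfied clauses.

    Each clause is a list of (i, w) pairs, where w = 0 means the literal
    on variable i is negated and w = 1 means it is not. A literal is true
    when delta(w, sigma_i) = 1, i.e. sigma[i] == w, so a clause is
    unsatisfied only if sigma[i] != w for every one of its literals.
    """
    return sum(all(sigma[i] != w for i, w in clause) for clause in clauses)

# A three-clause CNF over 4 variables: (x1 v ~x2 v ~x3) ^ (~x1 v x4) ^ (x2 v ~x3 v ~x4)
clauses = [[(0, 1), (1, 0), (2, 0)],
           [(0, 0), (3, 1)],
           [(1, 1), (2, 0), (3, 0)]]

print(unsat_count(clauses, [1, 0, 0, 1]))  # 0: all clauses satisfied
print(unsat_count(clauses, [1, 1, 1, 0]))  # 1: the second clause is violated
```

Minimizing this count over σ is exactly the Max k-SAT objective in the form used by FEM.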
[Figure 6: panel (a) shows the energy difference ∆E versus instance index for Max-CTDS and the Free-Energy Machine over the four categories s2, s3, HG3, and HG4; panel (b) shows the computation time (sec.) versus instance index for the Free-Energy Machine and the SC2016 incomplete solvers SsMonteCarlo, Ramp, borealis, CCLS, CnC-LS, Swcca-ms, HS-Greedy, and CCEHC.]
FIG. 6. Benchmarking results on the MaxSAT 2016 competition problems. (a). The results for the energy differences, ∆E = Emin − Ebkr ,
which represent the gap between the minimal energy values found by the solvers (Emin ) and the best-known results documented in the
literature (Ebkr ) with E indicating the number of unsatisfied clauses. These findings cover all 454 problem instances across four competition
categories: “s2”, “s3”, “HG3”, and “HG4”. For further discussion on these results, please see the main text. The Max-CTDS algorithm data
were obtained from Ref. [60]. (b). The computation time (measured by the running time of all replicas) of the Free-Energy Machine (FEM)
to achieve these outcomes against the leading incomplete solvers from the 2016 competition across different instances. The performance
data for these incomplete solvers were sourced from the 2016 MaxSAT competition documentation [61]. For detailed information on the
experimental setup for FEM, refer to the main text and Supplementary Materials.
spired by statistical physics and machine learning. The proposed method, FEM, integrates three critical components. First, FEM employs the variational mean-field free-energy framework from statistical physics, which facilitates the natural encoding of diverse COPs, including those with multi-valued states and higher-order interactions; this attribute renders FEM an exceptionally versatile approach. Second, inspired by replica-symmetry-breaking theory, FEM maintains a large number of replicas of mean-field free energies, exploring the mean-field space to efficiently find an optimal solution. Third, the mean-field free energies are computed and minimized using machine-learning techniques, including automatic differentiation, gradient normalization, and gradient-based optimization, offering a general framework for different kinds of COPs and enabling massive parallelization and fast computation on modern GPUs.

We have executed comprehensive benchmark tests on a variety of optimization challenges, each exhibiting unique features: the MaxCut problem, characterized by two-state, two-body interactions without constraints; the bMinCut problem, defined by q-state, two-body interactions with global constraints; and the Max k-SAT problem, which involves two-state, many-body interactions. The outcomes of our benchmarks clearly show that FEM markedly surpasses contemporary algorithms tailored specifically for each problem, demonstrating its superior performance across these diverse optimization scenarios. Beyond the benchmarking problems showcased in this study, we also extend our modeling to a broader spectrum of combinatorial optimization problems to which FEM can be directly applied; for further details, please consult the Supplementary Materials.

In this study, our exploration was confined to the most fundamental mean-field theory within statistical physics. However, more sophisticated mean-field theories exist, such as the Thouless-Anderson-Palmer (TAP) equations associated with the TAP free energy, and belief propagation, which connects to the Bethe free energy. These advanced theories could be integrated into the FEM framework, offering capabilities that may surpass those of the basic mean-field approach; we leave this to future work.
METHODS

The variational mean-field free energy formulations

As outlined in the opening of the Results section, FEM addresses COPs by minimizing the variational mean-field free energy through a process of annealing from high to low temperatures. To tackle a specific COP, we commence by constructing the variational mean-field free-energy formulation for the problem at hand. Here, we establish these formulations for the MaxCut, bMinCut, and Max k-SAT problems benchmarked in this study. The derivation details can be found in Supplementary Materials.

Starting with the MaxCut problem, the variational free energy reads

F_{\rm MF}^{\rm MaxCut}(\{P_i(\sigma_i)\}, \beta) = -\sum_{(i,j)\in E} \sum_{\sigma_i} W_{ij}\, P_i(\sigma_i)\,[1 - P_j(\sigma_i)] + \frac{1}{\beta} \sum_i \sum_{\sigma_i} P_i(\sigma_i) \ln P_i(\sigma_i); \qquad (9)

for the bMinCut problem, as

F_{\rm MF}^{\rm bMinCut}(\{P_i(\sigma_i)\}, \beta) = \sum_{(i,j)\in E} \sum_{\sigma_i} W_{ij}\, P_i(\sigma_i)\,[1 - P_j(\sigma_i)] + \lambda \sum_{\sigma_i} \Big[ \sum_{i,j} P_i(\sigma_i)\, P_j(\sigma_i) - \sum_i P_i^2(\sigma_i) \Big] + \frac{1}{\beta} \sum_i \sum_{\sigma_i} P_i(\sigma_i) \ln P_i(\sigma_i); \qquad (10)

and for the Max k-SAT problem, as

F_{\rm MF}^{\rm MaxSAT}(\{P_i(\sigma_i)\}, \beta) = \sum_{m=1}^{M} \prod_{i\in\partial m} [1 - P_i(W_{mi})] + \frac{1}{\beta} \sum_i \sum_{\sigma_i} P_i(\sigma_i) \ln P_i(\sigma_i). \qquad (11)

Different annealing schedules for the inverse temperature

Regarding the annealing process, this study employs two monotonic functions to structure the annealing schedule of β. The first, named the exponential scheduling, is utilized for an exponential decrease of the temperature T and is defined as follows:

\beta(t_s) = \exp\!\left( \frac{\ln\beta_{\max} - \ln\beta_{\min}}{N_{\rm step} - 1}\, t_s + \ln\beta_{\min} \right),

where t_s ∈ {0, 1, 2, ..., N_step − 1} represents the annealing step within a total of N_step steps, ensuring that β(0) = β_min and β(N_step − 1) = β_max. The second function is named the inverse-proportional scheduling.

Despite the substantial disparities in the number of variable states, the existence of constraints, and the nature of the objective function, the implementations for each problem differ by only a single line of code. This highlights the adaptability and efficiency of the approach in handling distinct optimization challenges. Please refer to Supplementary Materials for the codes with detailed explanations.

```python
import torch

def cut(W, p):
    return ((W @ p) * (1 - p)).sum((1, 2))

def S(p):
    return -(p * p.log()).sum(2).sum(1)

def argmax_cut(W, p):
    config = p.argmax(dim=2)
    s = torch.nn.functional.one_hot(config, num_classes=p.shape[2]).to(p.dtype)
    return config, cut(W, s) / 2

def balance(p):
    ...  # body elided in this excerpt

# ... (lines elided in this excerpt)
        F = -cut(W, p) - S(p) / beta
        if problem == 'bmincut':
            F = cut(W, p) + penalty * balance(p) - S(p) / beta
        optimizer.zero_grad()
        F.backward(gradient=torch.ones_like(F))
        optimizer.step()
    return argmax_cut(W, p)
```

Relationship to the existing mean-field annealing approaches

It is noteworthy that the exploration of mean-field theory coupled with an annealing scheme for COPs began in the late 20th century, as indicated in [66]. This approach has also been instrumental in deciphering the efficacy of recently introduced algorithms inspired by quantum dynamics, as discussed in [9]. Traditional mean-field annealing algorithms, those addressing the Ising problem, revolve around the iterative application of the mean-field equations (for reproducing the mean-field equations for the Ising problem from the FEM formalism, please refer to Supplementary Materials):

m_i = \tanh\Big( \beta \sum_j W_{ij}\, m_j \Big),
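The exponential schedule can be transcribed directly into a few lines of Python; the function name and the example values of β_min, β_max, and N_step below are ours, chosen only for illustration.

```python
import math

def beta_exponential(t_s, beta_min, beta_max, n_step):
    """Exponential annealing schedule: beta grows geometrically
    from beta_min at step 0 to beta_max at step n_step - 1."""
    slope = (math.log(beta_max) - math.log(beta_min)) / (n_step - 1)
    return math.exp(slope * t_s + math.log(beta_min))

beta_min, beta_max, n_step = 0.1, 100.0, 1000
print(beta_exponential(0, beta_min, beta_max, n_step))           # = beta_min, i.e. ~0.1
print(beta_exponential(n_step - 1, beta_min, beta_max, n_step))  # = beta_max, i.e. ~100.0
```

The endpoints follow immediately from the formula: at t_s = 0 the exponent reduces to ln β_min, and at t_s = N_step − 1 it reduces to ln β_max.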
Academy of Sciences. P. Z. is partially supported by the Innovation Program for Quantum Science and Technology project 2021ZD0301900.

∗ These three authors contributed equally
† panzhang@itp.ac.cn

[1] D. Du and P. M. Pardalos, Handbook of combinatorial optimization, Vol. 4 (Springer Science & Business Media, 1998).
[2] S. Arora and B. Barak, Computational complexity: a modern approach (Cambridge University Press, 2009).
[3] S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi, Optimization by simulated annealing, Science 220, 671 (1983).
[4] B. Selman, H. A. Kautz, B. Cohen, et al., Noise strategies for improving local search, AAAI 94, 337 (1994).
[5] F. Glover and M. Laguna, Tabu search (Springer, 1998).
[6] S. Boettcher and A. G. Percus, Optimization with extremal dynamics, Phys. Rev. Lett. 86, 5211 (2001).
[7] F. Barahona, On the computational complexity of Ising spin glass models, Journal of Physics A: Mathematical and General 15, 3241 (1982).
[8] E. S. Tiunov, A. E. Ulanov, and A. Lvovsky, Annealing by simulating the coherent Ising machine, Optics Express 27, 10288 (2019).
[9] A. D. King, W. Bernoudy, J. King, A. J. Berkley, and T. Lanting, Emulating the coherent Ising machine with a mean-field algorithm, arXiv preprint arXiv:1806.08422 (2018).
[10] H. Goto, K. Tatsumura, and A. R. Dixon, Combinatorial optimization by simulating adiabatic bifurcations in nonlinear Hamiltonian systems, Science Advances 5, eaav2372 (2019).
[11] H. Goto, K. Endo, M. Suzuki, Y. Sakai, T. Kanao, Y. Hamakawa, R. Hidaka, M. Yamasaki, and K. Tatsumura, High-performance combinatorial optimization based on classical mechanics, Science Advances 7, eabe7953 (2021).
[12] M. W. Johnson, M. H. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. J. Berkley, J. Johansson, P. Bunyk, et al., Quantum annealing with manufactured spins, Nature 473, 194 (2011).
[13] T. Inagaki, K. Inaba, R. Hamerly, K. Inoue, Y. Yamamoto, and H. Takesue, Large-scale Ising spin network based on degenerate optical parametric oscillators, Nature Photonics 10, 415 (2016).
[14] T. Honjo, T. Sonobe, K. Inaba, T. Inagaki, T. Ikuta, Y. Yamada, T. Kazama, K. Enbutsu, T. Umeki, R. Kasahara, et al., 100,000-spin coherent Ising machine, Science Advances 7, eabh0952 (2021).
[15] D. Pierangeli, G. Marcucci, and C. Conti, Large-scale photonic Ising machine by spatial light modulation, Phys. Rev. Lett. 122, 213902 (2019).
[16] A. Mallick, M. K. Bashar, D. S. Truesdell, B. H. Calhoun, S. Joshi, and N. Shukla, Using synchronized oscillators to compute the maximum independent set, Nature Communications 11, 4689 (2020).
[17] F. Cai, S. Kumar, T. Van Vaerenbergh, X. Sheng, R. Liu, C. Li, Z. Liu, M. Foltin, S. Yu, Q. Xia, et al., Power-efficient combinatorial optimization using intrinsic noise in memristor Hopfield neural networks, Nature Electronics 3, 409 (2020).
[18] N. A. Aadit, A. Grimaldi, M. Carpentieri, L. Theogarajan, J. M. Martinis, G. Finocchio, and K. Y. Camsari, Massively parallel probabilistic computing with sparse Ising machines, Nature Electronics 5, 460 (2022).
[19] N. Mohseni, P. L. McMahon, and T. Byrnes, Ising machines as hardware solvers of combinatorial optimization problems, Nature Reviews Physics 4, 363 (2022).
[20] G. Kochenberger, J.-K. Hao, F. Glover, M. Lewis, Z. Lü, H. Wang, and Y. Wang, The unconstrained binary quadratic programming problem: a survey, Journal of Combinatorial Optimization 28, 58 (2014).
[21] A. Lucas, Ising formulations of many NP problems, Frontiers in Physics 2, 5 (2014).
[22] E. Gardner, Spin glasses with p-spin interactions, Nuclear Physics B 257, 747 (1985).
[23] R. M. Karp, Reducibility among combinatorial problems (Springer, 2010).
[24] F.-Y. Wu, The Potts model, Rev. Mod. Phys. 54, 235 (1982).
[25] T. R. Jensen and B. Toft, Graph coloring problems (John Wiley & Sons, 2011).
[26] M. E. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences 103, 8577 (2006).
[27] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover Books on Computer Science (Dover Publications, 1998).
[28] M. Mézard, G. Parisi, and R. Zecchina, Analytic and algorithmic solution of random satisfiability problems, Science 297, 812 (2002).
[29] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[30] D. A. Chermoshentsev, A. O. Malyshev, M. Esencan, E. S. Tiunov, D. Mendoza, A. Aspuru-Guzik, A. K. Fedorov, and A. I. Lvovsky, Polynomial unconstrained binary optimisation inspired by optical simulation, arXiv preprint arXiv:2106.13167 (2021).
[31] C. Bybee, D. Kleyko, D. E. Nikonov, A. Khosrowshahi, B. A. Olshausen, and F. T. Sommer, Efficient optimization with higher-order Ising machines, Nature Communications 14, 6033 (2023).
[32] T. Kanao and H. Goto, Simulated bifurcation for higher-order cost functions, Applied Physics Express 16, 014501 (2022).
[33] S. Reifenstein, T. Leleu, T. McKenna, M. Jankowski, M.-G. Suh, E. Ng, F. Khoyratee, Z. Toroczkai, and Y. Yamamoto, Coherent SAT solvers: a tutorial, Advances in Optics and Photonics 15, 385 (2023).
[34] M. Mézard, G. Parisi, N. Sourlas, G. Toulouse, and M. Virasoro, Replica symmetry breaking and the nature of the spin glass phase, Journal de Physique 45, 843 (1984).
[35] M. Mézard, G. Parisi, and M. A. Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, Vol. 9 (World Scientific Publishing Company, 1987).
[36] D. Wu, L. Wang, and P. Zhang, Solving statistical mechanics using variational autoregressive networks, Phys. Rev. Lett. 122, 080602 (2019).
[37] M. Hibat-Allah, E. M. Inack, R. Wiersema, R. G. Melko, and J. Carrasquilla, Variational neural annealing, Nature Machine Intelligence 3, 952 (2021).
[38] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019).
[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467 (2016).
[40] Y. Y. Boykov and M.-P. Jolly, Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images, in Proceedings eighth IEEE international conference on computer vision. ICCV 2001, Vol. 1 (IEEE, 2001) pp. 105-112.
[41] F. Barahona, M. Grötschel, M. Jünger, and G. Reinelt, An application of combinatorial optimization to statistical physics and circuit layout design, Operations Research 36, 493 (1988).
[42] G. Facchetti, G. Iacono, and C. Altafini, Computing global structural balance in large-scale signed social networks, Proceedings of the National Academy of Sciences 108, 20953 (2011).
[43] F. Böhm, T. V. Vaerenbergh, G. Verschaffelt, and G. Van der Sande, Order-of-magnitude differences in computational performance of analog Ising machines induced by the choice of nonlinearity, Communications Physics 4, 149 (2021).
[44] G. Rinaldi, rudy graph generator, http://www-user.tu-chemnitz.de/~helmberg/rudy.tar.gz.
[45] T. Inagaki, Y. Haribara, K. Igarashi, T. Sonobe, S. Tamate, T. Honjo, A. Marandi, P. L. McMahon, T. Umeki, K. Enbutsu, et al., A coherent Ising machine for 2000-node optimization problems, Science 354, 603 (2016).
[46] Y. Ye, G-set test problems, https://web.stanford.edu/~yyye/yyye/Gset/.
[47] C. Walshaw, The graph partitioning archive, https://chriswalshaw.co.uk/partition/.
[48] M. J. Schuetz, J. K. Brubaker, and H. G. Katzgraber, Combinatorial optimization with physics-inspired graph neural networks, Nature Machine Intelligence 4, 367 (2022).
[49] H. Ushijima-Mwesigwa, C. F. Negre, and S. M. Mniszewski, Graph partitioning using quantum annealing on the D-Wave system, in Proceedings of the Second International Workshop on Post Moores Era Supercomputing (2017) pp. 22-29.
[50] P. W. Holland, K. B. Laskey, and S. Leinhardt, Stochastic blockmodels: First steps, Social Networks 5, 109 (1983).
[51] B. Karrer and M. E. Newman, Stochastic blockmodels and community structure in networks, Phys. Rev. E 83, 016107 (2011).
[52] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM Journal on Scientific Computing 20, 359 (1998).
[53] P. Sanders and C. Schulz, Think locally, act globally: Highly balanced graph partitioning, in International Symposium on Experimental Algorithms (Springer, 2013) pp. 164-175.
[54] S. Acer, E. G. Boman, C. A. Glusa, and S. Rajamanickam, Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems, Parallel Computing 106, 102769 (2021).
[55] J. Chuzhoy, Y. Gao, J. Li, D. Nanongkai, R. Peng, and T. Saranurak, A deterministic algorithm for balanced cut with applications to dynamic connectivity, flows, and beyond, in 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) (IEEE, 2020) pp. 1158-1167.
[56] W. K. Lam, Hardware design verification: simulation and formal method-based approaches (Prentice Hall Modern Semiconductor Design Series) (Prentice Hall PTR, 2005).
[57] S. Patil and D. Kulkarni, K-way spectral graph partitioning for load balancing in parallel computing, International Journal of Information Technology 13, 1893 (2021).
[58] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008, P10008 (2008).
[59] The graph partitioning framework KaHIP, https://github.com/KaHIP/KaHIP.
[60] B. Molnár, F. Molnár, M. Varga, Z. Toroczkai, and M. Ercsey-Ravasz, A continuous-time MaxSAT solver with high analog performance, Nature Communications 9, 4864 (2018).
[61] Eleventh Max-SAT Evaluation, http://www.maxsat.udl.cat/16/benchmarks/index.html.
[62] S. A. Cook, The complexity of theorem-proving procedures, in Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC '71 (Association for Computing Machinery, New York, NY, USA, 1971) pp. 151-158.
[63] S. Cook, The P versus NP problem, Clay Mathematics Institute 2, 6 (2000).
[64] T. Tieleman, G. Hinton, et al., Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning 4, 26 (2012).
[65] Optim tools of PyTorch, https://pytorch.org/docs/stable/optim.html.
[66] G. Bilbro, R. Mann, T. Miller, W. Snyder, D. van den Bout, and M. White, Optimization by mean field annealing, Advances in Neural Information Processing Systems 1 (1988).
[67] METIS software package: version 5.1.0, http://glaros.dtc.umn.edu/gkhome/metis/metis/download.
[68] P. Sanders and C. Schulz, KaHIP v3.00 - Karlsruhe High Quality Partitioning - user guide, arXiv preprint arXiv:1311.1714 (2013).
[69] J. Fujisaki, H. Oshima, S. Sato, and K. Fujii, Practical and scalable decoder for topological quantum error correction with an Ising machine, Phys. Rev. Research 4, 043086 (2022).
[70] J. Fujisaki, K. Maruyama, H. Oshima, S. Sato, T. Sakashita, Y. Takeuchi, and K. Fujii, Quantum error correction with an Ising machine under circuit-level noise, Phys. Rev. Research 5, 043261 (2023).
Derivation of variational mean-field free energy formulations for the benchmarking problems
and it suffices to derive the formulae for different mean-field internal energies UMF .
For the maximum cut (MaxCut) problem, the energy function can be defined as the negation of the cut value,
E(\sigma) = -\sum_{(i,j)\in E} W_{ij}\,[1 - \delta(\sigma_i, \sigma_j)], \qquad (S3)

where E is the edge set of the graph, W_{ij} is the weight of edge (i, j), and δ(σ_i, σ_j) stands for the delta function, which takes value 1 if σ_i = σ_j and 0 otherwise. Then the corresponding mean-field internal energy reads

U_{\rm MF}^{\rm MaxCut} = \sum_{\sigma} \prod_i P_i(\sigma_i) \Big( -\sum_{(i,j)\in E} W_{ij}\,[1 - \delta(\sigma_i, \sigma_j)] \Big) \qquad (S4)
= -\sum_{(i,j)\in E} \sum_{\sigma_i, \sigma_j} W_{ij}\, P_i(\sigma_i)\, P_j(\sigma_j)\,[1 - \delta(\sigma_i, \sigma_j)] \qquad (S5)
= -\sum_{(i,j)\in E} \sum_{\sigma_i} W_{ij}\, P_i(\sigma_i)\,[1 - P_j(\sigma_i)]. \qquad (S6)
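The reduction from (S4) to (S6) can be sanity-checked by brute force on a small graph. The sketch below is our own plain-Python illustration (arbitrary edge weights, random marginals): it compares the closed form (S6) against the exact expectation of the energy (S3) under the product distribution ∏_i P_i(σ_i).

```python
import itertools
import random

random.seed(0)
q = 2
edges = {(0, 1): 1.0, (1, 2): 0.5, (0, 2): 2.0}   # W_ij for a small triangle graph

# random normalized marginals P_i(sigma_i)
P = []
for _ in range(3):
    r = [random.random() for _ in range(q)]
    z = sum(r)
    P.append([x / z for x in r])

# closed form (S6)
u_mf = -sum(w * sum(P[i][s] * (1 - P[j][s]) for s in range(q))
            for (i, j), w in edges.items())

# brute-force expectation of (S3) under the product measure
u_exact = 0.0
for sigma in itertools.product(range(q), repeat=3):
    prob = P[0][sigma[0]] * P[1][sigma[1]] * P[2][sigma[2]]
    energy = -sum(w * (1 - (sigma[i] == sigma[j])) for (i, j), w in edges.items())
    u_exact += prob * energy

print(abs(u_mf - u_exact) < 1e-9)  # True: the two expressions agree
```

The agreement is exact because, for each edge, the expectation of 1 − δ(σ_i, σ_j) factorizes into Σ_σ P_i(σ)[1 − P_j(σ)] under the mean-field ansatz.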
The energy function designed for the bMinCut problem given in the main text is

E(\sigma) = \sum_{(i,j)\in E} W_{ij}\,[1 - \delta(\sigma_i, \sigma_j)] + \lambda \sum_i \sum_{j\neq i} \delta(\sigma_i, \sigma_j). \qquad (S8)

Thus, we have

F_{\rm MF}^{\rm bMinCut}(\{P_i(\sigma_i)\}, \beta) = \sum_{(i,j)\in E} \sum_{\sigma_i} W_{ij}\, P_i(\sigma_i)\,[1 - P_j(\sigma_i)] + \lambda \sum_{\sigma_i} \Big[ \sum_{i,j} P_i(\sigma_i)\, P_j(\sigma_i) - \sum_i P_i^2(\sigma_i) \Big] + \frac{1}{\beta} \sum_i \sum_{\sigma_i} P_i(\sigma_i) \ln P_i(\sigma_i). \qquad (S12)
The cost function designed for the Max-SAT problem given in the main text reads

E(\sigma) = \sum_{m=1}^{M} \prod_{i\in\partial m} [1 - \delta(W_{mi}, \sigma_i)]. \qquad (S13)

Thus, we have

F_{\rm MF}^{\rm MaxSAT}(\{P_i(\sigma_i)\}, \beta) = \sum_{m=1}^{M} \prod_{i\in\partial m} [1 - P_i(W_{mi})] + \frac{1}{\beta} \sum_i \sum_{\sigma_i} P_i(\sigma_i) \ln P_i(\sigma_i). \qquad (S17)
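The clause term in (S17) factorizes over the literals. The brute-force check below (our own plain-Python illustration on a single 3-literal clause with random marginals) confirms that ∏_{i∈∂m}[1 − P_i(W_mi)] equals the probability that the clause is unsatisfied under the product measure.

```python
import itertools
import random

random.seed(1)
# one clause over three Boolean variables; W[i] = 0 means the literal on i is negated
W = [1, 0, 1]
P = []
for _ in range(3):
    p1 = random.random()
    P.append([1 - p1, p1])   # [P_i(0), P_i(1)]

# factorized clause term from (S17)
factorized = 1.0
for i in range(3):
    factorized *= 1 - P[i][W[i]]

# exact probability that no literal of the clause is satisfied (sigma_i != W_i for all i)
exact = sum(P[0][s0] * P[1][s1] * P[2][s2]
            for s0, s1, s2 in itertools.product((0, 1), repeat=3)
            if all(s != w for s, w in zip((s0, s1, s2), W)))

print(abs(factorized - exact) < 1e-9)  # True
```

For Boolean variables only one assignment per variable falsifies a literal, so the surviving term of the sum is exactly the product of the 1 − P_i(W_mi) factors.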
The following shows the PyTorch code, with detailed annotations, implementing FEM for the MaxCut and bMinCut problems using automatic differentiation. Remarkably, despite the substantial disparities in the number of variable states, the existence of constraints, and the characteristics of the objective function between these two problems, the coding implementations vary by only a single line.

```python
import torch

# ... (lines elided in this excerpt)

def S(p):
    """Calculate the entropy of the marginal matrix p."""
    return -(p * p.log()).sum(2).sum(1)

# ... (lines elided in this excerpt)

n, m, W = read_graph('G1', index_start=1)  # data file: https://web.stanford.edu/~yyye/yyye/Gset/
beta_range = 1 / torch.linspace(10.0, 0.01, 1100)
configs, cuts = solve('maxcut', W, 1000, 2, 5.0, beta_range)
ind = torch.argmax(cuts)
print(configs[ind][:, 0], cuts[ind])
```
The key step in FEM involves computing the gradients of F_MF with respect to the local fields {h_i(σ_i)}, denoted {g_i^h(σ_i)}. This task can be accomplished by leveraging automatic differentiation. Alternatively, we can write down the explicit gradient formula for each problem at hand: once the specific form of E(σ) is known, the form of F_MF is determined and, thanks to the mean-field ansatz, an explicit formula for the gradients of F_MF with respect to {P_i(σ_i)} can be obtained, denoted {g_i^p(σ_i)}. The benefits of obtaining {g_i^p(σ_i)} are twofold. First, explicit gradient computation can lead to substantial time savings by eliminating the need for forward-propagation calculations; our numerical experiments show that explicit gradient formulations can reduce computational time by half. Second, it enables problem-dependent gradient manipulations based on {g_i^p(σ_i)}, denoted {ĝ_i^p(σ_i)}, which enhance numerical stability and facilitate smoother optimization within the gradient-descent framework, beyond the conventional adaptive learning rates and momentum techniques of machine-learning optimizers.
Hence, in this work, we mainly adopt the explicit gradient approach for benchmarking FEM, and we use the manipulated gradients {ĝ_i^p(σ_i)} to compute {g_i^h(σ_i)}. According to the chain rule for partial derivatives, {g_i^h(σ_i)} can be computed from

g_i^h(\sigma_i) = \sum_{\sigma_i'} \frac{\partial F_{\rm MF}}{\partial P_i(\sigma_i')}\, \frac{\partial P_i(\sigma_i')}{\partial h_i(\sigma_i)}, \qquad (S18)

and since P_i(\sigma_i) = e^{h_i(\sigma_i)}/Z, where Z = \sum_{\sigma_i'} e^{h_i(\sigma_i')} is the normalization factor, we also have

\frac{\partial P_i(\sigma_i')}{\partial h_i(\sigma_i)} =
\begin{cases}
-P_i(\sigma_i')\, P_i(\sigma_i), & \sigma_i' \neq \sigma_i, \\
P_i(\sigma_i)\,[1 - P_i(\sigma_i)], & \sigma_i' = \sigma_i.
\end{cases} \qquad (S19)

Therefore, by substituting Eq. (S19) into Eq. (S18) and employing the modified gradient variable \hat g_i^p(\sigma_i') in place of \partial F_{\rm MF}/\partial P_i(\sigma_i') (instead of using g_i^p(\sigma_i') directly), we have the following unified form for \{g_i^h(\sigma_i)\}:

g_i^h(\sigma_i) = \gamma_{\rm grad} \Big[ \hat g_i^p(\sigma_i) - \sum_{\sigma_i'=1}^{q} P_i(\sigma_i')\, \hat g_i^p(\sigma_i') \Big]\, P_i(\sigma_i), \qquad (S20)

where \gamma_{\rm grad} is a hyperparameter that controls the magnitudes of \{g_i^h(\sigma_i)\} to accommodate different optimizers. Once we have the values of \{P_i(\sigma_i)\} and \{\hat g_i^p(\sigma_i)\} (also computed using \{P_i(\sigma_i)\}), we can obtain \{g_i^h(\sigma_i)\} immediately according to Eq. (S20).
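Equation (S20) with γ_grad = 1 and ĝ = g is simply the chain rule through the softmax parametrization P_i = e^{h_i}/Z. The sketch below is our own plain-Python illustration (not the paper's code): it verifies the formula against finite differences for a toy linear objective F(P) = Σ_s a_s P_s.

```python
import math

def softmax(h):
    m = max(h)
    e = [math.exp(x - m) for x in h]
    z = sum(e)
    return [x / z for x in e]

def grad_h_from_grad_p(g_p, P, gamma=1.0):
    """Eq. (S20): g_h(s) = gamma * (g_p(s) - sum_s' P(s') g_p(s')) * P(s)."""
    avg = sum(P[s] * g_p[s] for s in range(len(P)))
    return [gamma * (g_p[s] - avg) * P[s] for s in range(len(P))]

# toy objective F(P) = sum_s a_s P_s, so dF/dP_s = a_s
a = [0.3, -1.2, 0.7]
h = [0.1, 0.5, -0.4]
P = softmax(h)
g_h = grad_h_from_grad_p(a, P)

# finite-difference check of dF/dh_s
eps = 1e-6
for s in range(3):
    hp = h[:]; hp[s] += eps
    hm = h[:]; hm[s] -= eps
    fd = (sum(ai * pi for ai, pi in zip(a, softmax(hp)))
          - sum(ai * pi for ai, pi in zip(a, softmax(hm)))) / (2 * eps)
    assert abs(fd - g_h[s]) < 1e-6
print("Eq. (S20) matches finite differences")
```

The projection term Σ_{s'} P(s') g_p(s') reflects the fact that constants added to the gradient in P-space have no effect on the fields, which is exactly why the constant discussed below Eq. (S21) cancels.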
The explicit gradients and manipulated gradients for the benchmarking problems
The explicit gradients of {gip (σi )} for the MaxCut problem can be derived analytically from FMF
MaxCut
, as
X 1
gip (σi ) = ∇Pi (σi ) FMF
MaxCut
= Wi j [2P j (σi ) − 1] + [ln Pi (σi ) + 1] . (S21)
j
β
However, using Eq. (S21) or Eq. (S24) will result in the same result for computing ghi (σi ), for the reason that the constant
P
− j Wi j in Eq. (S21) for each index i will be canceled when computing the gradients for the local fields using Eq. (S20).
The manipulated gradients can then be designed as
\[
\hat{g}_i^p(\sigma_i) = c_i \sum_{j} W_{ij}\, e_j(\sigma_i) + \frac{1}{\beta}\left[\ln P_i(\sigma_i) + 1\right] \,, \tag{S25}
\]
where two modifications of the gradient have been made to improve the optimization performance. First, [e_j(1), e_j(2), . . . , e_j(q)] is the one-hot vector (of length 2 in the MaxCut problem) corresponding to [P_j(1), P_j(2), . . . , P_j(q)]. Replacing {P_j(σ_j)} with {e_j(σ_j)} is called discretization; it reduces the analog errors introduced by Σ_j W_ij P_j(σ_j) in the explicit gradients. Note that similar numerical tricks have also been employed in previous work [11]. Second, when the graph is inhomogeneous in node degrees or edge weights, the magnitude of Σ_j W_ij e_j(σ_j) can differ significantly between spin variables. Hence, the gradient normalization factor c_i enables robust optimization and better numerical performance.
For the MaxCut problem, the gradient normalization factor is set to c_i = 1/Σ_j |W_ij| in this work. In this context, the L1 norm Σ_j |W_ij| normalizes the gradient magnitude for each spin, ensuring that the values of Σ_j W_ij e_j(σ_j) across all spins remain below one. This normalization is critical, since the range of the entropy-term gradient, ln P_i(σ_i) + 1, is consistent across all spins. In our experiments, we found that constraining the gradient magnitude of the internal energy in this way prevents the system from becoming ensnared in local minima. Although other choices of normalization factor could be employed, we observed that our current settings yield satisfactory performance in our numerical experiments.
Similar modifications can be made for better optimization on graphs with different topologies, as was done for the MaxCut problem. We have the following manipulated gradients,
\[
\hat{g}_i^p(\sigma_i) = -c_i^f \sum_{j} W_{ij}\, e_j(\sigma_i) + \lambda c_i^a \Big[ \sum_{j} e_j(\sigma_i) - e_i(\sigma_i) \Big] + \frac{1}{\beta}\left[\ln P_i(\sigma_i) + 1\right] \,, \tag{S27}
\]
where c_i^f and c_i^a are the gradient normalization factors for the ferromagnetic and antiferromagnetic terms, respectively. The {e_i(σ_i)} are again the one-hot vectors for {P_i(σ_i)}, which serve to mitigate the analog error introduced in the explicit gradients, as was done for the MaxCut problem.
For the bMinCut problem, analogously to the approach taken for the MaxCut problem, the ferromagnetic normalization factor is defined as c_i^f = q/Σ_j W_ij (in the bMinCut problem, W_ij > 0), while the antiferromagnetic normalization factor is c_i^a = q/√(Σ_j W_ij). The rationale behind the setting for c_i^a stems from the intuition that spins with larger values of Σ_j W_ij should remain in their current states, i.e., they should not be significantly influenced by the antiferromagnetic force to transition into other states. Although these settings for the normalization factors may not be optimal, and alternative schemes could be implemented, we have observed that they yield satisfactory performance in our numerical experiments.
Note that, for each m in the summation, the spin variable σ_i must appear as a literal in the clause C_m; otherwise, the gradients of the internal energy with respect to σ_i are zero for this clause. For the random Max k-SAT problems benchmarked in this work, we make no modifications and therefore set ĝ_i^p(σ_i) equal to g_i^p(σ_i).
To facilitate an efficient implementation, we can simplify FEM's approach for solving the Ising problem, since the gradients for {h_i(+1)} and {h_i(−1)} are not independent. In the MaxCut problem with q = 2 (σ_i = +1 or σ_i = −1), it is straightforward to prove from Eq. (S20) that g_i^h(+1) = −g_i^h(−1) = γ_grad [ĝ_i^p(+1) − ĝ_i^p(−1)] P_i(+1) P_i(−1). Given that P_i(+1) = 1 − P_i(−1) = e^{h_i(+1)}/(e^{h_i(+1)} + e^{h_i(−1)}) = sigmoid(h_i(+1) − h_i(−1)), we actually only need to update {P_i(+1)} (or simply {P_i}), saving considerable computational resources. Thus, we introduce the new local-field variables {h_i} to replace {h_i(+1) − h_i(−1)}, such that P_i = sigmoid(h_i).
According to Eq. (S25), ĝ_i^p(+1) = c_i Σ_j W_ij e_j(+1) + (1/β)[ln P_i(+1) + 1] and ĝ_i^p(−1) = c_i Σ_j W_ij e_j(−1) + (1/β)[ln P_i(−1) + 1]. Based on g_i^h(+1) = γ_grad [ĝ_i^p(+1) − ĝ_i^p(−1)] P_i(+1) P_i(−1), the gradient with respect to the local fields {h_i} can be written as
\[
g_i^h = \gamma_{\mathrm{grad}} \Big[ c_i \sum_j W_{ij}\,\big( e_j(+1) - e_j(-1) \big) + \frac{1}{\beta} \ln \frac{P_i(+1)}{P_i(-1)} \Big] P_i(+1)\, P_i(-1) \,. \tag{S29}
\]
We can further simplify Eq. (S29) by introducing the magnetization m_i = 2P_i − 1 = tanh(h_i/2), such that
\[
g_i^h = \frac{\gamma_{\mathrm{grad}}}{4} \Big[ c_i \sum_j W_{ij}\, \mathrm{sgn}(m_j) + \frac{h_i}{\beta} \Big] \big(1 - m_i^2\big) \,, \tag{S30}
\]
where sgn(·) is the sign function, and the identity arctanh(x) = (1/2) ln[(1 + x)/(1 − x)] has been used in the simplification. Thus, we have used {m_i} and {h_i} to simplify the original gradients, requiring optimization of only half the variational variables compared with the non-simplified case.
The numerical experiment of the MaxCut problem on the complete graph K2000
For the numerical experiment of the MaxCut problem on the complete graph K2000, shown in Fig. 3(a) of the main text (where we evaluate the performance by varying only the total number of annealing steps N_step while keeping the other hyperparameters unchanged), we implemented the dSBM algorithm, which simulates the following Hamiltonian equations of motion [11]:
\[
y_i(t_{k+1}) = y_i(t_k) + \Big\{ -\big[a_0 - a(t_k)\big]\, x_i(t_k) + c_0 \sum_{j=1}^{N} W_{ij}\, \mathrm{sgn}\big[x_j(t_k)\big] \Big\} \Delta_t \,, \tag{S31}
\]
\[
x_i(t_{k+1}) = x_i(t_k) + a_0\, y_i(t_{k+1})\, \Delta_t \,, \tag{S32}
\]
where x_i and y_i represent the position and momentum of the particle corresponding to the i-th spin in an N-particle dynamical system, Δ_t is the time step, t_k is the discrete time with t_{k+1} = t_k + Δ_t, W is the edge-weight matrix, a(t_k) is the bifurcation parameter, linearly increased from 0 to a_0 = 1, and c_0 = 0.5 √[(N − 1)/Σ_{i,j} W_ij²] according to the settings in Ref. [11]. In addition, at every t_k, if |x_i| > 1, we set x_i = sgn(x_i) and y_i = 0. For the dSBM benchmarks on K2000, we set Δ_t = 1.25 following the recommended settings in Ref. [11], and the initial values of x_i and y_i are randomly drawn from the range [−0.1, 0.1]. Regarding FEM, we employ the explicit gradient formulations, setting γ_grad = 1, T_max = 1.16, T_min = 6e−5, and utilize the inverse-proportional scheduling for annealing. We employ RMSprop as the optimizer, with the optimizer hyperparameters alpha, momentum, weight decay, and learning rate set to 0.56, 0.63, 0.013, and 0.03, respectively. Both dSBM and FEM were executed on a GPU.
For the benchmarks on the G-set problems, we present the detailed TTS results obtained by FEM in Tab. S1, along with a comparison to the reported data for dSBM. Given FEM's capability of optimizing many replicas in parallel, we assess the TTS using the batch processing method introduced in Ref. [11]. All parameter settings for FEM are listed in Tab. S2. We also utilized the same GPU as in Ref. [11] for implementing FEM in this benchmark. Since we adopt the batch processing method of Ref. [11] for calculating TTS, for an accurate comparison, the values of N_rep for each instance shown in Tab. S2 are consistent with those used for dSBM in Ref. [11]. Throughout the benchmarking, we initialize the local fields with random values according to h_i(σ_i)_ini = 0.001 · randn, where randn represents a random number sampled from the standard Gaussian distribution. All variables are represented using 32-bit single-precision floating-point numbers.
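The initialization and the annealing schedule can be sketched as follows. Note that the exact formula of the "inverse-proportional" schedule is not spelled out in this section; the version below assumes that the inverse temperature β = 1/T is increased linearly from 1/T_max to 1/T_min, which is one plausible reading (the function names are our own):

```python
import numpy as np

def init_local_fields(N, q, seed=0):
    """h_i(σ)_ini = 0.001 * randn, stored in 32-bit single precision."""
    rng = np.random.default_rng(seed)
    return (0.001 * rng.standard_normal((N, q))).astype(np.float32)

def inverse_proportional_T(step, n_steps, T_max, T_min):
    """Assumed inverse-proportional schedule: β = 1/T grows linearly
    from 1/T_max to 1/T_min over the annealing steps."""
    beta = 1.0 / T_max + (step / (n_steps - 1)) * (1.0 / T_min - 1.0 / T_max)
    return 1.0 / beta

h = init_local_fields(N=5, q=2)
assert h.dtype == np.float32 and np.abs(h).max() < 0.01
```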
TABLE S1: Benchmarking results on the G-Set instances. The TTS is defined as the time taken to achieve the best cut found by
the solver, and the parentheses for the TTS results indicate that the best cut for computing TTS is not the best known cut. Shorter
TTS results are highlighted in bold.
Graph type | Instance | N | Best known | FEM: Best cut, TTS(ms), Ps | dSBM: Best cut, TTS(ms), Ps
G1 800 11624 11624 24.9 100.0% 11624 33.3 98.7%
G2 800 11620 11620 96.5 99.6% 11620 239 82%
G3 800 11622 11622 23.3 100.0% 11622 46.2 99.6%
G4 800 11646 11646 19.8 99.5% 11646 34.4 98.3%
G5 800 11631 11631 21.6 98.9% 11631 58.6 97.2%
Random G6 800 2178 2178 5.6 95.5% 2178 6.3 97.9%
G7 800 2006 2006 11.5 98.6% 2006 6.85 97.4%
G8 800 2005 2005 21.3 98.5% 2005 11.9 95.4%
G9 800 2054 2054 35.6 98.8% 2054 36 86.7%
G10 800 2000 2000 193 53.2% 2000 47.7 40.7%
G11 800 564 564 24.2 99.0% 564 3.49 98%
Toroidal G12 800 556 556 31.2 97.8% 556 5.16 97.3%
G13 800 582 582 203 63.9% 582 11.9 99.6%
G14 800 3064 3064 2689 36.5% 3064 71633 0.5%
G15 800 3050 3050 164 96.5% 3050 340 80.4%
G16 800 3052 3052 165 99.8% 3052 347 99.2%
G17 800 3047 3047 800 70.2% 3047 1631 28.3%
Planar
G18 800 992 992 264 37.9% 992 375 7.4%
G19 800 906 906 17.5 98.8% 906 17.8 99.5%
G20 800 941 941 8.5 99.2% 941 9.02 98%
G21 800 931 931 67.3 34% 931 260 13.6%
G22 2000 13359 13359 917 56.3% 13359 429 92.8%
G23 2000 13344 13342 (98) 36.9% 13342 (89) -
G24 2000 13337 13337 1262 92.4% 13337 459 64.8%
G25 2000 13340 13340 5123 31.9% 13340 2279 39.9%
G26 2000 13328 13328 991 83.2% 13328 476 64.3%
Random
G27 2000 3341 3341 127 90.1% 3341 49.9 97.1%
G28 2000 3298 3298 306 82.7% 3298 87.2 95.2%
G29 2000 3405 3405 200 98.6% 3405 221 73.7%
G30 2000 3413 3413 948 64.7% 3413 439 73.8%
G31 2000 3310 3310 4523 19.6% 3310 1201 19.9%
G32 2000 1410 1410 23749 1.3% 1410 3622 9.3%
Toroidal G33 2000 1382 1382 659607 0.6% 1382 57766 0.5%
G34 2000 1384 1384 12643 28.1% 1384 2057 23.1%
G35 2000 7687 7686 (5139390) 0.01% 7686 (8319000) -
G36 2000 7680 7680 5157009 0.01% 7680 62646570 0.01%
G37 2000 7691 7690 (3509541) 0.01% 7691 27343457 0.02%
G38 2000 7688 7688 41116 7.3% 7688 98519 6.8%
Planar
G39 2000 2408 2408 12461 17.5% 2408 56013 10.7%
G40 2000 2400 2400 3313 54.1% 2400 24131 15.4%
G41 2000 2405 2405 1921 80.9% 2405 10585 28.2%
G42 2000 2481 2481 91405 0.23% 2480 (550000) -
G43 1000 6660 6660 19.8 66.1% 6660 5.86 99.2%
G44 1000 6650 6650 13.2 80.1% 6650 6.5 98.5%
Random G45 1000 6654 6654 35 98.7% 6654 43.4 98.5%
G46 1000 6649 6649 141 69.8% 6649 16 99.2%
G47 1000 6657 6657 33.9 98.7% 6657 44.8 98.2%
G48 3000 6000 6000 0.35 95.3% 6000 0.824 100.0%
to be continued...
TABLE S2: The hyperparameter settings for FEM in the TTS benchmarking on G-set instances. Here, we utilize different optimizers, SGD and RMSprop, for different sets of instances based on their performance; the hyperparameters of the different optimizers (lr, alpha, dampening, weight decay, momentum) follow their standard definitions. The inverse-proportional scheduling is used for the annealing. The meanings of N_step, N_batch and N_rep are the same as in the TTS experiments documented in Ref. [11]. For N_batch in FEM, we refer to the number of replicas.
Ins. | Tmax | Tmin | γgrad | Optimizer | lr | alpha | dampening | weight decay | momentum | Nstep | Nbatch | Nrep
G1 0.5 8e-5 1 RMSprop 0.2 0.623 - 0.02 0.693 1000 130 1000
G2 0.2592 6.34e-4 1 RMSprop 0.0717 0.5485 - 0.0264 0.9082 5000 100 1000
G3 0.264 1.1e-3 1 RMSprop 0.3174 0.7765 - 0.00672 0.7804 1000 120 1000
G4 0.29 8.9e-4 1 RMSprop 0.2691 0.4718 - 0.00616 0.7414 800 130 1000
G5 0.2 9e-4 1 RMSprop 0.24 0.9999 - 0.0056 0.8215 1000 110 1000
G6 0.44 1.7e-3 1 RMSprop 0.534 0.6045 - 0.00657 0.4733 1000 20 1000
G7 0.54 1.8e-3 1 RMSprop 0.452 0.8966 - 0.0087 0.632 700 80 1000
G8 0.19 7.92e-4 1 RMSprop 0.296 0.9999 - 0.00731 0.737 1000 100 1000
G9 0.208 9e-4 1 RMSprop 0.305 0.9999 - 0.00205 0.718 2500 70 1000
G10 1.28 5.21e-6 0.75 SGD 1.2 - 0.082 0.03 0.88 2000 100 1000
G11 1.28 4.96e-6 0.98 SGD 1.2 - 0.13 0.061 0.88 1800 120 1000
G12 1.28 7.8e-6 0.65 SGD 1.98 - 0.13 0.06 0.88 1600 140 1000
G13 1.28 3.12e-6 1.7 SGD 3 - 0.082 0.033 0.76 3000 130 1000
G14 0.387 8.64e-4 1 RMSprop 0.44 0.9999 - 0.0089 0.793 7000 250 1000
G15 0.5 1e-3 1 RMSprop 0.45 0.9999 - 0.0056 0.7327 4000 200 1000
G16 0.54 8.1e-4 1 RMSprop 0.288 0.9999 - 0.00756 0.7877 7000 160 1000
G17 0.253 1.06e-3 1 RMSprop 0.631 0.9999 - 0.01341 0.7642 7000 200 1000
G18 0.4 1e-3 1 RMSprop 0.345 0.99 - 0.01 0.9 1200 150 1000
G19 0.962 3.98e-6 1.75 SGD 4.368 - 0.05175 0.01336 0.729 1700 85 1000
G20 0.37 9.4e-4 1.55 RMSprop 1.38 0.9089 - 0.00445 0.8186 500 100 1000
G21 0.6 9.6e-4 1 RMSprop 0.33 0.9999 - 0.0092 0.692 1000 40 1000
G22 0.352 2.4e-4 1 RMSprop 0.481 0.9999 - 0.00382 0.7166 4700 90 1000
G23 0.406 1.15e-6 2.72 SGD 8.042 - 0.1443 0.00184 0.714 3200 10 1000
G24 0.528 1.6e-4 1 RMSprop 0.39 0.9999 - 0.00413 0.74 7000 250 1000
G25 0.4 4.83e-6 5.33 SGD 3.66 - 0.0905 0.00987 0.672 7000 200 1000
G26 0.361 4.43e-6 2.18 SGD 8.46 - 0.0612 0.0078 0.714 6000 200 1000
G27 0.28 5e-4 1 RMSprop 0.7 0.9995 - 0.00575 0.78 2000 80 1000
G28 0.32 5e-4 1 RMSprop 0.69 0.999 - 0.006 0.78 3000 100 1000
G29 0.38 2.7e-4 1 RMSprop 0.44 0.9999 - 0.013 0.7 4000 120 1000
G30 0.96 4.92e-6 1.9 SGD 2.59 - 0.05 0.053 0.715 7000 100 1000
G31 1.834 2.76e-6 1.32 SGD 1.38 - 0.0104 0.083 0.7566 7000 100 1000
G32 0.89 1.42e-5 3.17 SGD 1.67 - 0.1285 0.018 0.9 12000 20 1000
G33 0.605 7.8e-6 2 SGD 4.05 - 0.098 0.0366 0.91 12000 260 1000
G34 0.605 6.24e-6 2.33 SGD 2.638 - 0.1182 0.0384 0.8967 12000 260 1000
to be continued...
We explored how varying the number of replicas impacts the distribution of cut values among replicas for the MaxCut problem on G55 in the G-set dataset, a random graph with 5000 nodes and {+1, −1} edge weights. After optimizing FEM's hyperparameters, we incrementally increased the number of replicas R and examined the changes in the cut-value distribution. Histograms of the cut values for different R are shown in Fig. S1(a). As shown in Fig. S1(b), the average cut value remains stable as R varies across several orders of magnitude, and the standard deviation is also quite stable, as shown in Fig. S1(c). As a consequence, the maximum cut value achieved by FEM is an increasing function of R: Fig. S1(a) clearly shows that the maximum cut value of FEM approaches the best-known result for G55 as R increases. These findings also suggest that the hyperparameters of FEM can be fine-tuned using the mean cut value at a small R, while the final results can be obtained using a large R with the fine-tuned parameters.
FIG. S1. The cut values as a function of the number of replicas R used in FEM for the MaxCut problem on the G55 instance. (a) Histograms of the cut values for different R. (b) The maximum, minimum, and mean cut values as a function of R. (c) The standard deviation of the cut-value distribution over replicas as R changes.
To elucidate the principles of FEM in tackling the q-way bMinCut problem in the main text, we first provide an example demonstrating the numerical details of our experiments. We employ the real-world graph named 3elt as a demonstrative case study. This graph, collected in Chris Walshaw's graph partitioning archive [47], comprises 4,720 nodes and 13,722 edges, and we specifically address its 4-way bMinCut problem as an illustrative example. Fig. S2(a) shows the evolution of the marginal probabilities of the 4 states for a typical variable σ_i. From the figure, we can identify three optimization stages: at early annealing steps, the 4 marginal probabilities {P_i(σ_i = 1), P_i(σ_i = 2), P_i(σ_i = 3), P_i(σ_i = 4)} all remain very close to the initial value 0.25; as the annealing step increases, the marginal probabilities begin to fluctuate; finally they converge, with one probability converging to unity and the other three to 0. This indicates that during annealing and the minimization of the mean-field free energy, each marginal probability evolves towards localizing on a single state. In Fig. S2(b), we plot the evolution of the cut values and the largest group sizes (the largest size among all groups) averaged over R = 1000 replicas of the mean-field approximation. In our approach, the hyperparameter λ is linearly increased from 0 to λ_max with the annealing step, so that states minimizing the cut value are searched preferentially in the initial optimization stage. From the figure, we can see that our algorithm first searches for a low-cut but unbalanced solution with one large group and three small groups, then gradually decreases the maximum group size, and finally finds a balanced solution with a global minimum cut.
FIG. S2. Illustrative example of the 4-way perfectly balanced bMinCut problem on the real-world graph 3elt. (a) The evolution of the state probabilities of four typical spins in the four allowed states. (b) The evolution of the cut values and the largest group sizes in 1000 replicas. The circles denote the minimum cut values, and triangles indicate the maximum values of the largest group size, with solid lines illustrating the average values. The dotted line indicates the perfectly balanced group size 1180. The dashed line indicates the best-known minimum cut 201. Inset: the hyperparameter λ is linearly increased with the annealing step from 0 to λ_max = 0.2.
In our benchmarking experiments, we utilized the latest release of the METIS software, version 5.1.0 [67], specifically employing the stand-alone program gpmetis for the bMinCut tasks. Throughout the experiment, we consistently configured gpmetis with the following options, as outlined in the documentation [67]: -ptype=rb, which denotes multilevel recursive bisectioning, and -ufactor=1 (1 is the lowest permissible integer value allowed by METIS, indicating that while a perfectly balanced partitioning cannot be guaranteed, we aim to approach it as closely as possible). All other parameters and options were left at their default settings.
For KaHIP, we used the advanced variant of KaHIP named KaFFPaE [59] for the bMinCut tasks, with the following settings (see also the documentation [68] for more explanations): -n=24 (number of processes to use), --time_limit=300 (limiting the execution time to 300 seconds), --imbalance=0, --preconfiguration=strong, and we also enabled --mh_enable_tabu_search and --mh_enable_kabapE to optimize performance. Both gpmetis and KaFFPaE, implemented in C, were executed on a computing node equipped with dual 24-core processors at 2.90 GHz.
For FEM, the detailed hyperparameter settings are given in Tab. S3. Throughout the experiments, we initialize the local fields with random values according to h_i(σ_i)_ini = 0.001 · randn, where randn represents a random number sampled from the standard Gaussian distribution. FEM was executed on a GPU. All variables are represented using 32-bit single-precision floating-point numbers.
TABLE S3: The hyperparameter settings for FEM in the bMinCut benchmarking on real-world graphs. The optimization algorithm employed in this experiment is Adam. Note that the imbalance penalty coefficient λ is progressively incremented from 0 to its maximum value λ_max over N_step steps. RC time is an abbreviation for replica computation time when running FEM. The exponential scheduling is used for the annealing.
Graph | q | βmin | βmax | λmax | Nstep | Nreplica | γgrad | lr | weight decay | β1 | β2 | RC time(sec.)
2 2.12e-2 756 1 10000 2000 17.8 0.2914 3.4e-4 0.9408 0.7829 24.2
4 5.054e-2 840.84 0.2332 10000 2000 13.33 0.3664 3.411e-3 0.9158 0.7691 46.2
add20 8 3.048e-2 2178.64 0.6229 10000 2000 10.05 0.4564 4.198e-4 0.9018 0.7225 90.2
16 3.634e-2 1607.45 1.0553 10000 2000 81.04 0.6564 8.264e-3 0.9032 0.6009 181.2
32 3.77e-2 2827.44 1.9607 10000 2000 7.384 1.4246 0.01719 0.911 0.8199 360.1
2 5.054e-2 840.84 0.2292 10000 2000 13.328 0.2629 1.706e-3 0.9347 0.7692 29.1
4 5.054e-2 840.84 0.2292 10000 2000 13.328 0.2629 1.706e-3 0.9347 0.7692 57.1
data 8 5.054e-2 840.84 0.3438 10000 2000 13.328 0.3023 1.365e-3 0.9369 0.8076 109.2
16 7.316e-2 1766.14 1.3376 10000 2000 15.736 0.8358 1.874e-4 0.7801 0.4039 217.5
32 1.968e-2 613.88 0.697 12000 2000 57.56 0.6633 1.65e-3 0.5263 0.4681 519
2 3.648e-2 2458.64 2.07 10000 2000 7.146 1.2032 2.293e-2 0.9374 0.9215 45.8
4 3.648e-2 2458.64 2.07 10000 2000 7.146 1.2032 2.293e-2 0.9374 0.9215 90.2
3elt 8 2.918e-2 1966.92 0.8 10000 2000 7.384 0.2117 1.031e-3 0.724 0.5397 177.1
16 3.648e-2 2458.64 1.267 10000 2000 7.146 0.2238 2.577e-3 0.77 0.7698 355.3
32 4.626e-2 3581.42 1.369 12000 2000 6.768 0.9946 2.508e-2 0.777 0.8982 850
2 3.648e-2 2458.64 2.07 10000 2000 7.146 1.2032 2.2926e-2 0.9374 0.9215 103
4 3.648e-2 2458.64 2.07 10000 2000 7.146 1.2032 2.2926e-2 0.9374 0.9215 204
bcsstk33 8 3.72e-2 754.22 0.5016 12000 2000 15.6 0.1948 1.663e-3 0.7196 0.9 480
16 3.336e-2 950 1.167 12000 2000 7.052 0.5254 3.089e-3 0.8894 0.6223 968
32 2.918e-2 2827.44 1.656 12000 2000 7.384 1.5541 1.146e-2 0.9582 0.8686 1986
In the chip verification task described in the main text, the large-scale realistic dataset consists of 1495802 operators and 3380910 logical operator connections. We map this dataset onto an undirected weighted graph with 1495802 nodes and 3380910 edges. The information of this original graph is listed in Fig. S3(a). Note that, in the original graph, many edges have weights larger than 1, which differs from the other graphs used in the bMinCut benchmarking.
Besides, since the realistic datasets in this task often have locally connected structures among operators, meaning that many nodes form local clusters or community structures, we apply a coarsening trick to reduce the graph size for the sake of performance and speed. A good option is to coarsen the graph based on its natural community structure: we group the nodes within the same community into a new community node, hierarchically contracting the original graph. We refer to the coarsened graph as the community graph. Note that in the community graph, each community node now has a node weight larger than 1, equal to the number of nodes grouped into the community, and the weight of a community edge connecting two community nodes is the sum of the weights of the original edges whose endpoints lie in the two communities. To effectively identify the community structure of a large-scale graph, we apply the Louvain algorithm [58], which was developed for community detection. The information of the community graphs output by the Louvain algorithm, corresponding to level 2 and level 3, is also listed in Fig. S3(a). The modularity of the original graph is 0.961652. Fig. S3(b)
[Fig. S3(a), dataset information: the original graph has 1495802 nodes (max./min. node weight 1/1) and 3380910 edges (max./min. edge weight 29480/1); the table also lists the node counts and node-weight ranges of the level-2 and level-3 community graphs. Fig. S3(b) plots the node weight versus community node index for the two community graphs.]
FIG. S3. The detailed information of the large-scale realistic dataset used in this chip verification task. (a). The information of the
original graph and the two community graphs given by the Louvain algorithm. (b). The top 5000 community nodes with the largest node
weight in the two community graphs.
shows the top 5000 community nodes with the largest node weights in the two community graphs. Note that the coarsening technique used here to generate the coarsened graphs is different from the methods based on maximal-matching techniques employed in METIS.
Since FEM now has to ensure the balance constraint on nodes with different node weights, a slight modification of the constraint term in the mean-field free energy should be made, namely
\[
\lambda \sum_{i,j} \sum_{\sigma_i} \left[ P_i(\sigma_i) P_j(\sigma_i) - P_i^2(\sigma_i) \right] \;\rightarrow\; \lambda \sum_{i,j} \sum_{\sigma_i} \left[ V_i V_j P_i(\sigma_i) P_j(\sigma_i) - V_i^2 P_i^2(\sigma_i) \right] \,, \tag{S33}
\]
where V_i is the node weight of the i-th node. The corresponding modifications of the gradients are
\[
g_i^p(\sigma_i): \quad 2\lambda \Big[ \sum_j P_j(\sigma_i) - P_i(\sigma_i) \Big] \;\rightarrow\; 2\lambda V_i \Big[ \sum_j V_j P_j(\sigma_i) - V_i P_i(\sigma_i) \Big] \,, \tag{S34}
\]
\[
\hat{g}_i^p(\sigma_i): \quad \lambda c_i^a \Big[ \sum_j e_j(\sigma_i) - e_i(\sigma_i) \Big] \;\rightarrow\; \lambda c_i^a V_i \Big[ \sum_j V_j e_j(\sigma_i) - V_i e_i(\sigma_i) \Big] \,. \tag{S35}
\]
For the benchmarking experiments targeting the Max k-SAT problem, we collated the reported data, comprising the best-known results and corresponding shortest computation times, from the most efficient incomplete solvers. These solvers were evaluated on 454 random instances from the random MaxSAT 2016 competition [61], with the summarized data presented in Tab. S4. The incomplete solvers included are SsMonteCarlo, Ramp, borealis, SC2016, CCLS, CnC-LS, Swcca-ms, HS-Greedy, and CCEHC. For the FEM approach, we have also compiled the summarized results in Tab. S4 and the hyperparameter settings in Tab. S5. As before, throughout the experiments we initialize the local fields with random values according to h_i(σ_i)_ini = 0.001 · randn, where randn represents a random number sampled from the standard Gaussian distribution. FEM was executed on a GPU to ensure efficient computation. All variables are represented using 32-bit single-precision floating-point numbers.
TABLE S4: The results for FEM in the benchmarking on the random MaxSAT 2016 competition problems. RC time is an abbreviation for replica computation time, and ∆E = E_min − E_bkr, where E_min is the minimal energy found by FEM and E_bkr is the best-known value in the literature.
Index | Instance | Ebkr | Fastest solver | Solver's time(sec.) | Emin | ∆E | RC time(sec.)
TABLE S5: The hyperparameter settings for FEM in the benchmarking on the random MaxSAT 2016 competition problems. Throughout the experiments, we employed RMSprop as the optimizer. The inverse-proportional scheduling is used for the annealing.
Index | Instance | Nstep | Nreplica | Tmin | Tmax | γgrad | lr | alpha | weight decay | momentum
410 HG-3SAT-V250-C1000-14 500 800 0.0001 0.9 0.6 0.7 0.1 0.02 0.9
411 HG-3SAT-V250-C1000-15 90 50 0.0001 0.64 0.72 0.64 0.12 0.02 0.876
412 HG-3SAT-V250-C1000-16 130 60 0.0001 0.8 0.6 0.56 0.1 0.02 0.9
413 HG-3SAT-V250-C1000-19 300 100 0.0001 0.76 0.85 0.56 0.1 0.02 0.9
414 HG-3SAT-V250-C1000-2 300 100 0.0001 0.8 0.53 0.56 0.1 0.02 0.9
415 HG-3SAT-V250-C1000-22 300 200 0.0001 0.8 0.85 0.56 0.1 0.02 0.9
416 HG-3SAT-V250-C1000-24 300 200 0.0001 0.8 0.6 0.72 0.1 0.02 0.9
417 HG-3SAT-V250-C1000-3 300 200 0.0001 0.8 0.6 0.72 0.1 0.02 0.9
418 HG-3SAT-V250-C1000-8 300 200 0.0001 0.64 0.81 0.56 0.09 0.02 0.9
419 HG-3SAT-V300-C1200-2 500 200 0.0001 0.45 0.94 0.68 0.08 0.03 0.92
420 HG-3SAT-V300-C1200-21 200 200 0.0001 0.73 0.51 0.54 0.13 0.02 0.88
421 HG-3SAT-V300-C1200-7 100 200 0.0001 0.64 0.48 0.54 0.09 0.02 0.88
422 HG-3SAT-V250-C1000-21 300 100 0.0001 0.83 0.81 0.75 0.1 0.02 0.83
423 HG-4SAT-V100-C900-14 50 50 0.0001 0.7 0.85 0.63 0.08 0.02 0.67
424 HG-4SAT-V100-C900-19 50 50 0.0001 0.8 1.03 0.62 0.09 0.02 0.77
425 HG-4SAT-V100-C900-2 50 50 0.0001 0.85 0.66 0.65 0.06 0.03 0.44
426 HG-4SAT-V100-C900-20 50 50 0.0001 0.66 1.03 0.49 0.05 0.03 0.5
427 HG-4SAT-V100-C900-23 50 50 0.0001 1.31 1.26 0.71 0.09 0.03 0.41
428 HG-4SAT-V100-C900-4 50 50 0.0001 0.64 0.48 0.56 0.1 0.02 0.9
429 HG-4SAT-V100-C900-7 50 50 0.0001 0.8 0.6 0.56 0.1 0.02 0.9
430 HG-4SAT-V150-C1350-12 100 100 0.0001 0.64 0.62 0.67 0.1 0.02 0.9
431 HG-4SAT-V150-C1350-17 80 100 0.0001 0.7 0.57 0.71 0.1 0.02 0.71
432 HG-4SAT-V150-C1350-24 100 100 0.0001 0.76 0.85 0.56 0.1 0.02 0.9
433 HG-4SAT-V150-C1350-5 80 100 0.0001 0.8 0.76 0.56 0.1 0.02 0.9
434 HG-4SAT-V150-C1350-8 80 100 0.0001 0.67 0.71 0.69 0.13 0.02 0.62
435 HG-4SAT-V150-C1350-9 80 100 0.0001 0.7 0.53 0.9 0.11 0.03 0.75
436 HG-4SAT-V150-C1350-1 50 50 0.0001 0.83 0.62 0.56 0.1 0.02 0.9
437 HG-4SAT-V150-C1350-10 50 50 0.0001 0.76 0.75 0.77 0.07 0.03 0.75
438 HG-4SAT-V150-C1350-100 70 50 0.0001 0.76 0.85 0.61 0.11 0.02 0.75
439 HG-4SAT-V150-C1350-11 100 100 0.0001 0.76 0.9 0.54 0.12 0.03 0.88
440 HG-4SAT-V150-C1350-13 80 100 0.0001 0.7 0.76 0.56 0.1 0.02 0.9
441 HG-4SAT-V150-C1350-14 80 100 0.0001 0.8 0.6 0.72 0.1 0.02 0.9
442 HG-4SAT-V150-C1350-15 80 100 0.0001 0.83 0.96 0.49 0.12 0.02 0.88
443 HG-4SAT-V150-C1350-16 100 100 0.0001 0.8 0.76 0.56 0.1 0.02 0.9
444 HG-4SAT-V150-C1350-18 100 100 0.0001 0.64 0.81 0.56 0.09 0.03 0.79
445 HG-4SAT-V150-C1350-19 100 100 0.0001 0.8 0.6 0.61 0.1 0.02 0.9
446 HG-4SAT-V150-C1350-2 100 100 0.0001 0.64 0.57 0.61 0.09 0.02 0.9
447 HG-4SAT-V150-C1350-20 50 50 0.0001 0.83 0.64 0.49 0.09 0.02 0.67
448 HG-4SAT-V150-C1350-21 50 50 0.0001 0.9 0.52 0.65 0.09 0.02 0.62
449 HG-4SAT-V150-C1350-22 50 100 0.0001 0.76 0.67 0.62 0.09 0.02 0.75
to be continued...
450 HG-4SAT-V150-C1350-3 80 100 0.0001 0.8 0.6 0.61 0.1 0.02 0.9
451 HG-4SAT-V150-C1350-4 80 100 0.0001 0.8 0.57 0.56 0.1 0.02 0.9
452 HG-4SAT-V150-C1350-6 80 100 0.0001 0.8 0.6 0.56 0.1 0.02 0.9
453 HG-4SAT-V150-C1350-7 80 100 0.0001 0.67 0.68 0.6 0.1 0.02 0.7
454 HG-4SAT-V150-C1350-23 100 100 0.0001 0.8 0.71 0.56 0.1 0.02 0.9
The mean field equations for the Ising problem derived from the FEM formalism
We now reproduce the mean-field equations for the Ising problem from the FEM formalism. For the Ising Hamiltonian E_Ising(σ) = −(1/2) Σ_{i,j}^N W_ij σ_i σ_j, where σ_i ∈ {−1, +1}, the corresponding mean-field free energy F_MF = U_MF − (1/β) S_MF is computed as follows. The internal energy U_MF is defined as the expectation of E_Ising(σ) with respect to the mean-field joint probability P_MF(σ) = Π_i^N P_i(σ_i), which reads
\[
U_{\mathrm{MF}} = -\frac{1}{2} \sum_{\sigma} \prod_i P_i(\sigma_i) \sum_{i,j} W_{ij}\, \sigma_i \sigma_j
= -\frac{1}{2} \sum_{i,j} W_{ij} \sum_{\sigma_i} P_i(\sigma_i)\, \sigma_i \sum_{\sigma_j} P_j(\sigma_j)\, \sigma_j
= -\frac{1}{2} \sum_{i,j} W_{ij}\, m_i m_j \,, \tag{S36}
\]
where m_i = Σ_{σ_i} P_i(σ_i) σ_i = P_i(+1) − P_i(−1) is the magnetization (or mean spin). And the entropy S_MF reads
\[
S_{\mathrm{MF}} = -\sum_i \sum_{\sigma_i} P_i(\sigma_i) \ln P_i(\sigma_i)
= -\sum_i \sum_{\sigma_i} \frac{1 + m_i \sigma_i}{2} \ln \frac{1 + m_i \sigma_i}{2}
= -\sum_i \left( \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \right) \,, \tag{S37}
\]
where the identity P_i(σ_i) = (1 + m_i σ_i)/2 is used. Accordingly, the mean-field free energy for the Ising Hamiltonian turns out to be
\[
F_{\mathrm{MF}} = -\frac{1}{2} \sum_{i,j} W_{ij}\, m_i m_j + \frac{1}{\beta} \sum_i \left( \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} \right) \,. \tag{S38}
\]
At any given β, the optimal {P_i*(σ_i)} minimizing F_MF can be obtained from the zero-gradient equations ∂F_MF/∂P_i(+1) = −∂F_MF/∂P_i(−1) = 0 (since P_i(+1) + P_i(−1) = 1). And since P_i(σ_i) = (1 + m_i σ_i)/2, the information of all marginal probabilities is completely described by the magnetizations. Thus, it suffices to solve the zero-gradient equations with respect to {m_i}:
\[
\frac{\partial F_{\mathrm{MF}}}{\partial m_i} = 0 = -\sum_j W_{ij}\, m_j + \frac{1}{\beta} \cdot \frac{1}{2} \ln \frac{1 + m_i}{1 - m_i}
= -\sum_j W_{ij}\, m_j + \frac{1}{\beta} \operatorname{arctanh}(m_i) \,, \tag{S39}
\]
which, after rearranging, gives the fixed-point form
$$m_i = \tanh\Big( \beta \sum_j W_{ij} m_j \Big) , \qquad (S40)$$
reproducing the mean-field equations of existing mean-field annealing approaches for combinatorial optimization. Then, solving Eq. (S40) by fixed-point iteration with a slow annealing of $\beta \to \infty$ yields the solution $\{m_i^*\}$, and we identify the ground state via $\sigma_i^{\text{GS}} = \arg\max_{\sigma_i} P_i^*(\sigma_i) = \operatorname{sign}(m_i^*)$.
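As a concrete illustration, the annealed fixed-point iteration described above can be sketched in a few lines of Python. The function and variable names, the damping factor, the annealing schedule, and the toy ferromagnetic instance below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def mean_field_annealing(W, betas, n_iter=50, damping=0.5, seed=0):
    """Sketch of mean-field annealing for the Ising Hamiltonian
    E(sigma) = -1/2 * sum_ij W_ij sigma_i sigma_j.
    Iterates the fixed point m_i = tanh(beta * sum_j W_ij m_j)
    while slowly increasing beta."""
    rng = np.random.default_rng(seed)
    m = 0.01 * rng.standard_normal(W.shape[0])   # small random magnetizations
    for beta in betas:                           # annealing: beta -> infinity
        for _ in range(n_iter):                  # damped fixed-point iteration
            m_new = np.tanh(beta * (W @ m))
            m = damping * m + (1.0 - damping) * m_new
    return np.sign(m)                            # sigma_i^GS = sign(m_i^*)

# Toy ferromagnet: W_ij = 1 for i != j, so all spins should align.
W = np.ones((6, 6)) - np.eye(6)
sigma = mean_field_annealing(W, betas=np.linspace(0.05, 3.0, 30))
energy = -0.5 * sigma @ W @ sigma   # ground-state energy -15 for this instance
```

The damping step is a common stabilization of the plain fixed-point update; the FEM paper itself uses gradient-based optimizers, for which this iteration is only the classical mean-field-annealing baseline.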
Building upon the three example problems explored in the main text, we now outline a concise methodology for modeling
additional well-known COPs within the FEM framework. As demonstrated previously, the essential step
in solving a given COP using FEM is to precisely define the expectation of the designed cost function with respect to the
marginal probabilities. Since the entropy has a fixed form under the mean-field ansatz, calculating the internal energy yields
the complete free-energy expression as a function of the marginals. The COP can then be tackled using either the
automatic-differentiation approach or explicit gradient formulations.
Probably the simplest and most straightforward cases are problems that can be naturally translated into QUBO form
(or Ising formulations). The target function of a QUBO problem can be written as
$$E(\sigma) = \sigma^{T} Q \sigma = \sum_{i,j} Q_{ij} \sigma_i \sigma_j . \qquad (S41)$$
Here $\sigma$ is a vector of binary spin variables (0 or 1) and $Q$ is a symmetric matrix encoding the QUBO problem.
The expectation value of the cost function, i.e., $U_{\text{MF}}$, is
$$\langle E \rangle = \sum_i \sum_j p_i Q_{ij} p_j , \qquad (S42)$$
where $p_i$ denotes the marginal probability that the spin variable $\sigma_i$ takes the value 1.
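A minimal numerical sketch of Eq. (S42): the matrix $Q$ and the marginals below are made-up toy values, and the gradient formula assumes a symmetric $Q$:

```python
import numpy as np

# Hypothetical 3-variable symmetric QUBO matrix (toy values, not from the paper).
Q = np.array([[ 0.0, -1.0,  0.5],
              [-1.0,  0.0, -1.0],
              [ 0.5, -1.0,  0.0]])

p = np.array([0.9, 0.8, 0.1])   # marginal probabilities that each sigma_i = 1

# U_MF = <E> = sum_ij p_i Q_ij p_j
U = p @ Q @ p
# Gradient w.r.t. the marginals (for symmetric Q): dU/dp_i = 2 * sum_j Q_ij p_j
grad = 2.0 * (Q @ p)
```

In FEM this gradient would be fed to a gradient-based optimizer (or obtained by automatic differentiation) rather than written out by hand.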
Many kinds of COPs are either naturally QUBO or can be translated into QUBO problems in a few simple steps. Some
prominent examples are listed below.
• Vertex cover
• Set packing
• Maximum clique
COPs with multi-state variables, commonly referred to as Potts models in the statistical-physics community, have a
target-function formulation different from that of the aforementioned QUBO-like problems. In the main text, we have shown the example of the
balanced minimum cut problem. Another such problem is graph coloring, which asks whether a graph admits a coloring
configuration in which no two endpoints of an edge share the same color. This problem can be transformed into a
binary-variable formulation using ancillary variables [21]; the cost function then becomes
$$E(\sigma) = \sum_i \Big( 1 - \sum_c \sigma_{i,c} \Big)^2 + \sum_{(i,j)} \sum_c \sigma_{i,c} \sigma_{j,c} . \qquad (S43)$$
Here there are $N \times C$ binary variables $\sigma_{i,c}$, with $N$ the number of vertices, $C$ the total number of colors,
and $\sigma_{i,c}$ indicating whether vertex $i$ has color $c$. The first term of this cost function is a
regularization term forcing every vertex to have exactly one color; the second term is the actual target function, counting
the violated edges whose two end vertices share the same color.
Within the FEM framework, the modeling becomes much simpler, requiring only $N$ multi-state variables and
no regularization term. The target function can be written as
$$E(\sigma) = \sum_{(i,j)} \delta(\sigma_i, \sigma_j) , \qquad (S44)$$
where $\sigma_i$ is a multi-valued spin variable representing the color of vertex $i$. The expectation value of this target function
is easily derived as
$$\langle E(\sigma) \rangle = \sum_{(i,j)} \sum_{\sigma_i} p_i(\sigma_i) p_j(\sigma_i) . \qquad (S45)$$
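The coloring expectation above is straightforward to evaluate from the marginals. Below is a minimal sketch; the function name, the softmax parametrization of the marginals via fields `h[i, c]`, and the toy triangle instance are illustrative assumptions:

```python
import numpy as np

def coloring_expectation(edges, h):
    """Expected number of violated edges under mean-field marginals:
    <E> = sum_{(i,j)} sum_c p_i(c) p_j(c).
    Marginals p[i, c] are parametrized as a softmax over fields h[i, c]."""
    p = np.exp(h - h.max(axis=1, keepdims=True))   # stable softmax
    p /= p.sum(axis=1, keepdims=True)              # p[i, c] = p_i(c)
    return sum(p[i] @ p[j] for i, j in edges)

# Triangle graph with 3 colors and uniform marginals:
# each edge is violated with probability 3 * (1/3)^2 = 1/3.
edges = [(0, 1), (1, 2), (0, 2)]
h = np.zeros((3, 3))
E = coloring_expectation(edges, h)   # -> 1.0
```

The softmax parametrization keeps every $p_i(\sigma_i)$ a valid probability while letting the fields be optimized freely by gradient descent.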
Graph coloring is the most natural such example, but many other problems can be translated into this kind of
formulation. Some examples are listed below.
• Hamiltonian Cycles and Paths
• Traveling Salesman
• Community Detection
For problems such as SAT and quantum error correction (QEC), the target function contains multi-variable interaction
terms. In the main text, we have already introduced how to solve Max k-SAT; here we demonstrate a formulation of
QEC compatible with the FEM solver.
The QEC problems can be translated into Ising spin-glass problems with the Hamiltonian of variables $\sigma_i \in \{-1,+1\}$ [69, 70]
$$H = -J \sum_{v}^{N_v} b_v \prod_{i \in \delta v}^{4} \sigma_i - h \sum_{i}^{N_d} \sigma_i , \qquad (S46)$$