Training coupled phase oscillators as a neuromorphic platform using equilibrium propagation

Qingshan Wang; Clara C Wanjura; Florian Marquardt

doi:10.1088/2634-4386/ad752b

1. Introduction

We are witnessing the widespread adoption of machine learning in all branches of science and technology. At the same time, it becomes clear that the energy costs for both training and inference are exploding in an unsustainable way. This is especially apparent for the most powerful recent innovations like large-language models and diffusion models for image generation, which contain billions of parameters. Even the impressive advances in deploying specialized hardware like graphical processing units or tensor processing units are not able to keep up with the rapidly rising resource requirements. This makes it all the more urgent to explore alternative means for computation in this domain. The hope that such alternatives exist rests on several observations. First, machine learning is inherently stochastic and it therefore seems wasteful to run implementations on digital hardware that was originally conceived for implementing mathematical algorithms where precise answers matter. In addition, one can exploit the specific structure of machine learning problems to design new hardware platforms that are no longer general purpose. In particular, one can overcome the inefficient separation between processing and memory, the so-called von-Neumann bottleneck. Third, many physical platforms naturally allow for highly parallel processing.

These are the motivations for the rapidly growing field of neuromorphic computing [1, 2], which explores a large variety of different physical platforms, with the aim of designing more energy-efficient parallel processing devices, operating much closer to the underlying physics than a typical digital processor. Neuromorphic implementations do include alternative digital electronic platforms that overcome the von-Neumann bottleneck, but there is even more fundamental research to be done in the domain of analog, more directly physics-based approaches. These range from solid-state physics (building on components like memristors [3], Josephson junctions [4, 5], or spin oscillators [6]) and optics [7–12] (free space or integrated photonics) to other domains, even including soft-matter physics (e.g. [13–16]).

A neuromorphic platform usually has to exhibit nonlinear dynamics to be powerful enough in terms of its expressivity, i.e. its ability to approximate arbitrary input–output relations. Arguably one of the strongest generic classical nonlinearities is found in phase oscillators. These are systems where a combination of energy pump and nonlinearity establishes a limit cycle, with the phase of the oscillation as the resulting continuous degree of freedom. The laser is the most prominent example. When several such phase oscillators are coupled, they exhibit the dynamics of phase locking and synchronization. Theoretically, these systems are described by the so-called Kuramoto model [17], which is equivalent to the relaxation dynamics of an XY model [18], one of the most widely studied models in statistical physics. Coupled phase oscillators have been explored in a variety of platforms, including laser arrays [19, 20], coupled mechanical limit cycles in optomechanical [21, 22] and nano-electromechanical [23] systems, cold atoms [24], Josephson arrays in superconducting circuits [25], spin oscillators [26, 27], coupled complementary metal–oxide–semiconductor (CMOS)-based electrical oscillators [28–31], and exciton-polaritons [32–34] (presented in panel (c) of figure 1).

Figure 1. Refer to the following caption and surrounding text. — **Figure 1.** Equilibrium propagation for coupled phase oscillators. (a) We train a neuromorphic system of coupled phase oscillators (units), here represented by gray circles. In the free phase (left), the input units (blue) are fixed (indicated by the blue rim) and the system relaxes to its equilibrium via the standard overdamped dynamics of coupled phase oscillators. Output units are highlighted in green. In the nudge phase (right), the output units are forced weakly towards the desired target output (red arrows) and the system evolves to a new, slightly shifted equilibrium. The difference between these two equilibria allows to infer the gradients needed to train the system, according to the fundamental rule of equilibrium propagation, see equation (7). (b) Schematic representation of the evolution in the space of oscillator phases φ_j in the nth iteration, with open circles depicting initial conditions and full circles depicting equilibria (φ^τ is the desired target configuration). The solid arrows refer to the convergence of the network configuration from the initial state to the equilibrium, while the dashed arrows refer to the evolution of the equilibria during the training (caused by the update of tuneable parameters). (c) Possible experimental platforms that can give rise to the dynamics of the XY/Kuramoto model of coupled phase oscillators include superconducting circuits, nonlinear electrical circuits, exciton-polariton systems, mechanical-oscillator limit cycles, coupled laser arrays and superfluids.
Download figure:
Standard image High-resolution image

In this article, we explore how a system of coupled phase oscillators can serve as a neuromorphic platform. Most importantly, we will investigate how it can be trained in a physics-based approach.

An array of coupled phase oscillators was first considered as a neuromorphic platform already in [35, 36]. In that case, the system was proposed as an associative Hopfield memory, where couplings can be determined in a single step based on a simple Hebbian rule. In contrast, in our case, we will be interested in the more general setting of supervised learning, where the goal is to learn a prescribed input–output relation, for regression or classification tasks—the most wide-spread application in machine learning. That such a goal might be achievable is indicated by the results in [37], where certain small architectures of coupled phase oscillators were handcrafted to address simple tasks. More generally, coupled oscillators have been proposed in various forms as computing platforms (see the review in [38]), and the training of coupled digital oscillators has also been investigated recently in simulations [39]. From the neuromorphic perspective, oscillators seem like a natural choice for machine learning, since the brain is based on spiking neurons that can also show behavior such as synchronization.

One particular challenge in neuromorphic physical platforms is the question of efficient training. The conceptually easiest, general approach that can be adopted in every case is the parameter shift method, where internal trainable parameters of the physical device are adjusted in a feedback loop based on the distance to the desired output (e.g. [40, 41] and many others). However, this scales very unfavorably, requiring for a single training step a number of evaluations rising linearly in the number of parameters. Another popular approach is reservoir computing [42–45], where the fixed (non-adjustable) nonlinear dynamics of a system is used to map the input to a higher-dimensional space which is then turned into the output via a trainable simple (possibly even linear) neural network, similar to kernel methods in machine learning. This has been very successful, but still relies on a digital neural network. One major goal in the field is to find ways to implement variants of backpropagation for neuromorphic devices, since this efficient method to find the gradients needed for training is the cornerstone of virtually all machine learning. One possibility consists in doing this via simulations ('digital twins'). This method suffers from the lower speed of simulations and the potential inaccuracy of the adopted model, although the latter constraint has recently been overcome via an ingenious hybrid method [46], where backpropagation happens in the simulation but the forward pass is performed on the real device. Physical implementations of backpropagation, exploiting the physical dynamics itself, are a conceptually very attractive goal. They have been occasionally put forward starting even in the 80s. The first ideas involved approaches that were not general purpose but relied on specific properties of specially engineered platforms, especially in optics [7, 47–49]. A more general type of physical backpropagation for an integrated-photonics platform was introduced and eventually implemented recently in [12, 50]. In certain cases, gradients can also be extracted directly from scattering experiments, as shown for the recently introduced nonlinear neuromorphic processing in linear wave scattering platforms [51]. However, only two generic methods exist for physically performing backpropagation in a wide class of systems. One of them is Hamiltonian echo backpropagation, which relies on the Hamiltonian dynamics of arbitrary time-reversal-invariant systems [52]. The other is equilibrium propagation (EP) [53], which encompasses arbitrary energy-based systems that relax to some equilibrium. Since coupled phase oscillators display relaxation-type phase dynamics, in this work we will employ EP.

We start by explaining how EP can be applied to systems of coupled phase oscillators. We then explore in numerical experiments the EP-based training of networks of oscillators in two illustrative examples (XOR and handwritten-digit recognition). We specifically point out the effects of multistability on training and how to account for them. Our results establish coupled phase oscillators as a viable neuromorphic platform for supervised training, although the choice of network architecture plays an important role.

2. The XY model and EP

In this section, we first describe our model and then briefly review the theory of EP and how to apply it in our setting. We address the problem of multistability, which arises especially in the very nonconvex energy landscape of coupled oscillators.

2.1. XY model of coupled phase oscillators

The dynamics of coupled phase oscillators can be described by some variant of the Kuramoto model, and this can be understood (in many cases) as overdamped relaxation dynamics in the energy landscape of an XY model.

An XY model consists of interacting classical 2D spins (in our case, the phase oscillators), which are described by a set of phase angles φ_i. Anticipating the neuromorphic application, we refer to the coupling between the spins as weights and assume that all weights are trainable. We adopt the following energy function

$\begin{equation} E\left(\phi ; W, h, \psi \right) = - \frac{1}{2} \sum_{ij} W_{ij} \cos\left(\phi_i-\phi_j\right) - \sum_i h_i \cos\left(\phi_i-\psi_i\right).\end{equation} \tag{ 1 }$

The first term describes coupling between arbitrary oscillators, with coupling constants (weights) W_ij. The second term describes a tendency towards phase locking to some external oscillatory drives, where h_i is set by the amplitude of the respective drive and ψ_i is the phase of the drive. In the machine-learning context, these could be understood in analogy to the bias terms of a neural network. Beyond that, we remark that, in the synchronization context, one would normally expect to have another set of terms, of the type $\Omega_i \phi_i$ . This would denote the fact that different oscillators are running at different intrinsic frequencies. When these terms are large, a transition towards desynchronization will take place. However, for the neuromorphic application we are interested in, we find that such terms—and the associated tendency towards desynchronization—are not helpful, as they may lead to the absence of stable fixed points. We therefore envisage a situation where all the oscillators' intrinsic frequencies are the same. Small deviations from this ideal scenario can be tolerated, since the coupling terms would help to generate stable fixed points.

In this article, we focus on the deterministic dynamics of the system. We obtain the equilibrium by solving the following ODE, assuming that the phase angles are driven by the negative energy gradient

$\begin{equation} \begin{split} \dot{\phi}_i = - \frac{\partial E}{\partial \phi_i} \left(\phi; \theta\right) = - \sum_j W_{ij} \sin\left(\phi_i - \phi_j\right) - h_i \sin\left(\phi_i - \psi_i\right). \end{split}\end{equation} \tag{ 2 }$

It is obvious from this equation that we took E (as well as the couplings and the biases) to have the dimensions of a frequency, to avoid having to introduce a separate mobility constant in front of the gradient. These are the equations of motion for a standard Kuramoto model [17] modified by the bias term $\sum_i h_i \cos(\phi_i-\psi_i)$ . In this sense, equation (2) describes the dynamics of a perfectly synchronized network of oscillators with the same intrinsic frequency. Here and in the following, θ denotes the set of all the trainable parameters—in our case, $\theta = (W, h, \psi)$ . As explained in the introduction, such a system can be realized in many physical platforms.

2.2. EP

EP was introduced in [53, 54] as a way to perform physical backpropagation, i.e. to extract training gradients based on the physical dynamics of a system undergoing relaxation towards a minimum-energy state. In a nutshell, as shown in panel (a) and (b) of figure 1, it involves relaxing the system under two different choices of boundary conditions ('free' and 'nudge'), comparing the resulting states locally and performing parameter updates via some feedback mechanism (see below). As such, EP stands in the tradition of contrastive learning approaches which have been a well-known part of machine learning since the 80 s [55], but importantly EP offers an implementation of those general mathematical ideas in terms of local physical interactions. In recent years, EP has been explored further in various directions. A variant termed 'coupled learning' has been introduced [56] (see [57] for a very useful comparison of the different versions of such energy-based training approaches). EP has been proposed for the training of nonlinear resistor networks [58], and it has also been extended to deal with spiking networks [59, 60]. Furthermore, it has been shown that EP can be successfully used in training coupled electrical oscillators for associative memory [61]. The effects of the EP/coupled-learning training dynamics on the response behavior of a neuromorphic system was analyzed carefully in [62]. In addition, for EP, it was realized that it is possible to implement continual parameter updates already during the second phase of the equilibration dynamics, leading to an approach that is local in time [63, 64]. Going an important step further, recently, a dynamical version of EP has been proposed [65] (see also [66]), where in addition to gradient extraction via physical dynamics there is also a way to perform the parameter update itself via physical dynamics, rather than via measurement and feedback. This development can be seen in the conceptual tradition of Hamiltonian echo backpropagation [52], where a general way to perform training updates via physical dynamics in neuromorphic systems was first introduced.

On the experimental side, EP and its variants have been implemented so far in only a few devices—almost all of them electronic systems. One set of demonstrations investigated linear and nonlinear resistor networks in a platform using two copies of the network working in tandem, trained using coupled learning [67–70]. An experimental application to elastic networks was recently presented in [16]. Furthermore, a classical Ising model has been trained using the ideas of EP and making use of a quantum annealer to efficiently reach equilibrium [71]. Probably the most technically advanced platform consists of a memristor crossbar array trained using a contrastive learning approach closely related to EP [72, 73].

In general, EP is applied to energy-based models, i.e. systems with an energy function. We will now describe its workings, already keeping in mind the system of coupled phase oscillators. During evaluation (inference), the input is injected by fixing the phases of some selected oscillators to the components of the input vector, e.g. representing the color values of a picture. This could as well be done by (sufficiently strong) external driving, which will generate the same tendency towards phase locking already present in the bias terms. The system then equilibrates via its natural dynamics, i.e. it relaxes to the lowest-energy state. The output is finally read off by observing a selected set of oscillators and taking their phases to represent the output vector.

We call the vector of input phases $\phi_{\mathrm{in}}$ , while the phase angles of the output units and the hidden units are denoted as $\phi_{\mathrm{out}}$ and $\phi_{\mathrm{hidden}}$ , respectively. The energy function can then be written as $E(\phi_{\mathrm{in}},\phi_{\mathrm{hidden}},\phi_{\mathrm{out}};\theta)$ . In the context of EP, we call this the internal energy.

During training, we want to measure the deviation of the actual output to the desired output (the target). In the context of phase oscillators, a simple possibility is the following distance function. It simply compares the normalized spin vectors s, leading to a cosine-similarity measure, with the desired output values (target) denoted as $\phi_i^\tau$ :

$\begin{equation} D\left(\phi, \phi^\tau\right) \equiv \sum_{i \in O} \frac{1}{2} | \mathbf{s}_i - \mathbf{s}_i^\tau |^2 = \sum_{i \in O} 1- \cos\left(\phi_i - \phi_i^\tau \right).\end{equation} \tag{ 3 }$

Training via EP requires the system to have another weak contribution to the energy function which can be switched on when needed. This component describes the interaction between the output and the desired target. It plays the role of a cost function in supervised learning and is called external energy. The most straightforward ansatz for the cost function would be to employ the distance function introduced above. However, we found that this generates problems, since even incorrect solutions (for which phases differ by π) are hard to avoid—albeit unstable. As the result, the training is extremely slow (we discuss this aspect in appendix C). Empirically, the following cost function works much better, as it eliminates these undesired unstable fixed points and quickly drives the system away from incorrect solutions

$\begin{equation} C\left(\phi, \phi^\tau\right) = \sum_{i \in S_{\mathrm{out}}} -\ln\left(1+ \cos\left(\phi_i - \phi_i^\tau\right) \right).\end{equation} \tag{ 4 }$

We define $S_{\mathrm{in}}$ , $S_{\mathrm{h}}$ and $S_{\mathrm{out}}$ as the sets of input, hidden and output units, respectively. Although the logarithm appearing here was introduced empirically, it can be motivated by the connection to the categorical cross-entropy (Kullback–Leibler divergence) found for classification tasks. Interestingly, the cost function adopted here can also be related to fidelities occurring when comparing different quantum states, despite the classical setting we are working in, as shown in appendix A. However, it has to be admitted that the cost function in this form is generally difficult to implement on a physical platform of coupled phase oscillators, since it is not clear how to obtain the logarithm. We address this problem in appendix C and argue that we can attain the same performance using the linear expansion of C, which is much easier to implement.

Once a cost function has been assigned, EP introduces a small parameter β which quantifies the strength of the interaction between the output units and the target. This leads to the total energy of the system

$\begin{equation} F\left(\phi,\theta,\beta\right) = E\left(\phi_{\mathrm{in}},\phi_{\mathrm{hidden}},\phi_{\mathrm{out}};\theta\right) + \beta C\left(\phi_{\mathrm{out}}, \phi^\tau\right).\end{equation} \tag{ 5 }$

The process of EP [53] contains the following steps:

(i)
Initialize the system such that the input units are fixed to the provided input values.
(ii)
Set β = 0 and obtain a stable fixed point/equilibrium by following the relaxation dynamics
$\begin{equation} \dot{\phi_i} = -\frac{\partial F}{\partial \phi_i}\left(\phi,\theta,\beta = 0\right)\end{equation} \tag{ 6 }$
towards the fixed point. This state is called the free equilibrium and this step is called the free phase of EP.
(iii)
Switch on a small value of β and again follow the dynamics towards a new fixed point. This state is called nudge equilibrium and this step is called the nudge phase of EP.
(iv)
Repeat the process 1–3 for all the input data from the given batch of training samples. The central insight of EP [53] is that the gradient of the cost function with respect to the trainable parameters can be approximated as
$\begin{equation} \begin{split} \frac{\partial C}{\partial \theta_\alpha} = \frac{\mathrm d}{\mathrm d \beta} \left( \frac{\partial F}{\partial \theta_\alpha} \right)_{\beta = 0} \approx \frac{1}{\beta}\left( \left\langle {\partial E \over \partial \theta_\alpha} \right\rangle^{\mathrm{nudge}}_{\phi_{\mathrm{in}}} - \left\langle {\partial E \over \partial \theta_\alpha} \right\rangle^{\mathrm{free}}_{\phi_{\mathrm{in}}} \right). \end{split}\end{equation} \tag{ 7 }$
in which, in our case, ${\partial E \over \partial \theta_\alpha}$ are given by the following analytical expressions
$\begin{equation} \begin{split} {\partial E \over \partial W_{ij}} & = -\cos \left(\phi_i - \phi_j\right) \\ {\partial E \over \partial h_i} & = -\cos \left(\phi_i - \psi_i\right) \\ {\partial E \over \partial \psi_i} & = - h_i \sin\left(\phi_i - \psi_i\right). \end{split}\end{equation} \tag{ 8 }$

It has been discovered that EP gradients may benefit from nudging the system away from the desired target state, i.e. β < 0, or even estimating the gradient in a symmetric way from positive and negative nudging [74]. However, it was not possible to use this trick in our work due to the fact that our cost function can become large quickly when nudged in the 'wrong' direction. We implement the dynamics of the equilibration numberically using the 4th order Runge–Kutta method with adaptive step size, implemented in python/jax on a GPU.

As we focus on the deterministic dynamics of a system with a complicated non-convex energy landscape, a discussion of multistability is inevitable. Apparently such a discussion was not needed in earlier applications of EP, due to the simpler nature of the energy landscape encountered there (e.g. [58]). An XY model with random couplings can be viewed as a spin glass, for which [75] shows that the number of fixed points grows exponentially with the size of the network. Although only a small fraction of these fixed points are stable, the effect of possible multistability in this energy landscape cannot be neglected, as it can undermine the training if it is not properly handled. We emphasize that on top of this multistability of the energy landscape (peculiar to energy-based models) which hinders equilibration for each training step, there can also be multistability in the cost function training landscape (similar to neural networks), which is a distinct aspect.

The effect of multi-stability is shown schematically in figure 2. In panel (a), we assume that the system has only one stable fixed point. The fixed point then moves along the dashed arrows and approaches the desired result (the target) by adjusting the trainable parameters with EP. However, if there are multiple stable fixed points, the situation becomes more complicated, as shown in panel (b). Depending on the arbitrarily chosen initial phase configuration one may end up in different fixed points. Depending on which training updates are performed in which fixed points, this may even result in situations where some of the fixed points move away from the target during training, at least temporarily. Sometimes there are even more drastic cases as a stable fixed point may become unstable after the parameters have been adjusted. These cases can reduce the efficiency of the training.

Figure 2. Refer to the following caption and surrounding text. — **Figure 2.** The problem of multi-stability. (a) A case with only one stable fixed point. The circles refer to the initial states and the arrows refer to the relaxation to the stable fixed points (equilibria; solid dots). The arrows with dashed lines refer the evolution of the equilibria during the training. The dark red dot refers to the target output, assuming that the two phase coordinates shown here both refer to output units, i.e. hidden phases and input phases are suppressed in this depiction. Different colors refer to different iterations during the training. (b) Multistability. Different equilibria can be reached at the same training iteration (same color), starting from different random initial conditions.
Download figure:
Standard image High-resolution image

In principle, the multistability problem can be solved by introducing noise into the system and rewriting equation (7) with a Boltzmann distribution (the theory is briefly discussed in [54]). However, we note here that the challenge of multistability can also be solved within the framework of deterministic dynamics. We propose to randomly initialize the hidden (and output) units for each new training step. The system will correspondingly end up in different fixed points, depending on the location of the randomly selected initial conditions with respect to the basins of attraction of those fixed points. Then, in the spirit of stochastic gradient descent (SGD), we average the learning gradient over both the random initial configurations and the training samples. Based on the success of SGD in other contexts, we expect that all stable fixed points can be trained simultaneously (and our numerical experiments confirm this). This is conceptually similar to the fact that a standard artificial neural network (ANN) can become insensitive to noise in some of its input neurons during training, if training samples that differ in the noise are prescribed to have the same output. In this comparison, the noise injected into the input neurons of ANNs corresponds to the different random initial conditions that we employ during EP training (each of which may lead to a different fixed point but is prescribed to have the same output nonetheless).

In practice, in each iteration we randomly initialize the system many times and average the approximate gradient given by EP over both the training sample batch and the initial states. We denote the number of random initial values by $M_{\mathrm{init}}$ and the size of the training batch by $M_{\mathrm{data}}$ , and we update the trainable parameters accordingly

$\begin{equation} \begin{split} \delta \theta_\alpha = & -\frac{\eta}{\beta} \frac{1}{M_{\mathrm{data}}} \sum_{i = 1}^{M_{\mathrm{data}}} \frac{1}{M_{\mathrm{init}}} \sum_{j = 1}^{M_{\mathrm{init}}} \left( \frac{\partial E}{\partial \theta_\alpha}\left(\phi^{i,j,\mathrm{nudge}}; \theta\right) - \frac{\partial E}{\partial \theta_\alpha}\left(\phi^{i,j,\mathrm{free}}; \theta\right) \right). \end{split}\end{equation} \tag{ 9 }$

Here $\phi^{i,j,\mathrm{nudge/free}}$ refers to the nudge/free equilibrium for the ith input attained from the jth initial state, and η refers to the learning rate.

3. Results

3.1. Learning the XOR function

We first test EP-based training of a coupled-oscillator network for the simple case of the XOR function. For ANNs, the XOR problem is historically important, because it was the first task to require at least one hidden layer. Due to its simplicity, the XOR task allows us to investigate the dynamical properties of the network in great detail. We are able to explore the effects of the network size, the evolution of equilibria when the network is multi-stable, and the influence of random initial configurations.

We encode 'True' and 'False' with $\phi = \pi/2$ and $\phi = -\pi/2$ , respectively. For XOR, if the two inputs are same, the output unit should give us a 'False' and vice versa.

At the beginning of training, the weights are initialized independently according to a standard Gaussian distribution, while the strength and the angle of the bias are initialized according to the uniform distribution over $[-0.5, 0.5)$ and $[-\pi, \pi)$ , respectively. During the training, we repeatedly update the parameters according to equation (9). In each iteration, the system is exposed to all 4 input–output pairs ( $M_{\mathrm{data}} = 4$ ). For each such pair, we repeat the following process for $M_{\mathrm{init}}$ times: the units of the network are initialized so that the phase angles of the hidden and output units independently follow the uniform distribution over $[-\pi, \pi)$ ; then we solve equation (2) for a fixed time interval T which is sufficiently long to ensure that the coupled phase oscillators approximately reach their equilibrium; finally, we calculate the average parameter gradient and update the trainable parameters using equation (9). In this section, we fix T = 100 and η = 0.1. For all the numerical experiments of this paper, we take β = 0.1 if not specified, which is a conventional choice used for existed literature [53].

All our numerical results for the XOR task are collected in figure 3. We first set the number of initial configurations to $M_{\mathrm{init}} = 1$ and train a five-unit XY network with all-to-all connectivity for 1000 iterations. In each iteration, the mean distance $\langle D \rangle$ averaged over the 4 input–output pairs (computed with equation (3)) is recorded. In addition, we perform averages over different training runs with random initial weight configurations. The average distance is shown in panel (a) of figure 3. The white line depicts the average mean distance, which we define to be $\overline{\langle D \rangle} = \frac{1}{N_{\mathrm{train}}}\sum_{i = 1}^{N_{\mathrm{train}}} \langle D(\phi, \phi^{\tau}) \rangle$ , where the average is taken over training runs. We find that the distance almost always converges to 0, which proves the ability of XY networks to learn XOR via EP training.

Figure 3. Refer to the following caption and surrounding text. — **Figure 3.** Results of training an XY model on the XOR task. (a) Statistics of the training progress of five-unit XY networks of all-to-all connectivity (the structure is shown in the upper right corner). Evolution of mean distance function $\langle D \rangle$ (white line, average over 1000 training runs). The density plot indicates the histogram of $\log_{10}D$ over these runs. (b)The dependence of the learning speed on the size N. For a given N, we randomly initialize the trainable parameters and train them for 1000 iterations, repeating this whole process 100 times. The curves show the evolution of $\overline{\langle D \rangle}$ during the training. (c) Evolution of equilibria during a single training run. For each of the four input configurations, the possible stable fixed points at each training iteration are found by running the system with random initialization for 100 times. For each of these fixed points the state of the output unit (upper row) and one of the hidden units (lower low) are recorded. Output units for all fixed points converge to a unique, deterministic function of the input at the end of training. (d) Dependency of learning speed on the number of random initial configurations, $M_{\mathrm{init}}$ . For each N and $M_{\mathrm{init}}$ we show the evolution of $\overline{\langle D \rangle}$ averaged over 100 training runs. (e) Upper panel: the learning speed of training measured by the average slope of the first 300 steps shown in (d) and its dependence on $M_{\mathrm{init}}$ . Lower panel: the 'physical training speed', measured by the slope divided by $M_{\mathrm{init}}$ .
Download figure:
Standard image High-resolution image

**Figure 3.** Results of training an XY model on the XOR task. (a) Statistics of the training progress of five-unit XY networks of all-to-all connectivity (the structure is shown in the upper right corner). Evolution of mean distance function $\langle D \rangle$ (white line, average over 1000 training runs). The density plot indicates the histogram of $\log_{10}D$ over these runs. (b)The dependence of the learning speed on the size N. For a given N, we randomly initialize the trainable parameters and train them for 1000 iterations, repeating this whole process 100 times. The curves show the evolution of $\overline{\langle D \rangle}$ during the training. (c) Evolution of equilibria during a single training run. For each of the four input configurations, the possible stable fixed points at each training iteration are found by running the system with random initialization for 100 times. For each of these fixed points the state of the output unit (upper row) and one of the hidden units (lower low) are recorded. Output units for all fixed points converge to a unique, deterministic function of the input at the end of training. (d) Dependency of learning speed on the number of random initial configurations, $M_{\mathrm{init}}$ . For each N and $M_{\mathrm{init}}$ we show the evolution of $\overline{\langle D \rangle}$ averaged over 100 training runs. (e) Upper panel: the learning speed of training measured by the average slope of the first 300 steps shown in (d) and its dependence on $M_{\mathrm{init}}$ . Lower panel: the 'physical training speed', measured by the slope divided by $M_{\mathrm{init}}$ .
Download figure:
Standard image High-resolution image

The effect of size N on learning speed is shown in panel (b). For different N, we initialize and train an all-to-all network 100 times and plot the evolution of the mean distance function $\overline{\langle D \rangle}$ . The results in (b) show that an increasing size initially accelerates the learning process. The speed of learning reaches its maximum at a certain number of phase oscillators and then decreases. The initial increase in speed can be explained by the increasing number of parameters, which makes the network more flexible. However, when the size becomes excessively large, the large number of parameters and multiple stable fixed points (all of which have to be trained) will reduce the efficiency of the training. According to the result in (b), the training becomes slower again when the number of units exceeds 8.

We now turn to a more detailed investigation of multistability (panel (c) of figure 3). We train a network of 30 units with 2000 iterations and record the parameters for each iteration. After training, for each iteration and each input–output pair, we search for all the possible equilibria by randomly initializing and running the network 100 times. We then inspect the phase angles for each of these equilibria. In the top row of panel (c), we show the evolution of the set of phase angles for an output unit. Importantly, training converges in the sense that the output unit phase angle becomes a deterministic function of the input, with no remaining scatter. At the same time however, multistability in the energy landscape remains. We can see this by observing the evolution of one selected hidden unit (bottom row), which shows scatter even after conclusion of the training. Overall, we confirm that despite the persistence of multistability, it is still possible to adjust all of the equilibria simultaneously to perfectly attain the training objective.

Finally, we study the effect of the number of random initial configurations, $M_{\mathrm{init}}$ , considered during each training iteration. First, for a fixed number of units N and $M_{\mathrm{init}}$ , we train an XY network with 100 different sets of initial trainable parameters and record the average mean distance function $\overline{\langle D \rangle}$ . The results are shown in panels (d) and (e). From panel (d), we observe that the learning speed increases with increasing $M_{\mathrm{init}}$ and saturates at large $M_{\mathrm{init}}$ . The effect of $M_{\mathrm{init}}$ is not obvious for networks of 5 units, as multi-stability is rare for small networks.

Another observation is that the average mean distance function $\overline{\langle D \rangle}$ always decays approximately exponentially in the early stages of training. Therefore, we perform a linear regression to $\log \overline{\langle D \rangle}$ on the first 300 iterations and use the negative slope to measure the speed of learning. The results are shown in the upper graph of panel (e). An increase in the learning speed and a saturation at large $M_{\mathrm{init}}$ are evident. However, a larger $M_{\mathrm{init}}$ physically means more equilibration runs. This results in larger time consumption in the physical world. To take this into account, we divide the learning speed extracted above by $M_{\mathrm{init}}$ and use this ratio to measure the physical speed of learning ('normalized convergence speed' in the figure). The results are shown in the bottom graph. We observe that the physical learning speed reaches its maximum at a relatively small value of $M_{\mathrm{init}}$ . This indicates that a small number of different initial configurations can already sufficiently accelerate the learning process in terms of reducing the physical time consumption.

3.2. Handwritten-digits recognition

In the previous section we showed the ability of XY networks to learn the simple XOR task via EP. In this section, we will analyze the performance on the recognition of written digits, a more complex task which has long been considered a benchmark for neural networks.

To keep the numerical effort in our simulations manageable, we use a scaled-down version of the MNIST dataset. Specifically, we use the dataset provided by ScikitLearn which is a copy of the test set for the Optical Recognition of Handwritten Digits from the UCI ML hand-written digits datasets [76]. It consists of 1797 8×8-pixel greyscale images of written digits, approximately 180 for each. A sample figure is shown in the left of panel (a) of figure 4. The size of the data set and the dimension of the pictures are all smaller than the commonly used MNIST, which allows us to do training via EP in smaller coupled-oscillator networks and test various architectures more easily. The structure of the networks and the results are shown in figure 4.

Figure 4. Refer to the following caption and surrounding text. — **Figure 4.** Handwritten-digit recognition in a network of coupled phase oscillators. (a) We consider both all-to-all and layered connectivity networks, where the recognized digit is one-hot encoded in the phase configuration of the ten output units. (b) Training curves for networks of different sizes with all-to-all connectivity and layer structure. The sizes of the networks are selected such that the number of trainable parameters for them are approximately the same (shown in table 1). (d) Training evolution of confusion matrices, for different structures and different network sizes.
Download figure:
Standard image High-resolution image

As shown in figure 4(a), we train XY networks with both all-to-all connectivity and a layered structure. In each such network, there are 64 input units and 10 output units. The input data (image pixels) are linearly rescaled into the range $[-\pi/2, \pi/2]$ . As usual for classification tasks, the outputs are interpreted as a probabilities, such that $p_i \propto 1+\sin \phi_i$ , as shown in the right side of panel (a), where p_i denotes the probability that 'the input digit is i'. The inference result is given by taking the index of the highest probability. We note that in our setting these probabilities are not normalized, i.e. they do not sum up to one. This is different from usual ANNs, where one uses a softmax activation function in the last layer to ensure normalization. The label for each input image is encoded in a 'one-hot' configuration, giving a phase of $\pi/2$ at the output unit (phase oscillator) corresponding to the correct label of the digit seen in the image and $-\pi/2$ otherwise. We use the same cost function as for the XOR task.

For neural networks of large size, the influence of weight initialization on training can be crucial [77]. In this section, we initialize out networks as following:

(i)
For all-to-all networks, the weights are initialized according to $\mathcal{N}(0, 1/N)$ , where $\mathcal{N}(\mu, \sigma^2)$ refers to a normal Gaussian distribution with mean value µ and variance σ², and N is the total number of units in the network. The strength of the initial bias field is set to be 0 everywhere, while the initial bias directions are randomly set according to a uniform distribution over $[-\pi, \pi]$ .
(ii)
For layered networks, the initial weights connecting the ith and the i + 1th layers are initialized according to $\mathcal{N}(0, 1/(N_{i}+N_{i+1}))$ , where N_i and $N_{i+1}$ are the size of the ith and the i + 1th layer. The biases are initialized in the same way as for all-to-all networks.

In order to evaluate the effect of the architecture, we compare the training process of all-to-all and layered XY networks with a similar number of parameters. For an all-to-all network of N units containing $N_{\mathrm{in}}$ input units and $N_{\mathrm{out}}$ output units, the number of trainable parameters (weights, strength and direction of bias) is $\frac{1}{2} N(N-1) - \frac{1}{2} N_{\mathrm{in}}(N_{\mathrm{in}}-1) + 2(N - N_{\mathrm{in}})$ . For a network of layered structure, the number of trainable parameters is $\sum_{i = 0}^{L-1} N_i N_{i+1} + 2\sum_{i = 1}^{L} N_i$ , where L here denotes the depth of the network (number of layers). The focus of our work is on the performance of all-to-all networks and layered networks with one hidden layer. The sizes and parameters of the tested networks are displayed in table 1.

Table 1. Sizes and parameters of tested networks.

Layered		All-to-All
Layer structure	# Parameters	# Units	# Parameters
64, 20, 10	1540	85	1596
64, 100, 10	7620	139	7725
64, 200, 10	15 220	185	15 246
64, 300, 10	22 820	222	22 831

The results are shown in panels (b) and (c) of figure 4. The data set is split into a training set and a test set. The training set consists of the first 100 images for each of the 10 digits (1000 images in total), while the test set consists of the next 70 images for each digit. All networks are trained for 1000 iterations. In each iteration, we randomly select 30 images from the training set for each digit (so the total batch size is 300). We take the number of random initial phase configurations for each sample, $M_{\mathrm{init}}$ , to be one. This can be justified by our insights from the XOR task (panel (f) in figure 3) and the fact that a large batch size, such as the one adopted here, automatically leads to a variety of different initial configurations. Every five iterations, we test the accuracy of the inference with all the images in the test set.

The evolution of the test errors is shown in figure 4 (b) and we list the best attained test accuracy for each architecture in table B1 of appendix B in comparison to results we obtained with ANNs and linear classifiers. Increasing the number of parameters (by increasing the number of hidden units) seems to improve the test accuracy for the layer architecture although the accuracy again seems to drop as we increase the number of hidden units from 200 to 300. The best test accuracy we obtained with the layer architecture (at 200 hidden units) was 94.1% which is above the accuracy of 90.7% achieved by a linear classifier with the same number of parameters but below the accuracy of 95.0% achieved by an ANN with one hidden layer (see appendix B for further details). In contrast, the performance of a network with all-to-all connectivity remains approximately constant as we increase the number of parameters with a slight drop as the number of hidden units increases from 11 to 75. The best accuracy we achieved with all-to-all connectivity was 93.3% with 11 hidden units. The comparable linear classifier achieves 90.4% accuracy, while the ANN attains 94.3%. This indicates that EP can be used successfully to train a network of coupled phase oscillators to perform standard classification tasks. The performance does not reach the state-of-the-art values of ANNs, which have been highly optimized over the past two decades for the MNIST task (reaching accuracies close to 100% on the larger $28\times 28$ images), but this was to be expected—further optimization and evolution of the neuromorphic platform (as well as the details of the training procedure) would certainly yield improved accuracy. In addition, we recall that the motivation for adopting neuromorphic platforms is to achieve gains in energy efficiency and other aspects like inference speed. Therefore, one may be willing to pay the price of reduced accuracy in certain applications.

The influence of the network architecture on convergence is also evident. For all-to-all networks, the curves depicting the training evolution of the test error almost overlap for the different choices of network size. In contrast, for layered networks, convergence speed increases significantly as the size of the hidden layer increases. However, as the size of the hidden layer increases above 200, the speed of convergence tends to saturate. This difference suggests that the speed of learning is more sensitive to the size of the networks when they are layered.

Finally, we visualize the training evolution via the confusion matrix, which measures the probability that a given true digit is classified as one of the ten possible digits (correct classifications are on the diagonal). For different network sizes, we plot the confusion matrix after 0, 10, 50, 100 and 1000 iterations during training. Comparing the results of the same network architecture, we see a clear influence of size on the learning speed for layered networks, which is negligible for all-to-all networks. Comparing the results for the similar number of trainable parameters, we find that the learning process of layered networks is slower at smaller sizes, but faster when the size is sufficiently large.

In terms of network architectures, it is a natural question to ask whether a locally coupled network can also yield good training performance. This is an important question, since physical interactions are naturally local, so a neuromorphic device based on such an architecture would require less physical resources. We have explored this question and investigated nearest-neighbor-coupled square lattices of phase oscillators. In that setting, there are various choices to be made, e.g. as to how the input and output units are distributed over the lattice. In any case, despite our best efforts, we could not make that architecture train successfully on the MNIST task. Intuitively, we believe this is due to the difficulty for the system to find ways to effectively process different parts of the input that are spatially removed from each other (and from the output units). This difficulty may also be related to the physics of localization in disordered systems. More detailed investigations here would be needed in the future, since the performance of locally coupled neuromorphic devices in non-trivial tasks is an important question, even beyond the coupled phase oscillator platform discussed here.

4. Towards experimental implementations

There is a large variety of possible experimental platforms that could be suitable to implement the ideas presented here, as mentioned already in the introduction. While it is beyond the scope of the present manuscript to offer detailed design studies and performance analyzes for these platforms, leaving these to future works, we can already summarize the main ingredients that will be required for any successful experimental implementation.

Tuneable couplings. In any platform, we need to have tuneable couplings between the different phase oscillators. These might include optical or electrical couplings.

Readout. Both for inference (evaluation) and during training, we require the possibility to read out the phases of the oscillators. In optical or electrical implementations, this would typically happen either via heterodyne detection (beating optical signals against a local oscillator) or possibly direct digitization and processing of an electrical signal.

Input. We need to inject the input signal. This can be performed by injection locking, i.e. injecting at the oscillators serving as input nodes a relatively strong signal (at a drive strength exceeding the coupling between oscillators) oscillating at the common frequency of the phase oscillators. Those oscillators will then lock to the phase of the input signal.

Biases and output nudging. As discussed, in our model, we need to exert tuneable bias forces at all the oscillators (both for the biases inside the network, but also for the slight nudging of output oscillators). This can be achieved again using injection locking, but now simply with a weaker signal. For any given oscillator subject to such a weak injection signal, the signal amplitude will determine the strength of the bias term and its phase will determine the bias phase.

Stable homogeneous frequencies. In our modeling, we have assumed that all phase oscillators are running at the same bare frequency, leading to a direct implementation of the XY model. Slight violations of this condition could probably be tolerated, since phase oscillators with bare frequencies that differ much less than the coupling strength will synchronize (frequency-lock) anyway. Apart from this observation, it is however important to use an experimental system where either the intrinsic frequencies are all matched or where one can implement some feedback loop to stabilize the frequencies.

Equilibration. The standard overdamped phase dynamics of a system of coupled phase oscillators will provide the required relaxation dynamics towards the fixed points.

Physical example. As one very promising illustrative example, we mention the laser-array platform introduced in [19, 78]. In that flexible experimental platform, a laser array is constructed out of modes propagating in parallel inside a Fabry–Pérot cavity. It has been shown [79] that it is possible to couple these modes even with arbitrary tuneable couplings by inserting suitable optical elements in the beam paths. In addition, it is possible to read out the phases of the modes by measuring the weakly outcoupled light emerging from the cavity. Moreover, optical injection locking is naturally possible.

As a final remark, we point out that even the simple direct coupling of phase oscillators (e.g. xx-coupling of nearby mechanical limit-cycle oscillators) may in general lead to a phase model that is more complex than the XY model (Kuramoto model) discussed here. As shown in [21, 80], straightforward nearest-neighbor coupling of Hopf oscillators (i.e. generic limit-cycle oscillators) can lead to coupling terms that fall outside the scope of the XY/Kuramoto model. These include effective next-to-nearest-neighbor coupling terms and dependencies of the type $\sin(2 (\varphi_j - \varphi_k))$ . Most importantly, one can even have terms of the shape ${\dot \varphi}_j \propto \cos(\varphi_k - \varphi_j)+\ldots$ and simultaneously $\dot \varphi_k \propto \cos(\varphi_j-\varphi_k)+\ldots$ . These latter terms are crucial deviations from the XY model, since they cannot be understood as being derived from a potential or free energy: that would by necessity lead to a term that switches sign when going from $\dot \varphi_j$ to $\dot \varphi_k$ . As a consequence, a dynamical model with such terms represents a truly nonequilibrium situation with new properties. While terms of the first class discussed above are less dramatic in their impact, and EP would still apply (simply for a different energy functional), the latter scenario is more difficult: indeed, standard EP cannot be applied in that scenario. However, there has been some progress in extending EP to relaxation dynamics given by an arbitrary vector field which is not necessarily a gradient field, see [81, 82]. Applying these new approaches would allows us to extend the training analyzed here to the more general case.

5. Conclusions

Our results confirm that it is possible to employ EP for training systems of coupled phase oscillators. As an important aspect, we observed that the complex energy landscape of the XY model leads to multistability. We showed that it is possible to address that challenge via stochastic initialization, leading to a simultaneous adjustment of all fixed points and eventually stable convergence of the training. In terms of physical implementations, we feel the most promising route might be based on coupled laser arrays of the type shown in [19], possibly combined with the flexibility of spatial light modulators or similar devices for setting the coupling weights, as shown for photonic neural networks in [9]. However, many other platforms are conceivable as well, with the main requirements that it is possible to read out phases, drive individual oscillators externally (for input injection and biases), as well as implement tuneable couplings between arbitrary oscillators.

Acknowledgments

We acknowlegde funding via the German Research Foundation (DFG, Project-ID 429529648–TRR 306 QuCoLiMA, 'Quantum Cooperativity of Light and Matter').

Data availability statement

The data that support the findings of this study will be openly available following an embargo at the following URL/DOI: https://github.com/chaganbajala/XY_EqProp/tree/main. Data will be available from 13 May 2024.

Appendix A: Interpretation of the cost function

In equation (4) we introduced a cost function different from the straightforward distance function in equation (3). Here we comment further on the problems posed by the distance function. In addition, we point out an interesting connection between the adopted cost function and the overlap of quantum states.

From equation (3), we find that the training reaches its fixed points when

$\begin{equation} \begin{split} \sin\left(\phi_i - \phi_i^\tau\right) = 0 \end{split}\end{equation} \tag{ A.1 }$

holds for the output units. The distance function D reaches it minimum when $\phi_i = \phi_i^T$ and maximum when $\phi_i = \phi_i^T + \pi$ . Theoretically, the maxima are unstable and therefore should be avoided. However, practically they can still cause a long stay on some plateaus during training (with $\phi_i = \phi_i^T + \pi$ for some output unit) and seriously increase the time cost for training. In contrast to this behavior of D, the only fixed point for the adopted cost function C is the case when $\phi_i = \phi_i^T$ holds for all the output units. We find that this makes the training much more efficient.

Although C was initially chosen simply to provide a strong repulsion from the unstable fixed points (maxima), we can draw an interesting connection to quantum physics. Assume that we have an array of independent two-level quantum systems and prepare the state of the system based on the output units by letting

$\begin{equation} |\Phi_i \rangle = \frac{1}{\sqrt{2}}\left(|0 \rangle + e^{i \phi_i} |1\rangle \right)\end{equation} \tag{ A.2 }$

which is exactly an eigenstate of $\cos \phi_i X + \sin \phi_i Y$ of eigenvalue +1 (Here $X = \vert 0 \rangle \langle 1\vert 0\rangle \langle 1 | + |1\rangle \langle 0 |$ and $Y = \mathrm{i} |0\rangle \langle 1 | - \mathrm{i} |1\rangle \langle 0 |$ ).

We prepare another set of two-level systems based on the state of the target in the same way. Then we have two product states, corresponding to the actual output $|S \rangle$ and the target $| T \rangle$

$\begin{equation} \begin{split} & |S \rangle = \prod_{i \in S_{\mathrm{out}}} |\Phi_i \rangle = \frac{1}{2^{|S_{\mathrm{out}}|/2}} \prod_{i \in S_{\mathrm{out}}} \left(|0 \rangle + e^{i \phi_i} |1\rangle \right) \\ & | T \rangle = \prod_{i \in S_{\mathrm{out}}} |\Phi_i^\tau \rangle = \frac{1}{2^{|S_{\mathrm{out}}|/2}} \prod_{i \in S_{\mathrm{out}}} \left(|0 \rangle + e^{i \phi_i^\tau} |1\rangle \right). \end{split}\end{equation} \tag{ A.3 }$

$\begin{equation} \begin{split} \mathcal{L} & = -\log \left| \langle T|S \rangle \right| ^2 \\ & = -\sum_{i \in S_{out}} \log \left( \frac{1}{4} |1+e^{\unicode{x0131} \left(\phi^\tau_i-\phi_i\right) }|^2 \right) \\ & = -\sum_{i \in S_{out}} \log \left( \frac{1}{2} \left(1+\cos\left(\phi_i -\phi^\tau_i\right)\right) \right) \end{split}\end{equation} \tag{ A.4 }$

which has the form of C expressed in equation (4). We note that $\left| \langle T|S \rangle \right| ^2$ is the probability of measuring the quantum state $|T\rangle$ after preparing the state $| S \rangle$ (if measurements are performed with respect to a basis containing $| T \rangle$ as one of the states). Thus, $\mathcal L$ can also be seen as a categorical cross-entropy $- \sum P_j \ln Q_j$ , for the case where P₁ = 1 is the desired probability of outcome 'T' in such a measurement, and $Q_1 = \left| \langle T|S \rangle \right| ^2$ is the probability of actually observing this outcome.

Appendix B: Digit recognition performance and benchmarks

We compared the performance of the neuromorphic system to artificial neural networks (ANNs) and linear classifiers which we implemented with tensorflow. The architecture of the models we implemented as well as the number of parameters matched those of the layered structure of our neuromorphic systems. We used a categorical cross-entropy loss function for the ANNs and a mean-squared-error loss function for the linear classifiers and trained both models using Adam optimization for 100 epochs dividing the training data into randomized mini-batches of ten images. The maximum test accuracy is reported in table B1 and compared to the performance of the corresponding neuromorphic systems with either layer structure or with all-to-all connectivity.

Table B1. Digit recognition test accuracy and benchmarks. Benchmarks were obtained with networks of the same structure as the layered neuromorphic system. For the linear classifier, we used linear activations and a mean-squared-error cost function; for the artificial neural network, we used a sigmoid activation function in the second layer, a softmax in the last layer and a categorical cross-entropy activation function.

Layered			All-to-All			Benchmark
# hidden units	# parameters	test accuracy	# hidden units	# parameters	Test accuracy	ANN	Linear classifier

20	1540	$91.9 \%$	11	1596	$93.3 \%$	$94.3 \%$	$90.4 \%$
100	7620	$92.7 \%$	75	7725	$92.7 \%$	$95.0 \%$	$90.3 \%$
200	15 220	$94.1 \%$	115	15 246	$92.4 \%$	$95.0 \%$	$90.7 \%$
300	22 820	$93.7 \%$	148	22 831	$92.9 \%$	$95.1 \%$	$90.7 \%$

Appendix C: Comparison of different choices for the cost function

In the article we use the cost function $C(\phi, \phi^\tau) = \sum_i - \log(1+\cos(\phi_i - \phi^\tau_i))$ for the training and the distance function $D = \sum_i 1 - \cos(\phi_i - \phi^\tau_i)$ for assessing the performance. Here we give a comparison of these two metrics for the distance between the outputs and the targets.

The definition of $D(\phi, \phi^\tau)$ is direct and it could be easily implemented in the experiment (as it has the same form as the coupling terms anyway present in the energy function). However, we find in general that it suffers the problem of slow convergence. This is particularly obvious for the XOR scenario, as can be seen from figure C1. From figure C1 we find that after being trained for 10 000 epochs, the networks still produces an average accuracy of the prediction worse than that of networks trained with C for only 1000 epochs. We also test the effect of choosing the D cost function on the scaled-down version of MNIST and still find slow convergence.

Figure C1. Refer to the following caption and surrounding text. — **Figure C1.** Training with $\cos(\phi-\phi^\tau)$ . Training curves for XOR, displaying D averaged over 1000 training cases.
Download figure:
Standard image High-resolution image

**Figure C1.** Training with $\cos(\phi-\phi^\tau)$ . Training curves for XOR, displaying D averaged over 1000 training cases.
Download figure:
Standard image High-resolution image

As mentioned briefly in the main text, there is a challenge when attempting to apply the cost function C directly in experiments: it is not obvious how to construct physically a cost function of this shape in a usual setup of coupled phase oscillators. However, we can solve this challenge by the following construction. Observe that we can expand the cost function (equation (4)) to linear order in the deviation of φ from the free equilibrium φ₀ and find that:

$\begin{equation} C\left(\phi, \phi^\tau\right) \approx C\left(\phi_0, \phi_\tau\right) + \frac{\partial C}{\partial \phi}\left(\phi_0, \phi^\tau\right) \left(\phi-\phi_0\right) + \ldots\end{equation} \tag{ C.1 }$

Replacing the true cost function C by this approximation, we obtain something that is much easier to implement experimentally. Indeed, in the setting of coupled phase oscillators, one just needs detuned local oscillator frequencies to produce a term in the energy function that is linear in the phases. In this setting, we then find the nudge force to be:

$\begin{equation} F_{\mathrm{nudge}} = - \frac{\partial C}{\partial \phi} \approx - \frac{\partial C}{\partial \phi}\left(\phi_0, \phi^\tau\right) + {\mathcal O}\left(\phi - \phi_0\right).\end{equation} \tag{ C.2 }$

This is different from the nudge force for the full cost function C, which is $- \frac{\partial C}{\partial \phi}(\phi, \phi^\tau)$ . The difference will become more pronounced when β is large. In the figure below, we compare training using the approximate version above and the full version of C, for our default choice of β = 0.1. We observe good correspondence.

Apart from the issue of comparing C and its experimentally more feasible approximation, we also would like to comment on the differences between training using the simple cost function D and the more elaborate construction C. We have of course deliberately chosen our definition of C such that minimization of C is equivalent to minimization of D. We now discuss the deviation of the gradients of these two functions when we are still apart from the minimum. Note that for the distance function we have:

$\begin{equation} \begin{split} \frac{\partial D}{\partial \phi_j}\left(\phi_{0}, \phi^\tau\right) = \sin\left(\phi_{0,j} -\phi_j^\tau \right). \end{split}\end{equation} \tag{ C.3 }$

Compare this to the cost function gradient:

$\begin{equation} \begin{split} \frac{\partial C}{\partial \phi_j}\left(\phi_0, \phi^\tau\right) & = \frac{\sin\left(\phi_{0,j} - \phi_j^\tau\right)}{1 + \cos\left(\phi_{0,j} - \phi_j^\tau\right)}. \end{split}\end{equation} \tag{ C.4 }$

At the early stages of training, the difference of nudge forces given by C and D is significant. However, when the system is trained for a number of iteration and gives output close to the target, we observe that D and C will asymptotically generate the same gradients (with β effectively doubled for D). This can be seen in the panel (b) of the figure below.

To assess the performance of the linear expansion of D and also the deviation of gradients given by D and C, we train 100 networks of 20 nodes for XOR with C and its linear expansion separately. Then, for the networks trained with C, we calculate the parameter gradient given by C and D after each iteration and compute the cosine similarity between parameter gradients given by:

$\begin{equation} \cos \langle g_1, g_2 \rangle = \frac{g_1 \cdot g_2}{|g_1| |g_2|}.\end{equation} \tag{ C.5 }$

Afterwards we average this over the 100 networks. The results are shown in figure C2. Panel (a) shows the decaying cost (averaged over the 100 networks), both for using the full cost function C and its linear expansion. We find that the training is still effective with linear expansion around the free equilibrium. Panel (b) shows the evolution of the average cosine similarity of parameter gradients given by C and D, from which we can conclude that at the late stage of training the two different cost functions give the same gradients.

Figure C2. Refer to the following caption and surrounding text. — **Figure C2.** (a) Training curves for XOR. Training was performed using either the full cost function C or its linear expansion. Resulting distance D averaged over 100 training cases (100 different networks). (b) Comparison of direction of gradient descent given by C (logarithm) and D (cosine), calculated as an averaged cosine similarity, plotted separately for weights and bias gradients.
Download figure:
Standard image High-resolution image

Appendix D: Effects of weight noise

In this section we discuss the response of the system to a situation where its weights are perturbed by random noise. In this way, we assess the flexibility and stability of the network. We first study this response with XOR. For a trained all-to-all connected XY network, we disturb the network with Gaussian noise added to the weights. The scale of the noise is proportional to the standard deviation of the weights:

$\begin{equation} \delta W_{ij} \sim N\left(0, \sigma_r^2 \sigma^2\left(W\right)\right)\end{equation} \tag{ D.1 }$

where $\sigma^2(W)$ is the variance of weights and σ_r is a dimensionless constant which sets the relative scale of the noise.

For XOR, we have produced numerical results for different network sizes. For different sizes, we disturb 100 different trained networks in the above way with different σ_r. Then we continue training the networks. Then we assess the performance of the network by using the mean distance D averaged over 100 different networks. The results are shown in panel (a) of figure D1. From the results we find that the networks are generally stable to small disturbances of the weights, as we see only minute deviations for noise scales smaller than 0.05. When the noise scale is large but not sufficient to significantly reset the structure of connections, the network outputs deviate from the targets at first but very soon converge again to the targets.

Figure D1. Refer to the following caption and surrounding text. — **Figure D1.** Effects of weight noise. (a) Response to Gaussian noise added to the weights, for networks of different sizes when trained for XOR. The noise is suddenly added at some instant and afterwards training is continued. (b) Response to Gaussian noise for the case of the scaled-down MNIST. See text for comments.
Download figure:
Standard image High-resolution image

For the scaled-down-MNIST task, we perturb the network in the same way. The only difference is that we repeat this operation only 10 times. The results are shown in panel (b) of figure D1, from which we find that, assessed using the test error, the networks are still stable for small weight disturbances and tend to converge back to good results soon.

Appendix E: Estimation of the energy consumption of the laser array

To get insights into a realistic situation, we consider a laser array such as [19] and estimate its energy consumption when it is used for the neuromorphic computing tasks envisioned here. The power consumption of such a laser array will mainly be determined by two factors: (i) a sufficiently large pump power is needed to overcome the pump threshold; (ii) the array has to be operated sufficiently high above threshold, to limit the influence of phase noise.

First, we show that the energy contribution from (ii) is negligible compared to the pump power consumption. The phase evolution of two coupled lasers is determined by [83]

$\begin{align} \dot \phi_1 & = \frac{\kappa}{\tau_c} \sqrt{\frac{I_2}{I_1}} \sin\left(\phi_2-\phi_1\right) + \sqrt{\frac{\varepsilon_1}{I_1}} \eta_1^\phi\left(t\right)\end{align} \tag{ E.1 }$

$\begin{align} \dot \phi_2 & = \frac{\kappa}{\tau_c} \sqrt{\frac{I_1}{I_2}} \sin\left(\phi_1-\phi_2\right) + \sqrt{\frac{\varepsilon_2}{I_2}} \eta_2^\phi\left(t\right)\end{align} \tag{ E.2 }$

in which I_j are the respective laser intensities, $\kappa = \pm e^{-d^2/(4\sigma^2)}$ is a geometric factor stemming from the overlap integral of the electric fields between two Gaussian beams at distance d and radius σ, ɛ_j is the respective spontaneous emission rate and $\eta_j^\phi(t)$ describes Gaussian noise which will lead to phase diffusion. Hence, one has to ensure that the phase evolution due to the first time happens on faster time scales than the phase diffusion, i.e.

$\begin{equation} \frac{\tau_c}{\kappa}\frac{\varepsilon_j}{I_j} \ll 1\end{equation} \tag{ E.3 }$

(assuming equal intensities $I_1 = I_2$ ).

The rightmost terms in equation (E.1) represent the phase noise that results in the Schawlow-Townes limit of a laser, i.e. one can use them to calculate the Schawlow-Townes linewidth. Due to equation (E.1), on average, the complex electric fields will be of the form $E_j\propto \langle e^{\mathrm{i}\phi_j(t)} \rangle = \langle e^{\mathrm{i}{\bar \phi}_j(t) + \mathrm{i} \delta\phi_j(t)}\rangle$ . Specifically, for Gaussian noise, we obtain $E_j \propto e^{\mathrm{i}{\bar \phi}_j(t)} e^{-\frac{1}{2}\langle \delta\phi_j(t)\rangle^2} = e^{\mathrm{i}\bar \phi_j(t)} e^{-\frac{1}{2}\frac{\varepsilon_j}{I_j}t}$ , i.e. the phase noise induces a linewidth of $\varepsilon_j/I_j$ .

On the other hand, the Schawlow–Townes limit is:

$\begin{equation} \Delta\nu_\mathrm{L} = \frac{\omega_\mathrm{L}\gamma_\mathrm{cav}^2}{P_\mathrm{out}} = \frac{\varepsilon_j}{I_j}\end{equation} \tag{ E.4 }$

with $\gamma_\mathrm{cav}$ the decay rate of the laser resonator, $\omega_\mathrm{L}$ the laser frequency and $P_\mathrm{out}$ the output power of that laser mode. Together with equation (E.3), we find that the additional pump power required above threshold is

$\begin{equation} P_\mathrm{pump} \gt P_\mathrm{out} \gg \frac{\tau_c}{\kappa} \gamma_\mathrm{cav}^2 \hbar \omega_\mathrm{L} = \frac{\hbar\omega}{\mathcal{F}^2\tau_c \kappa}\end{equation} \tag{ E.5 }$

with $\mathcal{F}$ the resonator finesse. For the cavity considered in [19] $\tau_c\approx 10^{-9}\, \mathrm{s}$ , $\mathcal{F}\approx 60$ and κ ≈ 0.1 so that $P_\mathrm{pump} \gg 10^{-13} \,\mathrm{W}$ . Hence, in conclusion, to avoid phase noise affecting our neuromorphic system, we only require to go to pump powers which lie slightly above threshold and expect the main contribution to the power consumption to stem from the pump power to reach the lasing threshold.

Second, we estimate the pump power at the lasing threshold for the array discussed in [19]. The array in [19] is based on a Nd-YAG gain medium placed in a $160\,\mathrm{cm}$ long cavity. The individual lasers are created by placing a mask inside the cavity. The mask consists of holes of $200\,\mu \mathrm{m}$ diameter separated by $300\,\mu\mathrm{m}$ . The laser has a 90% reflectivity output coupler, i.e. the round-trip power loss is 10%.

The pump power consumption at laser threshold can be estimated as follows [84]

$\begin{equation} P_\mathrm{p,th} = \frac{h\nu_\mathrm{p} A l_\mathrm{rt}}{\eta_\mathrm{p}\tau_2 \sigma_\mathrm{em}}\end{equation} \tag{ E.6 }$

with $\nu_\mathrm{p}$ the pump wavelength, A the beam area in the laser crystal, $l_\mathrm{rt}$ the round-trip power loss, τ₂ the upper state lifetime and $\sigma_\mathrm{em}$ the emission cross-section. For Nd-YAG at $1064\,\mathrm{nm}$ [85], $\tau_2 = 230\,\mu\mathrm{s}$ , $\sigma_\mathrm{em} = 28 \cdot 10^{-20} \,\mathrm{cm}^{2}$ . Assuming, a pump efficiency of 80% and that pumping the gain medium yields a beam of around $10\,\mathrm{mm}$ diameter (the diameter of the gain medium rod is $10\,\mathrm{mm}$ ), we obtain $P_\mathrm{p,th} = 285\,\mathrm{W}$ for around 1500 coupled phase oscillators.

To estimate the energy consumption for performing neuromorphic computations with the phase oscillator array, we need to estimate the energy supplied by the pump for the duration of one equilibration step. According to the phase-coupling equations displayed above, the time for one such equilibration step is on the order of $\tau_c/\kappa$ , with the round trip time τ_c and a geometry factor κ which for the layout here is around 0.1. For the $160\,\mathrm{cm}$ laser cavity in [19], the round trip time is around $10\,\mathrm{ns}$ . If we assume the total time for the equilibration step to be on the $1\,\mu\mathrm{s}$ time scale, which should ensure equilibration, the energy consumption during this time is $E_\mathrm{pass} = 285\,\mu\mathrm{J}$ . Hence, the energy consumption for either one training pass or one inference is given by $E_\mathrm{pass}$ . This value can be brought down further. (i) The largest effect comes from the beam radius since it enters as the square in the beam area in equation (E.6). Hence, decreasing the beam area decreases $E_\mathrm{pass}$ . Considering the diffraction limit of the setup [19] (which is around $10\,\mu\mathrm{m}$ per oscillator [19]), the total beam diameter could be made as small as $400\,\mu\mathrm{m}$ which would bring down the energy consumption per pass to $E_\mathrm{pass} = 20\,\mathrm{nJ}$ . (ii) Furthermore, decreasing the cavity length would further reduce the energy consumption as the round-trip time would be smaller and equilibration correspondingly faster. E.g. a cavity of a length of $20\,\mathrm{cm}$ would further decrease the energy consumption by a factor of 8.

Now we compare this to the typical energy consumption for one pass of a digital machine learning model. A typical digital neural network that would be comparable in size to the phase oscillator network of 1500 physical neurons would have around $5\cdot 10^6$ parameters and require on the order of 10⁷ floating point operations for one pass (either during training or inference). State-of-the-art hardware such as the NVIDIA A100 can perform around $9.7\cdot 10^{12}$ floating point operations per second at a peak power consumption of $250\,\mathrm{W}$ . Hence, to perform 10⁷ floating point operations in the best case takes around $1\,\mu\mathrm{s}$ and the associated energy consumption per inference would be $250\,\mu\mathrm{J}$ . This is comparable to the unoptimized energy consumption of the laser array per pass. As we pointed out above, the energy consumption of an optimized implementations of the array [19] could lie a factor of $10\,000$ lower. Furthermore, we would like to point out another important advantage of neuromorphic hardware based on laser arrays: While digital hardware has to run continuously, even when no computations are performed leading to a high idle energy consumption, the neuromorphic system can be switched on for short durations of time only for the computation.