Abstract
Developing behavioral policies designed to efficiently solve target-search problems is a crucial issue both in nature and in the nanotechnology of the 21st century. Here, we characterize the target-search strategies of simple microswimmers in a homogeneous environment containing sparse targets of unknown positions. The microswimmers are capable of controlling their dynamics by switching between passive Brownian motion and active Brownian motion and by selecting the time duration of each of the two phases. The specific conduct of a single microswimmer depends on an internal decision-making process determined by a simple neural network associated with the agent itself. Starting from a population of individuals with random behavior, we exploit the genetic algorithm NeuroEvolution of Augmenting Topologies to show how an evolutionary pressure based on the target-search performances of single individuals helps to find the optimal duration of the two different phases. Our findings reveal that the optimal policy strongly depends on the magnitude of the particle's self-propulsion during the active phase and that a broad spectrum of network topology solutions exists, differing in the number of connections and hidden nodes.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Active matter and directed motion have come under the spotlight of several research communities such as biology, biomedicine, robotics, and statistical physics [1–6]. In nature, many micro-organisms are able to convert chemical energy into self-propulsion with the goal of exploring their environment, foraging nutrients, or running away from toxic substances [2, 3, 7]. Paradigmatic examples include the swimming behavior of bacteria such as Escherichia coli [8], phagocytes of the immune system performing chemotactic motion during injury or infection [9, 10], and sperm cells navigating against chemical gradients to find the egg [11]. At larger length-scales, animals constantly have to face challenges such as finding food, a mating partner, or shelter [12, 13], and generally solve these issues by adopting strategies involving smart motion. Artificial and biohybrid microswimmers [7, 14–16] capable of intelligent self-propulsion have potential for revolutionary applications ranging from active drug delivery [17–20] to assisted fertilization [21] and environmental remediation [22].
Notwithstanding the tremendous progress that has been achieved in this research field in the past decade, central problems such as those regarding optimal navigation and target-search strategies still have to be thoroughly addressed, already at the level of a single agent in a homogeneous environment. Nature has found solutions to these problems in hundreds of millions of years of evolution: many organisms display robust locomotion performances by adapting their locomotory gaits to the surroundings [23, 24], and several microswimmers sense environmental stimuli and exhibit various tactics to achieve effective navigation in biological fluids [7–11]. In a certain sense, even relatively simple microswimmers are smart with respect to the actions they have to perform to reach their biological goals.
To understand how evolution shaped navigation and search strategies, one can use reinforcement learning (RL) [25] and genetic algorithms [26, 27] to identify optimal and alternative strategies. Recently it has been demonstrated how agents trained with RL (possibly combined with genetic algorithms) are able to find advantageous swimming strategies in several situations such as in viscous solutions [28–30], simple energy landscapes [31], steady flows [32–34], turbulent fluids [35–38], and complex motility landscapes [39]. Notwithstanding their merits, in all these studies, either the goal of the particle is different from reaching a specific target or, if a target region has to be reached, its position is fixed and is thus implicitly learned during the learning process. On the other hand, a crucial question arising in the context of target search is which strategies are optimal to find sparse targets of unknown positions.
The first theoretical studies on search strategies date back to World War II, when the U.S. Navy tried to rationalize search procedures to efficiently hunt enemy submarines [40]. When looking for sparse small targets, depending on the searcher's abilities and on the space to be explored, different target-search strategies can be put in place. In the microscopic world, the agents often have only limited or no spatial memory, and search trajectories can be qualified as stochastic, meaning that some characteristics of the stochastic motion typical at this length scale are tuned to optimize the search time [13]. Among random strategies that can be used in a homogeneous environment, Lévy walks [12, 41, 42] and intermittent-search strategies [13, 43, 44] have been extensively studied in various contexts. In particular, intermittent-search strategies rely on the experimental observation that fast movement degrades perception. Thus, these strategies combine phases of diffusive motion allowing target detection and phases of ballistic motion with random orientation that allow moving quickly to a different space region but do not allow detecting the target. It has been shown that the mean search time of intermittent random walks can be minimized under broad conditions [45–47]. Recently, Muñoz-Gil et al [48] have proposed the use of RL techniques to study non-intermittent-search strategies and showed how these can outperform Lévy walks.
Here, for the first time, within the framework of intermittent-search strategies, we address the problem of finding a target of unknown position using a genetic-algorithm approach. This is elaborated for the simple case of a homogeneous environment and considering agents equipped with a simple artificial neural network (ANN) which selects, based on the agent state, an action among a set of possible actions. Specifically, the agent can switch its state between a passive and an active Brownian particle (BP) and, with intermittent-search strategies in mind, has at its disposal only a limited set of actions allowing it to switch between passive and active motion and to choose the duration of the new phase. In our framework, the agent behavior is then deterministically selected on the basis of its state. Our goal is to characterize and understand the behavioral policies which are optimal in solving the target-search problem and to show that these strategies can be obtained by means of genetic algorithms. A genetic algorithm is here preferred over typical RL methods because a target with a completely unknown position results in a very sparse reward signal for the latter methods. Given the pivotal role played by the reward function in RL, this is a non-trivial problem to face when adopting such an approach [25]. On the other hand, genetic algorithms are known to be less sensitive to issues related to the sparseness of the rewards because they evaluate the full behavior of an agent rather than trying to find the value of the state-action pairs.
2. Model
Our environment consists of a two-dimensional square box of size L × L with periodic boundary conditions and of a circular target of radius R placed randomly inside this box. Note that, due to the periodic boundary conditions, this environment is equivalent to an infinite domain with a lattice of targets. We consider an agent able to switch its state s between a passive BP and an active BP (ABP). Similarly to intermittent-search strategies, during the BP phase the agent is allowed to find the target while, in the ABP phase, it can more quickly relocate to a different region of the box but cannot sense the target. Every time the agent finds a target (i.e. the distance between its position and the center of the target is smaller than the target radius R), this is destroyed and a new target appears at a new random location inside the box. The equations of motion of the ABP model in a homogeneous environment are a set of Langevin equations that, once discretized according to the Itô rule, read

$$\mathbf{r}_{t+\Delta t} = \mathbf{r}_t + v\,\mathbf{e}(\varphi_t)\,\Delta t + \sqrt{2D\,\Delta t}\,\boldsymbol{\xi}_t, \qquad (1)$$
$$\varphi_{t+\Delta t} = \varphi_t + \sqrt{2D_\varphi\,\Delta t}\,\eta_t, \qquad (2)$$
where $\Delta t$ is the integration step, $\mathbf{r}_t$ is the position at time t, and $\mathbf{e}(\varphi_t) = (\cos\varphi_t, \sin\varphi_t)$ denotes the instantaneous orientation of the driving velocity with constant modulus v. D and $D_\varphi$ are the translational and rotational diffusion coefficients, respectively. Finally, the components of the vector noise $\boldsymbol{\xi}_t$ and the scalar noise $\eta_t$ are independent random variables, distributed according to a Gaussian with zero average and unit variance. The equations of motion of the standard BP model are readily recovered by setting v = 0, thereby decoupling the spatial evolution from the orientational diffusion of the self-propulsion. In the following, we fix the length unit as the size L of the square box and the time unit as the typical time τ required by a passive particle to diffuse over this distance. The magnitude of the activity and the persistence of motion in the ABP phase are measured by the dimensionless Péclet number and the dimensionless persistence, respectively.
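To make the discrete-time dynamics concrete, the following minimal Python sketch implements a single Itô (Euler–Maruyama) update of equations (1) and (2); the function name, parameter bundling, and example values are our own illustrative choices, not the authors' code.

```python
import numpy as np

def step(pos, phi, v, D, D_phi, dt, L, rng):
    """One Euler-Maruyama update of the ABP equations of motion (1) and (2).

    Setting v = 0 recovers the passive BP. Positions are wrapped into the
    periodic box of size L x L.
    """
    e = np.array([np.cos(phi), np.sin(phi)])             # orientation of the self-propulsion
    xi = rng.standard_normal(2)                           # translational Gaussian noise
    eta = rng.standard_normal()                           # rotational Gaussian noise
    pos = pos + v * e * dt + np.sqrt(2.0 * D * dt) * xi   # equation (1)
    phi = phi + np.sqrt(2.0 * D_phi * dt) * eta           # equation (2)
    return pos % L, phi

# Example usage with placeholder parameter values:
rng = np.random.default_rng(0)
pos, phi = step(np.array([0.5, 0.5]), 0.0, v=1.0, D=1.0, D_phi=1.0, dt=1e-4, L=1.0, rng=rng)
```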
With the intent of keeping the setup as simple as possible, we equip our particle with only a limited set of actions for each state. Each action is a tuple consisting of two parameters. The first parameter is a binary variable determining the next state of the particle: for one of its two values the next phase is identical to the previous one (BP→BP or ABP→ABP), while for the other the phase changes (BP→ABP or ABP→BP). The second parameter, $a_\tau$, specifies the time duration of the next phase; here we restrict the agent to select only among $N_\tau = 5$ different time durations $\tau_i$, thus implying a total of 10 possible actions to select given the state. When an active phase begins after a passive one, the direction $\varphi$ of the self-propulsion velocity is drawn from a uniform distribution in $[0, 2\pi)$, otherwise it is updated according to equation (2). In the following we fix the persistence and choose logarithmically spaced action durations, see figure 1(a). Thus, in the longest action time a BP covers a typical distance which, because of the periodic boundary conditions, is comparable to the maximal possible distance between the particle and the target. The results reported in this manuscript are obtained by following this choice. However, to check that our results are not particular to this choice, the supplemental material also reports results obtained by varying the number of allowed time durations $N_\tau$.
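Continuing the sketch above, the restricted action set and the simulation of a single phase (with target detection possible only in the passive state) could look as follows. The value N_TAU = 5 is inferred from the 10 possible actions quoted in the text, while the helper names and the parameter dictionary are illustrative assumptions.

```python
N_TAU = 5                                         # number of allowed phase durations (inferred)
ACTIONS = [(switch, i) for switch in (False, True) for i in range(N_TAU)]  # 2 x N_TAU = 10 actions

def run_phase(pos, phi, state, duration, target, params, rng):
    """Propagate one phase; only the passive (BP) phase can detect the target."""
    v = params["v"] if state == "ABP" else 0.0
    L, dt = params["L"], params["dt"]
    t = 0.0
    while t < duration:
        pos, phi = step(pos, phi, v, params["D"], params["D_phi"], dt, L, rng)
        t += dt
        if state == "BP":
            d = pos - target
            d -= L * np.round(d / L)              # minimum-image convention for the periodic box
            if np.hypot(d[0], d[1]) < params["R"]:
                return pos, phi, True             # target found during a passive phase
    return pos, phi, False
```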
To learn optimal policies solving the target-search problem, we exploit the evolutionary algorithm NeuroEvolution of Augmenting Topologies (NEAT) [49, 50]. To this end, we equip each agent with an ANN that takes as input the current state (BP or ABP) and returns as output an action chosen among the set of actions described above, see figure 1(c). Any ANN is characterized by a certain topology, internal parameters, and activation and response functions, see the Methods section for further details. Depending on its inner structure, a given ANN always returns the same action for a given input state. Starting from a population of randomly generated individuals (ANNs), the NEAT algorithm then iteratively creates new generations relying on biologically inspired operators such as mutation, crossover, and selection based on the fitness of each individual in the population. In very simple words, only the inner traits of the fittest individuals are transmitted from one generation to the next. Finally, the fitness of an individual is defined as the number of targets that it manages to detect within a fixed total time, see the Methods section for further details.
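As a rough illustration of how the fitness of a single individual could be evaluated, the sketch below (reusing run_phase from the Model sketch) counts the targets found within a fixed total search time. The policy interface, the full-duration time accounting, and the name T_total are simplifying assumptions on our part.

```python
def evaluate_fitness(policy, params, T_total, rng):
    """Count the targets found in a total time T_total by an agent following `policy`.

    `policy(state)` returns (switch, duration) for the next phase. For simplicity,
    each phase is charged its full duration even if the target is found before it ends.
    """
    L = params["L"]
    pos = rng.uniform(0.0, L, size=2)
    target = rng.uniform(0.0, L, size=2)
    phi, state, found, t = 0.0, "BP", 0, 0.0
    while t < T_total:
        switch, tau = policy(state)
        new_state = {"BP": "ABP", "ABP": "BP"}[state] if switch else state
        if state == "BP" and new_state == "ABP":
            phi = rng.uniform(0.0, 2.0 * np.pi)    # random reorientation when activity starts
        state = new_state
        pos, phi, hit = run_phase(pos, phi, state, tau, target, params, rng)
        if hit:
            found += 1
            target = rng.uniform(0.0, L, size=2)   # destroyed target reappears at random
        t += tau
    return found
```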
3. Results
We start by investigating how the adaptive particles evolve in the case where the size of the target is in between the typical distance explored in the passive phase in a time $\tau_5$ and the typical distance, $v\tau_5$, covered in the active phase in the same time, see figure 1(a). With this choice, independently of the phase duration time $\tau_i$, during the active phase the particle typically relocates to a distance larger than the target size. On the other hand, when the action duration $\tau_5$ is selected, a particle in the passive phase typically explores a region smaller than the target size. This situation, representing well the idea of an intermittent search, is here implemented by an appropriate choice of the target radius R and of the Péclet number.
The initial population contains randomly generated individuals which, depending on the particular action returned when in a given state, can be categorized into three different species: individuals that always behave as a passive particle (BP-like individuals), individuals that, once they enter the active state, never return to the passive one (ABP-like individuals), and, finally, individuals that switch periodically between the two phases with fixed switching times (switching individuals). The latter species can be further split into sub-species depending on the specific durations of the BP and ABP phases.
Since the NEAT algorithm dynamically produces new generations by selecting for reproduction those individuals having the largest fitness, for this large value of the Péclet number we expect the switching individuals to become the dominant species. Indeed, the fraction of switching individuals rises from the initial value of about 0.25 to about 0.9 already at the second generation and reaches a plateau at about 0.95 at the fifth generation, see figure 2(a). Further inspection reveals that the large majority of the switching individuals is represented by those that select $\tau_5$ as the time duration of both the passive and the active state, corresponding to short passive detection phases alternating with short active relocation phases, see the inset of figure 2(a).
We evaluate the learning performances also by inspecting the evolution of the time required by the agent to reach the target in subsequent generations and comparing it to the same quantity computed for three different benchmarking models: a fully passive particle; a particle that selects randomly among the possible actions (we will refer to this benchmark as the 'casual particle'); and an 'optimal particle' following the optimal strategy. This optimal strategy is obtained by individually checking all combinations of phase durations for a particle that switches to the active state whenever it is in the passive phase and to the passive state whenever it is in the active one.
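A brute-force determination of this 'optimal particle', as we read it, amounts to scanning all pairs of passive- and active-phase durations for an always-switching agent and keeping the best-performing one. The helper below is a sketch built on the evaluate_fitness function introduced earlier; the duration list taus and the number of averaging episodes are assumed inputs.

```python
def best_switching_policy(taus, params, T_total, n_episodes, rng):
    """Scan all N_tau x N_tau always-switching strategies and return the best pair."""
    best_pair, best_score = None, -np.inf
    for tau_bp in taus:                              # duration of the passive phases
        for tau_abp in taus:                         # duration of the active phases
            # Always switch; the selected duration is that of the phase being entered.
            policy = lambda s, a=tau_abp, b=tau_bp: (True, a if s == "BP" else b)
            score = np.mean([evaluate_fitness(policy, params, T_total, rng)
                             for _ in range(n_episodes)])
            if score > best_score:
                best_pair, best_score = (tau_bp, tau_abp), score
    return best_pair, best_score
```

The quadratic growth of this scan with the number of allowed durations is what makes the genetic algorithm attractive when that number is increased, as discussed later in the text.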
At the beginning of the learning process, for the switching individuals, the time required to find the target is comparable to that of the casual particle. More precisely, the average search time of the switching individuals is twice that of the casual particle, while the median is about a factor of 0.6 lower than the latter quantity and closer to the search time of a completely passive particle. This observation is due to the fact that the phase times $\tau_i$ are logarithmically spaced and that, in the initial population of switching individuals, some spend most of their time in the active phase, where it is not possible to find the target, while others spend most of their time in the passive phase, thus effectively performing as a BP. If the evolutionary process is successful, after repeated generations this quantity is expected to decrease and approach a value slightly larger than that of the optimal particle, with the small difference due to the fact that the NEAT algorithm maintains a certain level of exploration by creating new individuals through mutations and by preserving a certain number of unfit individuals for different evolution scenarios. This is indeed what is observed, see figure 2(b). Furthermore, since more and more switching individuals enter the most adapted sub-species, the spread of the search time also decreases during the learning process, with at least 75% of the switching individuals performing as well as the optimal particle after the third generation, see figure 2(b). However, the average stays above the 75th percentile over the whole evolutionary process because it is dominated by the performances of the worst switching individuals.
Additional insight into how the genetic algorithm encodes learning a successful policy can be obtained by focusing on the topology of the ANNs corresponding to the fittest individuals, i.e. those selecting $\tau_5$ as the duration of both the BP and ABP phases (inset of figure 2(a)). In the initial population, all the individuals have two input nodes and four output nodes with no hidden nodes in between, see the Methods section for details. However, as the evolution process progresses, new fit individuals with some hidden nodes emerge, see figure 3(a). The fractions of fit individuals with h hidden nodes are consistent with the expected values obtained by solving the master equation

$$n_h^{(g+1)} = n_h^{(g)} + p_{\mathrm{a}}\, n_{h-1}^{(g)} + p_{\mathrm{d}}\, n_{h+1}^{(g)} - \left(p_{\mathrm{a}} + p_{\mathrm{d}}\right) n_h^{(g)},$$

where $n_h^{(g)}$ indicates the number of individuals with h hidden nodes at the g-th generation and $p_{\mathrm{a}}$ and $p_{\mathrm{d}}$ are, respectively, the probabilities of adding and deleting a hidden node. This means that the topology of the initial ANN, with only two input nodes and four output nodes, is already complex enough to provide successful solutions to the target-search problem. In contrast, a similar approach adopted starting from a different initial network topology shows that, when having fewer output nodes, the fittest individuals are more likely to have a certain number of hidden nodes that contribute to selecting the optimal actions, see the supplemental material. This observation agrees with the intuitive expectation that a minimal ANN complexity is required to find successful strategies.
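For completeness, the expected fractions can be obtained by iterating the gain-loss master equation written above. The sketch below truncates the number of hidden nodes at h_max, and the probability values in the usage comment are placeholders rather than the values actually used.

```python
import numpy as np

def expected_hidden_node_fractions(n_gen, p_add, p_del, h_max=10):
    """Iterate the master equation for the fraction n_h of individuals with h hidden nodes."""
    n = np.zeros(h_max + 1)
    n[0] = 1.0                                     # initial population: no hidden nodes
    for _ in range(n_gen):
        new = n.copy()
        for h in range(h_max + 1):
            if h > 0:                              # node deletion only possible if h > 0
                new[h] += p_add * n[h - 1] - p_del * n[h]
            if h < h_max:                          # node addition (truncated at h_max)
                new[h] += p_del * n[h + 1] - p_add * n[h]
        n = new
    return n

# e.g. fractions after 10 generations with placeholder probabilities:
# expected_hidden_node_fractions(10, p_add=0.2, p_del=0.2)
```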
By looking in more detail at the internal structure of the ANNs associated with the fittest individuals, it appears that there is no preferred topology once the number of hidden nodes h is fixed. In fact, typical topologies of these networks show a highly varied range of active connections, and of their weights, among the different nodes, see figures 3(b) and S2 in the supplemental material. In these plots the edges connecting different vertices have a width proportional to the absolute value of the weight of the connection, see the Methods section for details. However, a complete understanding of the functioning of a particular ANN should also include the properties of the single nodes, such as their bias and their activation and aggregation functions, see the Methods section for details. Note also that, typically, emergent hidden nodes are connected to only one input and one output node.
An important question is how the target-search strategy depends on the activity of the particle. To address this issue, we carry out a similar analysis for different Péclet numbers. For large Péclet numbers we do not expect major differences from the previously considered case. In fact, at high Péclet numbers the self-propulsion velocity is large enough to allow relocation to a distance larger than the target size even in the shortest time $\tau_5$. Then, intuitively, performing a motion alternating between short BP phases and short ABP phases is the most promising strategy. In contrast, for very low Péclet numbers, if the action time is smaller than the time unit τ, the typical distance travelled by simple diffusion is always greater than the distance covered by self-propulsion, see figure 1. Then, selecting active phases becomes superfluous and the optimal strategy is simply to keep the particle in a BP-like phase. These expectations are confirmed by checking, after about 10 generations, the fraction of individuals in different sub-species and by comparing the overall performance of the population to those of the passive and of the optimal particle for different Péclet numbers. This is done in figure 4(a), reporting the average search times, and in figure 4(b), showing the relative distribution of individuals across sub-species. In the latter panel, the phase durations associated with the optimal particle are also highlighted with a black frame in the table.
More interesting is the behavior at intermediate activity. In this range, the optimal particle outperforms the simple passive particle, and the optimal actions correspond to a behavioral policy that again displays the shortest passive phase (i.e. BP-like phases with duration $\tau_5$) but, to allow for significant relocation, selects a longer duration of the ABP phase, namely $\tau_2$ for the considered cases, see figure 4(b). Consistently, for intermediate Péclet numbers such as 70, the population produced by NEAT at the 9th generation shows a majority of individuals belonging to the sub-species selecting the same actions as the optimal particle. However, for these values of the activity, the genetic algorithm still considers individuals of other sub-species, especially those with a short passive phase, as possible candidates for the optimal solution. This results in the larger spread of the search times reported in figure 4(a). Finally, at the lowest of the considered intermediate activities, the optimal solution corresponds to an optimal particle selecting $\tau_3$ as the duration of both the passive and the active phase, but in this case the NEAT algorithm develops a population with a majority of BP-like individuals. The reason for this small inconsistency is likely that, in this case, the average search time of the passive particle is very close to that of the optimal particle and, by chance, the first evolutionary lineage is preferred.
A final remark is in order: all results reported so far are obtained by strongly restricting the number of possible ABP and BP phase durations. In particular, we allowed only $N_\tau = 5$ different durations spanning a large time range. By doing so, the computational time required to run the NEAT algorithm is comparable to that of directly evaluating the performances of the completely passive BP and of the 25 possible switching individuals. However, any integer multiple of the integration time step could serve as a possible phase duration and one may consequently increase $N_\tau$, thus letting the ANNs select phase durations from a more fine-grained palette. Indeed, increasing $N_\tau$ is desirable to explore a larger set of switching individuals, thus allowing one to find even better phase durations that increase the target-search performances. However, by doing so, a brute-force check of all possible switching individuals quickly becomes inefficient, with computational costs increasing as $N_\tau^2$, in favor of our choice of using the NEAT algorithm, whose computational costs are, in contrast, unaltered. The supplemental material reports the equivalent of figures 2 and 4 obtained for larger values of $N_\tau$, including 20 and 50. The learning performances obtained by the NEAT algorithm in these cases are comparable to the case study reported in the main text and, for large Péclet numbers, the average time to reach the target even slightly outperforms that of the fittest individual selected among the set of 25 switching individuals used in this section, see figures S6 and S8 in the supplemental material. However, the fluctuations of the average time to reach the target increase with $N_\tau$. Furthermore, with an increasing number of allowed phase durations, more actions become valid candidates to be selected as the optimal action by the genetic algorithm.
4. Conclusions
Our findings demonstrate that genetic algorithms are a powerful tool to address the problem of finding targets of unknown positions for particles able to switch their behavior between that of a simple passive BP and that of an ABP. In particular, we equipped the particle with a neural network receiving the current state of the particle itself as the only input and returning as output a decision regarding whether and when the agent should switch its phase. We then showed that the NEAT algorithm is able to evolve an initial population of neural networks taking random decisions towards a population in which the majority of individuals are optimized to solve the target-search problem.
In principle, similar results on the target-search performances of intermittent passive-active BPs could be obtained by resorting to RL algorithms [25]. However, in the RL framework, a target having a different, unknown position at the beginning of each target-search episode results in a very sparse reward function, which is generally a non-trivial problem in RL [25]. More specifically, since in our case the state of the agent is a simple binary variable and the rewards are extremely sparse, the reward signal has only a very low correlation with the particular state-action pair encountered when the target is found, making typical action-value methods such as Q-learning or SARSA [25] fail in learning successful strategies. If one wishes to follow an RL approach, algorithms taking into account long sequences of visited state-action pairs should then be preferred because these methods, similarly to genetic algorithms, seek to maximize the performance by directly evaluating the outcome of a given policy. Possible algorithms of this kind include policy-gradient and actor-critic methods [25] and the projective simulation algorithm [51]. These considerations do not exclude that successful results could be obtained by using more elaborate versions of the above-mentioned action-value methods and/or by defining the states and the actions differently, as proposed in reference [48] for non-intermittent searchers. On the other hand, genetic algorithms bypass the problem of reward sparseness by assigning the agents a fitness that depends on their overall performance in accomplishing a task, and are thus particularly suited to our case.
In the current setup, the output of any given individual ANN is fixed once the input is given, meaning that, given its current state, a particle always chooses the same action deterministically. This is different from the typical intermittent-search strategies discussed in reference [13], where an agent draws the phase durations from a certain distribution. However, similarly to the most general case [13, 45], our results also show that there is an optimal duration of the active relocation phase which depends mainly on the magnitude of the activity and that, for very low activity, having an active phase no longer improves the target-finding efficiency. A natural step towards a standard intermittent-search model would be to adapt our algorithm in such a way that, instead of learning the optimal phase durations, it optimizes the parameters of a certain distribution of times. Another possible extension of our work would be to link the output of the ANNs to a transition probability rather than to a deterministic action, with different individuals thus corresponding to different transition matrices.
Our paper provides a first attempt to use machine-learning methods to investigate the problem of finding targets of unknown positions in a simple homogeneous environment and paves the way to further research on this important problem. Having a minimal model with only two distinct phases is a choice that serves as a proof of concept that genetic algorithms are powerful tools to investigate target-search strategies in stochastic systems. However, in nature, searchers may have multiple dynamic modes. For example, dendritic cells searching for infections combine three distinct migration modes in their motion [52], and some DNA-binding proteins also have more than two dynamic states during the search [53]. This provides a solid biological motivation for a first generalization of our work to a case in which three or more distinct phases are considered as possible states of the agent. Other topics worth future investigation include problems with multiple and/or motile targets [13], target search with resetting events [54–56], and extensions to more realistic scenarios involving the presence of boundaries, obstacles, and energy barriers [57–60]. To pursue these goals, both the state and the actions of the agent may be made arbitrarily complex: for example, the agent could gather sensorimotor cues from the environment and, based on them, modify its behavior by controlling some motility parameters. Alternatively, similarly to what has been done recently in a different context [48], the agent can be equipped with the ability to sense the duration of its current phase. Finally, agents having a limited memory of the visited locations can also be investigated.
5. Methods
To investigate how an evolutionary pressure allows the adaptive particles to develop successful target-search strategies, we resort to the genetic algorithm NEAT [49, 50, 61]. To do so, a simple ANN is associated with our adaptive particle. The role of this network is to take as input the state of the particle (BP or ABP) and return as output the action to be performed, as described in the Model section. Starting from a population of $10^3$ individuals, the NEAT algorithm then iteratively creates new generations relying on biologically inspired operations such as mutation, crossover, and selection based on the fitness of each individual in the population. This fitness is defined as the number of targets that the individual manages to detect within a fixed total time. The evolution process is based on the principle of complexification of existing networks [49]: not only are the node biases and the edge weights adjusted to optimize the individuals' fitness, but this goal is also pursued by changing the topology of the network, i.e. by adding or deleting nodes and by enabling or disabling connections, see the text below. The number of individuals in each generation is kept fixed.
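In practice, the evolution loop can be driven by an off-the-shelf NEAT implementation. The sketch below assumes the neat-python package and a standard configuration file (which would set the population size, node-mutation probabilities, and the activation function); PARAMS and T_TOTAL stand for the model parameters and total search time introduced earlier, and the state-encoding and action-decoding helpers are spelled out after the next paragraph.

```python
import numpy as np
import neat  # neat-python package

def eval_genomes(genomes, config):
    """Assign to each genome the number of targets its network finds in a fixed time."""
    rng = np.random.default_rng()
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        policy = lambda state: decode_action(net.activate(encode_state(state)), state)
        genome.fitness = float(evaluate_fitness(policy, PARAMS, T_TOTAL, rng))

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "neat_config.ini")            # hypothetical configuration file
population = neat.Population(config)
winner = population.run(eval_genomes, 10)          # evolve the population for 10 generations
```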
More specifically, we construct an initial population of networks having two input nodes and four output nodes, see figure 1(c). These are the nodes through which the ANN interacts with the external environment and they cannot be created or destroyed. The two input nodes recognize the state s of the particle and pass to the output nodes the signals $\chi_{\mathrm{BP}}(s)$ and $\chi_{\mathrm{ABP}}(s)$, each further multiplied by the weight of the connection between the specific pair of input and output nodes. Each signal is a simple characteristic function, $\chi_S(s) = 1$ if s = S and $\chi_S(s) = 0$ otherwise. The four output nodes aggregate by summation the signals coming from the various input nodes, add a bias, and return an output value according to a certain activation function. Formally, the output value of node j is given by $x_j = f\big(\sum_k w_{kj} x_k + b_j\big)$, where $x_k$ is the signal coming from node k, $w_{kj}$ is the weight of the connection between k and j, $b_j$ is the bias of node j, and f is the activation function, in our case a modified clamped function with $f(y) = 0$ if $y \leqslant 0$, $f(y) = 1$ if $y \geqslant 1$, and $f(y) = y$ otherwise. Each output node j thus returns a real value $x_j \in [0, 1]$, and the four output values together determine the action taken by the agent as follows. If the current state of the agent is BP (ABP), the first (third) node determines the next state, which is again BP (ABP) if the node output value is smaller than 0.5 and changes to ABP (BP) otherwise. The two options correspond to the two values of the binary action parameter introduced in the Model section. The duration of the next phase, $a_\tau$, is instead determined by the second (fourth) output node, which selects the phase duration $\tau_i$ with i given by the integer part of $N_\tau x_j$, where $x_j$ is the output value of the node and $N_\tau$ is the number of allowed phase durations, see the Model section.
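The input encoding and the output-to-action mapping just described could be implemented as follows; the clamping range, the placeholder duration values, and the index convention reflect our reading of the text rather than the authors' code.

```python
import numpy as np

N_TAU = 5
TAUS = np.logspace(-2, 0, N_TAU)                  # placeholder logarithmically spaced durations

def encode_state(state):
    """Characteristic-function inputs: one input node per state."""
    return [1.0, 0.0] if state == "BP" else [0.0, 1.0]

def decode_action(outputs, state, taus=TAUS):
    """Turn the four output values into an action (switch, duration of the next phase)."""
    y = np.clip(outputs, 0.0, 1.0)                # modified clamped activation range [0, 1]
    switch_node, duration_node = (0, 1) if state == "BP" else (2, 3)
    switch = y[switch_node] >= 0.5                # below 0.5: keep the phase; otherwise: switch
    i = min(int(len(taus) * y[duration_node]), len(taus) - 1)  # integer part selects the duration
    return bool(switch), taus[i]
```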
The individuals in the initial population are generated randomly (i.e. with random biases and random connection weights) but all share the topology just described. However, in subsequent generations individuals with new hidden nodes emerge. These nodes receive the signals coming from the input nodes and possibly from other hidden nodes and, in the same fashion as described for the output nodes, return an output value which is collected as an input signal by the output nodes and possibly by other hidden nodes. During mutations, hidden nodes are generated with probability $p_{\mathrm{a}}$ and deleted with probability $p_{\mathrm{d}}$, whose values we set following standard practice [50, 61].
The initial network topology with two input nodes and four output nodes works particularly well in our target-search problem. However, while in the main text we report only results obtained by starting with this initial setup, we also tested other topologies as well as different numbers of phase durations, see the supplemental material.
Acknowledgments
H K acknowledges funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 847476; M C is supported by FWF: P 35872-N; T F acknowledges funding by FWF: P 35580-N.
Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).
Author contributions statement
M C and T F conceived the research, M C developed the software, and M C and H K analyzed the results. All authors wrote and reviewed the manuscript.
Supplementary data (1.3 MB PDF)